Data Quality Dimensions Consensus Gap
Despite decades of research, there’s no universal agreement on what data quality dimensions exist or how to measure them. Different frameworks emphasize different aspects.
Common Dimensions (with variations)
- Accuracy: Does data reflect reality? (but whose reality?)
- Completeness: Are all required values present? (but required for what?)
- Consistency: Do related values agree? (across what scope?)
- Timeliness: Is data current enough? (for which decisions?)
- Validity: Does data conform to rules? (whose rules?)
- Uniqueness: Are duplicates eliminated? (what counts as duplicate?)
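To make the "required for what?" question concrete, here is a minimal Python sketch of a completeness score. The function name `completeness`, the sample records, and the two field lists are illustrative assumptions, not part of any particular framework. The same data scores differently depending on which fields a use case treats as required.

```python
from typing import Iterable, Mapping, Optional

Record = Mapping[str, Optional[str]]


def completeness(records: Iterable[Record], required: Iterable[str]) -> float:
    """Fraction of required fields that are present and non-empty."""
    required = list(required)
    records = list(records)
    total = len(records) * len(required)
    if total == 0:
        return 1.0  # assumption: nothing required counts as fully complete
    filled = sum(
        1
        for record in records
        for field in required
        if record.get(field) not in (None, "")
    )
    return filled / total


# Hypothetical customer records, for illustration only.
customers = [
    {"name": "Ada", "email": "ada@example.com", "phone": None},
    {"name": "Lin", "email": "", "phone": "555-0100"},
]

# One use case only needs a name; another needs every contact channel.
name_only_score = completeness(customers, ["name"])
all_contact_score = completeness(customers, ["name", "email", "phone"])
```

Neither score is the correct one. Each answers a different question about the same records.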
Why Consensus Is Elusive
- Data quality is context-dependent (quality for what purpose?)
- Dimensions overlap and interact
- Measurement requires operationalization, which introduces choices
- Different domains have different priorities
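The point about operationalization can be sketched with uniqueness. In the hypothetical example below, `count_duplicates` and the sample rows are assumptions made for illustration. The duplicate count changes with the key chosen to define "the same record", and that choice is made by the person measuring, not by the data.

```python
from typing import Callable, Hashable, Iterable, Mapping


def count_duplicates(
    rows: Iterable[Mapping[str, str]],
    key: Callable[[Mapping[str, str]], Hashable],
) -> int:
    """Count rows whose key has already been seen earlier in the input."""
    seen = set()
    duplicates = 0
    for row in rows:
        k = key(row)
        if k in seen:
            duplicates += 1
        else:
            seen.add(k)
    return duplicates


# Hypothetical rows, for illustration only.
rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
]

# Choice 1: a duplicate is an exact match on both fields.
exact = count_duplicates(rows, key=lambda r: (r["name"], r["email"]))

# Choice 2: a duplicate is any row sharing a case-insensitive email.
by_email = count_duplicates(rows, key=lambda r: r["email"].strip().lower())
```

Under the first definition the data has one duplicate; under the second it has two.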
Practical Implication
Don’t search for the “right” framework. Define quality dimensions based on your use cases and stakeholders. Be explicit about what you’re measuring and why.
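One way to stay explicit is to record each check together with its purpose and threshold. The sketch below is an assumption about how that might look in Python; `QualityCheck`, `share_with_email`, and the 0.95 threshold are hypothetical names and values, not a standard.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Row = dict[str, Optional[str]]


@dataclass(frozen=True)
class QualityCheck:
    """A quality dimension tied to a stated purpose and a pass threshold."""

    dimension: str
    purpose: str
    measure: Callable[[Sequence[Row]], float]
    threshold: float

    def evaluate(self, rows: Sequence[Row]) -> dict:
        score = self.measure(rows)
        return {
            "dimension": self.dimension,
            "purpose": self.purpose,
            "score": score,
            "passed": score >= self.threshold,
        }


def share_with_email(rows: Sequence[Row]) -> float:
    """Share of rows with a non-empty email value."""
    if not rows:
        return 1.0  # assumption: an empty input passes vacuously
    return sum(1 for r in rows if r.get("email")) / len(rows)


# Hypothetical check: the purpose is written down next to the measurement.
check = QualityCheck(
    dimension="completeness",
    purpose="monthly invoices are sent by email",
    measure=share_with_email,
    threshold=0.95,
)
report = check.evaluate([{"email": "a@example.com"}, {"email": None}])
```

Keeping the purpose in the same object as the measure makes it harder to reuse a score for a decision it was never designed to support.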
AI-Specific Dimensions
Training data may need additional dimensions:
- Representativeness: Coverage of target distribution
- Label Quality: Annotation accuracy and consistency
- Provenance: Source and transformation history
- Bias Indicators: Demographic and temporal balance
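Representativeness can be operationalized in many ways. As one hedged example, the sketch below uses total variation distance between the category shares in a training sample and an assumed target distribution. The metric choice, the function name, and the region data are all illustrative assumptions, not something the frameworks above prescribe.

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping


def total_variation_distance(
    samples: Iterable[Hashable],
    target_shares: Mapping[Hashable, float],
) -> float:
    """Half the L1 distance between observed and target category shares.

    Returns 0.0 for a perfect match and 1.0 for completely disjoint coverage.
    """
    counts = Counter(samples)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("samples must not be empty")
    categories = set(counts) | set(target_shares)
    return 0.5 * sum(
        abs(counts.get(c, 0) / n - target_shares.get(c, 0.0))
        for c in categories
    )


# Hypothetical training sample and target population shares.
train_regions = ["north"] * 6 + ["south"] * 3 + ["east"] * 1
target = {"north": 0.4, "south": 0.4, "east": 0.2}

gap = total_variation_distance(train_regions, target)
```

The result depends on how the target distribution is defined, which is itself a context-dependent choice of the kind this note describes.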
Related: 04-atom—data-governance, 00-source—sambasivan-2021-data-cascades