Data Quality Dimensions Apply Differently to AI Training Data
Traditional data quality frameworks emerged from database and business intelligence contexts. When applied to AI training data, the dimensions take on different weight and meaning.
Dimensions that become more critical:
- Societal relationships: Bias, provenance, and diversity are marginal concerns in transaction processing but become central for AI. Because training data shapes model behavior, societal-level quality failures propagate into deployed systems.
- Representativeness: An AI-specific variant of completeness that asks whether the data covers the input distribution the model will encounter.
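One way to make representativeness measurable is to compare category shares in the training set against a sample of the inputs the model is expected to see. The sketch below is illustrative only: the function names, the choice of total variation distance as the metric, and the language example are assumptions, not something this note prescribes.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total as a dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(train_values, expected_values):
    """Total variation distance between two categorical samples.

    0.0 means identical category shares; 1.0 means no overlap at all.
    """
    train = category_shares(train_values)
    expected = category_shares(expected_values)
    categories = set(train) | set(expected)
    return 0.5 * sum(
        abs(train.get(c, 0.0) - expected.get(c, 0.0)) for c in categories
    )


# Hypothetical data: the language of each record in the training set versus
# a sample of the traffic the deployed model is expected to handle.
train_langs = ["en"] * 90 + ["es"] * 10
expected_langs = ["en"] * 60 + ["es"] * 30 + ["fr"] * 10

gap = total_variation_distance(train_langs, expected_langs)
print(f"distribution gap: {gap:.2f}")
```

A larger gap suggests the training data under-covers part of the expected input distribution. What counts as too large depends on the task and would need to be set per project.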
Dimensions that change meaning:
- Accuracy: In traditional contexts, accuracy means that recorded values match reality. For AI training data, accuracy also includes label quality, that is, how reliably human annotators applied labels.
- Completeness: Beyond missing fields, completeness for AI involves coverage of edge cases and minority classes.
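Both redefinitions can be checked with simple counts. The sketch below assumes two annotators labeled the same items and uses Cohen's kappa as one possible measure of label reliability, then flags classes that fall below a minimum share of the data. The function names, the sample labels, and the 25% threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def underrepresented_classes(labels, min_share):
    """Return the classes whose share of the labels is below min_share."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(c for c, k in counts.items() if k / total < min_share)


# Hypothetical labels from two annotators for the same ten items.
annotator_a = ["spam", "spam"] + ["ham"] * 8
annotator_b = ["spam"] + ["ham"] * 9

kappa = cohens_kappa(annotator_a, annotator_b)
rare = underrepresented_classes(annotator_a, min_share=0.25)
print(f"kappa: {kappa:.2f}, underrepresented classes: {rare}")
```

Raw agreement here is high because most items are "ham", which is why a chance-corrected measure is used instead of a simple match rate.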
Dimensions that matter less:
- Timeliness: Critical for analytics dashboards, but many AI models are trained on historical data where currency matters less than distribution coverage.
The implication: AI data quality requires its own framework, informed by but not identical to traditional data quality dimensions.
Related: 04-molecule—dq-contextual-relationships, 04-atom—five-core-dq-dimensions