Data Excellence
The Principle
Data quality in AI requires proactive focus on the practices, politics, and values of humans in the data pipeline, not just reactive technical fixes.
Why This Matters
Current approaches to data quality are reactive: clean the data after it’s collected, detect anomalies in the pipeline, monitor for drift in production. These are necessary but insufficient. They address symptoms while leaving root causes intact.
The root causes are structural: incentive systems that reward model work over data work, training that ignores data creation, and organizational relationships that treat data collection as outsourced operations rather than strategic partnerships.
Data excellence means treating care, sanctity, and diligence in data as valuable contributions to the AI ecosystem, as valuable as novel model architectures.
How to Apply
Shift incentives. Reward data work in promotions, peer reviews, and publication decisions. Make dataset creation and pipeline maintenance as career-advancing as model innovation.
Change training. Teach data collection design, documentation practices, and collaboration with domain experts alongside model building. Use messy real-world data, not just clean benchmarks.
Invest in partnerships. Treat domain experts and data collectors as collaborators throughout the development process, not just consultants at the start or troubleshooters at the end.
Build early feedback loops. Teams with the fewest cascades maintain tight feedback throughout development, running models frequently, working closely with domain experts, monitoring incoming data continuously.
When This Especially Matters
High-stakes domains amplify the consequences of data cascades: healthcare predictions, credit decisions, public safety systems. But the principle applies broadly. Any AI system where the gap between training and deployment matters, which is most of them, benefits from treating data as a first-class concern.
The Hard Part
Individual practitioners can adopt data excellence practices. But sustainable change requires structural shifts that individuals can’t make alone: conference norms, organizational incentives, educational curricula, and industry standards.
The challenge isn’t awareness. It’s that the field’s current reward systems actively discourage the care that data quality requires.
Related: 04-molecule—data-cascades, 05-atom—model-valorization, 04-molecule—four-cascade-triggers