Data Cascades

Definition

Compounding events causing negative downstream effects from data issues, resulting in technical debt over time. Small data problems early in the ML lifecycle propagate and amplify into major system failures.

The Cascade Pattern

  1. Trigger: A seemingly minor data issue (labeling inconsistency, collection bias, missing documentation)
  2. Propagation: Issue flows downstream through data pipelines, model training, evaluation
  3. Amplification: Compound effects as multiple cascades interact
  4. Manifestation: Model failures, costly rework, harm to users, often months or years later

Four Cascade Triggers (Sambasivan et al. 2021)

  1. Physical World Brittleness: Models trained on clean data fail on noisy real-world inputs
  2. Inadequate Domain Expertise: ML teams lack understanding of data’s meaning and context
  3. Conflicting Incentives: Organizations reward model work over data work
  4. Poor Documentation: Lack of metadata prevents understanding data limitations

Prevalence

92% of AI practitioners experienced at least one data cascade. 45.3% experienced two or more per project. Cascades are pervasive, invisible, delayed, but largely avoidable.

Prevention

  • Early investment in data quality
  • Close collaboration with domain experts
  • Documentation as first-class deliverable
  • Feedback loops from deployment to data collection

Related: 00-source—sambasivan-2021-data-cascades, 04-atom—data-governance, 04-molecule—ooda-data-governance