Data Cascades

Definition

Data cascades are compounding events that cause negative downstream effects from data issues, resulting in technical debt over time. Small data problems early in the ML lifecycle propagate and amplify into major system failures.

The Cascade Pattern

Trigger: A seemingly minor data issue (labeling inconsistency, collection bias, missing documentation)
Propagation: Issue flows downstream through data pipelines, model training, evaluation
Amplification: Compound effects as multiple cascades interact
Manifestation: Model failures, costly rework, harm to users, often months or years later

Four Cascade Triggers (Sambasivan et al. 2021)

Physical World Brittleness: Models trained on clean data fail on noisy real-world inputs
Inadequate Domain Expertise: ML teams lack understanding of data’s meaning and context
Conflicting Incentives: Organizations reward model work over data work
Poor Documentation: Lack of metadata prevents understanding data limitations

Prevalence

Among the AI practitioners interviewed by Sambasivan et al. (2021), 92% experienced at least one data cascade, and 45.3% experienced two or more in a single project. Cascades are pervasive, invisible, and delayed, but largely avoidable.

Prevention

Early investment in data quality
Close collaboration with domain experts
Documentation as first-class deliverable
Feedback loops from deployment to data collection
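
As a rough illustration of what early investment in data quality can look like, here is a minimal sketch of two checks run before any training: one flags inputs that carry conflicting labels (a labeling inconsistency), the other flags records with missing metadata (a documentation gap). The record layout, the field names (`text`, `label`, `source`), and both function names are assumptions made for this example. They do not come from Sambasivan et al.

```python
from collections import defaultdict

# Hypothetical record layout: each record is a dict with "text", "label", "source".
REQUIRED_FIELDS = ("text", "label", "source")


def find_label_conflicts(records):
    """Return a dict mapping each input text to its labels when it has more than one."""
    labels_by_text = defaultdict(set)
    for rec in records:
        text, label = rec.get("text"), rec.get("label")
        if text is None or label is None:
            continue  # incomplete records are reported by find_missing_fields
        labels_by_text[text].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


def find_missing_fields(records, required=REQUIRED_FIELDS):
    """Return (record index, missing field names) for records lacking required fields."""
    problems = []
    for i, rec in enumerate(records):
        missing = [f for f in required if rec.get(f) in (None, "")]
        if missing:
            problems.append((i, missing))
    return problems


# Made-up sample data, for illustration only.
records = [
    {"text": "chest pain", "label": "urgent", "source": "clinic_a"},
    {"text": "chest pain", "label": "routine", "source": "clinic_b"},
    {"text": "mild cough", "label": "routine", "source": ""},
]

conflicts = find_label_conflicts(records)
missing = find_missing_fields(records)
print(conflicts)
print(missing)
```

Checks like these are cheap to run at collection time. Any conflict they surface is best resolved with a domain expert rather than by majority vote, since the disagreement may reflect real context the ML team does not have.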
