The Data Cascades Practitioner Guide

Why 92% of AI Practitioners Encounter Compounding Data Problems - And How to Prevent Them


Google researchers interviewed 53 AI practitioners building systems for healthcare, conservation, and financial services. The finding that should concern every data leader: 92% experienced data cascades - compounding problems that originated in data but manifested as system failures weeks or months later.

The title of their paper says it all: “Everyone wants to do the model work, not the data work.”

This guide translates their research into actionable prevention strategies.

What Data Cascades Are

A data cascade is not a single data quality issue. It’s a compounding pattern:

  1. Origin: A seemingly minor data problem (inconsistent labeling, undocumented assumptions, incomplete collection)
  2. Propagation: The problem flows downstream through pipelines, training, evaluation
  3. Amplification: Effects compound as multiple issues interact
  4. Manifestation: System failures, costly rework, or harm to users - often far removed from the original cause

The delay between cause and effect makes cascades particularly dangerous. By the time the problem is visible, it has propagated through multiple systems. Fixing it requires tracing back through layers of dependencies.

Four Root Causes

The research identified four primary triggers:

1. Physical World Brittleness

Models trained on clean, curated data encounter noisy real-world inputs. The gap between training conditions and deployment conditions creates cascades.

Example: An eye disease detection model trained on laboratory-quality images fails when deployed with images containing dust specks, poor lighting, or patient movement.

Prevention: Include realistic noise and variation in training data. Test explicitly on degraded inputs. Design for graceful degradation when input quality drops.
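
A minimal sketch of this kind of stress test, assuming a classifier with a scikit-learn-style predict method and test images as NumPy arrays scaled to [0, 1]; the noise levels and darkening factor are illustrative choices, not values from the research:

```python
import numpy as np

def degrade(images, noise_std=0.1, darken=0.6, rng=None):
    """Simulate field conditions: sensor noise plus poor lighting."""
    rng = rng or np.random.default_rng(0)
    noisy = images + rng.normal(0.0, noise_std, images.shape)
    return np.clip(noisy * darken, 0.0, 1.0)

def stress_test(model, images, labels):
    """Compare accuracy on clean inputs against progressively degraded ones."""
    results = {"clean": float((model.predict(images) == labels).mean())}
    for noise_std in (0.05, 0.1, 0.2):
        degraded = degrade(images, noise_std=noise_std)
        results[f"noise_std={noise_std}"] = float((model.predict(degraded) == labels).mean())
    return results

# A large gap between the clean score and the degraded scores is an early
# warning that the model is brittle to real-world input quality.
```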

2. Inadequate Domain Expertise

ML teams without deep domain knowledge make decisions that seem technically reasonable but fail to reflect how the data relates to the real world.

Example: A model predicting crop yields uses satellite imagery without understanding seasonal patterns, soil variation, or farming practices that affect interpretation.

Prevention: Embed domain experts in data teams. Create feedback loops from deployment back to data definition. Document domain assumptions explicitly.
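
One way to keep domain assumptions from remaining tribal knowledge is to encode them as executable checks. A sketch using pandas, with a hypothetical growing-season table and column names (capture_date, region) that a real project would get from its embedded domain experts:

```python
import pandas as pd

# Hypothetical growing-season windows; real values would come from agronomists.
GROWING_SEASON = {"region_a": ("03-01", "09-30"), "region_b": ("10-01", "04-30")}

def check_in_season(df: pd.DataFrame) -> pd.Series:
    """Flag satellite captures taken outside the documented growing season."""
    def in_season(row):
        start, end = GROWING_SEASON[row["region"]]
        month_day = row["capture_date"].strftime("%m-%d")
        if start <= end:
            return start <= month_day <= end
        return month_day >= start or month_day <= end  # season spans year end

    return df.apply(in_season, axis=1)

df = pd.DataFrame({
    "region": ["region_a", "region_a", "region_b"],
    "capture_date": pd.to_datetime(["2023-05-10", "2023-12-01", "2023-11-15"]),
})
violations = df[~check_in_season(df)]
print(f"{len(violations)} capture(s) fall outside the documented growing season")
```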

3. Conflicting Incentives

Organizations reward model work (papers, demos, metrics) while treating data work as low-status maintenance. This creates systematic underinvestment in data quality.

Example: Researchers clean data just enough for publication benchmarks, leaving undocumented issues for production deployment.

Prevention: Make data quality a first-class deliverable. Reward data work equivalently to model work. Track data quality metrics alongside model performance.
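
A sketch of what treating data quality as a first-class deliverable can look like in code: an experiment report that refuses to record a model score without accompanying data metrics. The class and field names are illustrative, not an established API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExperimentReport:
    """A model result is only reportable together with its data quality metrics."""
    model_name: str
    model_metrics: dict   # e.g. {"auc": 0.91}
    data_metrics: dict    # e.g. {"label_agreement": 0.83, "missing_rate": 0.02}
    recorded_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self):
        required = {"label_agreement", "missing_rate"}
        missing = required - self.data_metrics.keys()
        if missing:
            raise ValueError(f"Report rejected: missing data metrics {missing}")

report = ExperimentReport(
    model_name="yield-predictor-v3",
    model_metrics={"auc": 0.91},
    data_metrics={"label_agreement": 0.83, "missing_rate": 0.02},
)
```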

4. Poor Documentation

When data context isn’t documented, downstream users make wrong assumptions. Those assumptions propagate into models, which then encode the errors.

Example: A dataset labeled by contractors under time pressure has systematic biases that aren’t documented. Models trained on it inherit the biases.

Prevention: Require documentation as part of data delivery. Create data sheets for datasets. Make provenance traceable.
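
A minimal, machine-readable datasheet sketch in the spirit of datasheets for datasets; the fields and example values below are illustrative placeholders rather than a standard schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Datasheet:
    """Minimal provenance record shipped alongside a dataset."""
    name: str
    collected_by: str
    collection_period: str
    labeling_process: str     # who labeled it, under what conditions
    known_limitations: list   # the place to record time pressure, biases, gaps
    intended_uses: list
    source_systems: list      # provenance: where each field originated

sheet = Datasheet(
    name="retina-images-v2",
    collected_by="field clinics, 3 sites",
    collection_period="2022-01 to 2022-09",
    labeling_process="2 contract annotators per image, no adjudication step",
    known_limitations=["annotators worked under time pressure", "site C underrepresented"],
    intended_uses=["screening triage, not diagnosis"],
    source_systems=["clinic-emr-export", "camera-metadata"],
)

# Ship the datasheet with the data so downstream users inherit the context,
# not just the rows.
print(json.dumps(asdict(sheet), indent=2))
```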

The OODA Framework for Prevention

Applying the OODA loop (Observe, Orient, Decide, Act) to data governance creates a continuous prevention cycle:

OBSERVE: What data exists? What are its characteristics? Where does it come from?

  • Implement data cataloging
  • Track quality metrics
  • Monitor for drift
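
One common way to monitor for drift is to compare the live distribution of a key feature against the distribution the model was trained on, for example with the Population Stability Index (PSI). A minimal NumPy sketch; the 0.2 alert threshold is a widely used rule of thumb, not something from the paper:

```python
import numpy as np

def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    live_pct = np.histogram(live, bins=edges)[0] / len(live) + eps
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5_000)   # feature distribution at training time
live = rng.normal(0.4, 1.2, 5_000)        # feature distribution in production

score = psi(reference, live)
if score > 0.2:   # common rule-of-thumb threshold for significant shift
    print(f"Drift alert: PSI={score:.2f}")
```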

ORIENT: Can I trust this data? What are its limitations? Where are the risks?

  • Assess quality dimensions
  • Identify gaps and assumptions
  • Evaluate fitness for use
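
A quick fitness-for-use sketch that scores a pandas DataFrame on a few standard quality dimensions (completeness, uniqueness, validity); the column names and valid ranges are illustrative:

```python
import pandas as pd

def assess_quality(df: pd.DataFrame, valid_ranges: dict) -> dict:
    """Score a few basic quality dimensions; each value is in [0, 1]."""
    completeness = 1.0 - df.isna().mean().mean()   # share of non-missing cells
    uniqueness = 1.0 - df.duplicated().mean()      # share of non-duplicate rows
    validity_checks = [
        df[col].between(lo, hi).mean() for col, (lo, hi) in valid_ranges.items()
    ]
    validity = float(sum(validity_checks) / len(validity_checks)) if validity_checks else 1.0
    return {"completeness": round(float(completeness), 3),
            "uniqueness": round(float(uniqueness), 3),
            "validity": round(validity, 3)}

df = pd.DataFrame({"age": [34, 29, 150, None], "income": [52_000, 48_000, 51_000, 47_000]})
print(assess_quality(df, valid_ranges={"age": (0, 120)}))
```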

DECIDE: What should I prioritize? Which problems matter most?

  • Rank issues by downstream impact (see the sketch after this list)
  • Plan remediation
  • Allocate resources
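
A sketch of ranking issues by downstream impact: score each known issue by its severity and by how many downstream consumers inherit it, then triage from the top. The issues and scoring scheme are illustrative:

```python
# Each known data issue with an estimated severity (1-5) and the
# downstream consumers (pipelines, models, reports) that depend on it.
issues = [
    {"issue": "inconsistent crop labels", "severity": 4,
     "consumers": ["yield-model", "subsidy-report"]},
    {"issue": "missing sensor timestamps", "severity": 2,
     "consumers": ["drift-monitor"]},
    {"issue": "undocumented labeling guidelines", "severity": 5,
     "consumers": ["yield-model", "risk-model", "annual-audit"]},
]

def impact(issue):
    # Simple proxy: severity weighted by how many consumers inherit the problem.
    return issue["severity"] * len(issue["consumers"])

for issue in sorted(issues, key=impact, reverse=True):
    print(f"{impact(issue):>2}  {issue['issue']}")
```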

ACT: How do I improve data quality? How do I prevent cascades?

  • Implement fixes at the source
  • Strengthen documentation
  • Build prevention into processes

This isn’t a one-time assessment. It’s a continuous loop that catches problems before they cascade.

Warning Signs

Watch for these indicators that cascades may be developing:

  • Data quality issues discovered late in projects
  • Model performance that doesn’t transfer to deployment
  • Unexplained degradation over time
  • Domain experts surprised by system behavior
  • Documentation that doesn’t match actual data

The Investment Case

Preventing cascades requires upfront investment in data quality, domain expertise, and documentation. This investment feels expensive until you compare it to cascade costs:

  • Projects abandoned after significant investment
  • Costly post-deployment fixes
  • Harm to users from system failures
  • Reputation damage from visible failures

Sambasivan and colleagues found that cascades are “largely avoidable through intentional practices.” The cost of prevention is typically far less than the cost of recovery.


What data quality problems might be compounding in your systems right now? Who would notice before they become visible failures?

Related: 04-molecule—data-cascades-concept, 00-source—sambasivan-2021-data-cascades, 04-atom—data-governance