The Data Cascades Practitioner Guide

Why 92% of AI Practitioners Encounter Compounding Data Problems - And How to Prevent Them


Google researchers interviewed 53 AI practitioners building systems for healthcare, conservation, and financial services. The finding that should concern every data leader: 92% experienced data cascades - compounding problems that originated in data but manifested as system failures weeks or months later.

The title of their paper says it all: “Everyone wants to do the model work, not the data work.”

This guide translates their research into actionable prevention strategies.

What Data Cascades Are

A data cascade is not a single data quality issue. It’s a compounding pattern:

  1. Origin: A seemingly minor data problem (inconsistent labeling, undocumented assumptions, incomplete collection)
  2. Propagation: The problem flows downstream through pipelines, training, evaluation
  3. Amplification: Effects compound as multiple issues interact
  4. Manifestation: System failures, costly rework, or harm to users - often far removed from the original cause

The delay between cause and effect makes cascades particularly dangerous. By the time the problem is visible, it has propagated through multiple systems. Fixing it requires tracing back through layers of dependencies.

Four Root Causes

The research identified four primary triggers:

1. Physical World Brittleness

Models trained on clean, curated data encounter noisy real-world inputs. The gap between training conditions and deployment conditions creates cascades.

Example: An eye disease detection model trained on laboratory-quality images fails when deployed with images containing dust specks, poor lighting, or patient movement.

Prevention: Include realistic noise and variation in training data. Test explicitly on degraded inputs. Design for graceful degradation when input quality drops.
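
A minimal sketch of this kind of stress test, assuming a classifier with a scikit-learn-style predict method and test images as NumPy arrays scaled to [0, 1]; the noise levels and darkening factor are illustrative choices, not values from the research:

```python
import numpy as np

def degrade(images, noise_std=0.1, darken=0.6, rng=None):
    """Simulate field conditions: sensor noise plus poor lighting."""
    rng = rng or np.random.default_rng(0)
    noisy = images + rng.normal(0.0, noise_std, images.shape)
    return np.clip(noisy * darken, 0.0, 1.0)

def stress_test(model, images, labels):
    """Compare accuracy on clean inputs against progressively degraded ones."""
    results = {"clean": float((model.predict(images) == labels).mean())}
    for noise_std in (0.05, 0.1, 0.2):
        degraded = degrade(images, noise_std=noise_std)
        results[f"noise_std={noise_std}"] = float((model.predict(degraded) == labels).mean())
    return results

# A large gap between the clean score and the degraded scores is an early
# warning that the model is brittle to real-world input quality.
```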

2. Inadequate Domain Expertise

ML teams without deep domain knowledge make decisions that seem technically reasonable but fail to reflect how the data relates to the real world.

Example: A model predicting crop yields uses satellite imagery without understanding seasonal patterns, soil variation, or farming practices that affect interpretation.

Prevention: Embed domain experts in data teams. Create feedback loops from deployment back to data definition. Document domain assumptions explicitly.
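
One way to keep domain assumptions from remaining tribal knowledge is to encode them as executable checks. A sketch using pandas, with a hypothetical growing-season table and column names (capture_date, region) that a real project would get from its embedded domain experts:

```python
import pandas as pd

# Hypothetical growing-season windows; real values would come from agronomists.
GROWING_SEASON = {"region_a": ("03-01", "09-30"), "region_b": ("10-01", "04-30")}

def check_in_season(df: pd.DataFrame) -> pd.Series:
    """Flag satellite captures taken outside the documented growing season."""
    def in_season(row):
        start, end = GROWING_SEASON[row["region"]]
        month_day = row["capture_date"].strftime("%m-%d")
        if start <= end:
            return start <= month_day <= end
        return month_day >= start or month_day <= end  # season spans year end

    return df.apply(in_season, axis=1)

df = pd.DataFrame({
    "region": ["region_a", "region_a", "region_b"],
    "capture_date": pd.to_datetime(["2023-05-10", "2023-12-01", "2023-11-15"]),
})
violations = df[~check_in_season(df)]
print(f"{len(violations)} capture(s) fall outside the documented growing season")
```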

3. Conflicting Incentives

Organizations reward model work (papers, demos, metrics) while treating data work as low-status maintenance. This creates systematic underinvestment in data quality.

Example: Researchers clean data just enough for publication benchmarks, leaving undocumented issues for production deployment.

Prevention: Make data quality a first-class deliverable. Reward data work equivalently to model work. Track data quality metrics alongside model performance.
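
A sketch of what treating data quality as a first-class deliverable can look like in code: an experiment report that refuses to record a model score without accompanying data metrics. The class and field names are illustrative, not an established API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExperimentReport:
    """A model result is only reportable together with its data quality metrics."""
    model_name: str
    model_metrics: dict   # e.g. {"auc": 0.91}
    data_metrics: dict    # e.g. {"label_agreement": 0.83, "missing_rate": 0.02}
    recorded_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self):
        required = {"label_agreement", "missing_rate"}
        missing = required - self.data_metrics.keys()
        if missing:
            raise ValueError(f"Report rejected: missing data metrics {missing}")

report = ExperimentReport(
    model_name="yield-predictor-v3",
    model_metrics={"auc": 0.91},
    data_metrics={"label_agreement": 0.83, "missing_rate": 0.02},
)
```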

4. Poor Documentation

When data context isn’t documented, downstream users make wrong assumptions. Those assumptions propagate into models, which then encode the errors.

Example: A dataset labeled by contractors under time pressure has systematic biases that aren’t documented. Models trained on it inherit the biases.

Prevention: Require documentation as part of data delivery. Create data sheets for datasets. Make provenance traceable.
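
A minimal, machine-readable datasheet sketch in the spirit of datasheets for datasets; the fields and example values below are illustrative placeholders rather than a standard schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Datasheet:
    """Minimal provenance record shipped alongside a dataset."""
    name: str
    collected_by: str
    collection_period: str
    labeling_process: str     # who labeled it, under what conditions
    known_limitations: list   # the place to record time pressure, biases, gaps
    intended_uses: list
    source_systems: list      # provenance: where each field originated

sheet = Datasheet(
    name="retina-images-v2",
    collected_by="field clinics, 3 sites",
    collection_period="2022-01 to 2022-09",
    labeling_process="2 contract annotators per image, no adjudication step",
    known_limitations=["annotators worked under time pressure", "site C underrepresented"],
    intended_uses=["screening triage, not diagnosis"],
    source_systems=["clinic-emr-export", "camera-metadata"],
)

# Ship the datasheet with the data so downstream users inherit the context,
# not just the rows.
print(json.dumps(asdict(sheet), indent=2))
```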

The OODA Framework for Prevention

Applying the OODA loop (Observe, Orient, Decide, Act) to data governance creates a continuous prevention cycle:

OBSERVE: What data exists? What are its characteristics? Where does it come from?

  • Implement data cataloging
  • Track quality metrics
  • Monitor for drift
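
One common way to monitor for drift is to compare the live distribution of a key feature against the distribution the model was trained on, for example with the Population Stability Index (PSI). A minimal NumPy sketch; the 0.2 alert threshold is a widely used rule of thumb, not something from the paper:

```python
import numpy as np

def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    live_pct = np.histogram(live, bins=edges)[0] / len(live) + eps
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5_000)   # feature distribution at training time
live = rng.normal(0.4, 1.2, 5_000)        # feature distribution in production

score = psi(reference, live)
if score > 0.2:   # common rule-of-thumb threshold for significant shift
    print(f"Drift alert: PSI={score:.2f}")
```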

ORIENT: Can I trust this data? What are its limitations? Where are the risks?

  • Assess quality dimensions
  • Identify gaps and assumptions
  • Evaluate fitness for use
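
A quick fitness-for-use sketch that scores a pandas DataFrame on a few standard quality dimensions (completeness, uniqueness, validity); the column names and valid ranges are illustrative:

```python
import pandas as pd

def assess_quality(df: pd.DataFrame, valid_ranges: dict) -> dict:
    """Score a few basic quality dimensions; each value is in [0, 1]."""
    completeness = 1.0 - df.isna().mean().mean()   # share of non-missing cells
    uniqueness = 1.0 - df.duplicated().mean()      # share of non-duplicate rows
    validity_checks = [
        df[col].between(lo, hi).mean() for col, (lo, hi) in valid_ranges.items()
    ]
    validity = float(sum(validity_checks) / len(validity_checks)) if validity_checks else 1.0
    return {"completeness": round(float(completeness), 3),
            "uniqueness": round(float(uniqueness), 3),
            "validity": round(validity, 3)}

df = pd.DataFrame({"age": [34, 29, 150, None], "income": [52_000, 48_000, 51_000, 47_000]})
print(assess_quality(df, valid_ranges={"age": (0, 120)}))
```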

DECIDE: What should I prioritize? Which problems matter most?

  • Rank issues by downstream impact (see the sketch after this list)
  • Plan remediation
  • Allocate resources
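
A sketch of ranking issues by downstream impact: score each known issue by its severity and by how many downstream consumers inherit it, then triage from the top. The issues and scoring scheme are illustrative:

```python
# Each known data issue with an estimated severity (1-5) and the
# downstream consumers (pipelines, models, reports) that depend on it.
issues = [
    {"issue": "inconsistent crop labels", "severity": 4,
     "consumers": ["yield-model", "subsidy-report"]},
    {"issue": "missing sensor timestamps", "severity": 2,
     "consumers": ["drift-monitor"]},
    {"issue": "undocumented labeling guidelines", "severity": 5,
     "consumers": ["yield-model", "risk-model", "annual-audit"]},
]

def impact(issue):
    # Simple proxy: severity weighted by how many consumers inherit the problem.
    return issue["severity"] * len(issue["consumers"])

for issue in sorted(issues, key=impact, reverse=True):
    print(f"{impact(issue):>2}  {issue['issue']}")
```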

ACT: How do I improve data quality? How do I prevent cascades?

  • Implement fixes at the source
  • Strengthen documentation
  • Build prevention into processes

This isn’t a one-time assessment. It’s a continuous loop that catches problems before they cascade.

Warning Signs

Watch for these indicators that cascades may be developing:

  • Data quality issues discovered late in projects
  • Model performance that doesn’t transfer to deployment
  • Unexplained degradation over time
  • Domain experts surprised by system behavior
  • Documentation that doesn’t match actual data

The Investment Case

Preventing cascades requires upfront investment in data quality, domain expertise, and documentation. This investment feels expensive until you compare it to cascade costs:

  • Projects abandoned after significant investment
  • Costly post-deployment fixes
  • Harm to users from system failures
  • Reputation damage from visible failures

Sambasivan and colleagues found that cascades are “largely avoidable through intentional practices.” The cost of prevention is typically far less than the cost of recovery.


What data quality problems might be compounding in your systems right now? Who would notice before they become visible failures?

Related: 04-molecule—data-cascades-concept, 00-source—sambasivan-2021-data-cascades, 04-atom—data-governance