Sambasivan et al. 2021 — Data Cascades in High-Stakes AI

Citation

Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ‘21), Yokohama, Japan. ACM.

Core Contribution

Defines and documents “data cascades”: compounding events that cause negative downstream effects arising from data issues in AI systems. Based on qualitative interviews with 53 AI practitioners across India, East and West Africa, and the US, working in high-stakes domains (health, conservation, credit, public safety).

Key Findings

  • 92% of practitioners experienced at least one data cascade
  • 45.3% experienced two or more cascades in a single project
  • Cascades are opaque, delayed, and often avoidable
  • Root causes are primarily organizational and incentive-based, not purely technical

Four Cascade Triggers

  1. Physical world brittleness (54.7%): Training data too clean for messy, real-world deployment conditions (see the sketch after this list)
  2. Inadequate application-domain expertise (43.4%): AI practitioners making decisions beyond their knowledge
  3. Conflicting reward systems (32.1%): Misaligned incentives between practitioners, domain experts, and field partners
  4. Poor cross-organizational documentation (20.8%): Missing metadata and context
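The paper presents these triggers as organizational and qualitative findings, not technical prescriptions, but a small sketch can make trigger 1 concrete. The following is an illustrative assumption of mine, not anything from the paper: a minimal drift check that flags features whose distribution in noisy field data diverges from the curated training set. The function name, the sensor_reading feature, and the 0.05 threshold are all hypothetical.

```python
# Illustrative sketch only (not from the paper): flag "physical world
# brittleness" by comparing each feature's distribution in the curated
# training set against noisier field data.
import numpy as np
from scipy.stats import ks_2samp


def flag_brittle_features(train: dict, field: dict, alpha: float = 0.05) -> list:
    """Return feature names whose field distribution diverges from training."""
    flagged = []
    for name, train_values in train.items():
        field_values = field.get(name)
        if field_values is None:
            continue  # feature missing in field data: itself a documentation gap
        stat, p_value = ks_2samp(train_values, field_values)
        if p_value < alpha:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical feature: clean training data vs. noisier, shifted field data.
    train = {"sensor_reading": rng.normal(0.0, 1.0, 1000)}
    field = {"sensor_reading": rng.normal(0.5, 2.0, 1000)}
    print(flag_brittle_features(train, field))  # -> ['sensor_reading']
```

Even where such checks exist, the paper's argument is that cascades persist because the surrounding incentives and documentation practices fail, not because the statistics are hard.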

Atoms Extracted

Molecules Extracted

Why This Source Matters

This paper empirically grounds what practitioners intuit: AI’s “data problem” is really an incentive and organizational problem. The HCI framing (published at CHI) positions data quality as a human systems challenge, not a purely technical one. It connects directly to themes of invisible labor, provenance, and the gap between demo and deployment.