Provenance by Design

Provenance by design means building data provenance tracking into a system from the start, rather than retrofitting it later. Provenance is an architectural concern, not a feature.

The Retrofit Problem

Adding provenance after the fact is expensive, and the result is usually incomplete. Systems not designed for provenance lack the hooks needed to capture transformation history, and history that was never recorded cannot be reconstructed later.

Design Principles

  • Immutable Logging: Record transformations, don’t overwrite
  • Unique Identifiers: Every data element traceable
  • Transformation Capture: What, when, who, why for each change
  • Schema Evolution Tracking: How structures changed over time
  • External Reference Linking: Connect to source systems
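
As a rough illustration of how these principles might fit together, the sketch below models a provenance entry as a frozen dataclass inside an append-only log. All names here (ProvenanceRecord, ProvenanceLog, the "customer-42" identifier, the crm:// reference) are hypothetical and chosen for the example; a real system would likely persist records durably rather than keep them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple
import uuid


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ProvenanceRecord:
    """One immutable entry: what changed, when, who did it, and why."""
    element_id: str                    # unique identifier of the data element
    what: str                          # description of the transformation
    who: str                           # actor (user or service) responsible
    why: str                           # stated reason for the change
    source_ref: Optional[str] = None   # link to an external source system
    schema_version: int = 1            # schema in effect when the change was made
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ProvenanceLog:
    """Append-only log: records are added, never updated or deleted."""

    def __init__(self) -> None:
        self._records: List[ProvenanceRecord] = []

    def append(self, record: ProvenanceRecord) -> None:
        self._records.append(record)

    def history(self, element_id: str) -> Tuple[ProvenanceRecord, ...]:
        """Return every record for one element, oldest first."""
        return tuple(r for r in self._records if r.element_id == element_id)


# Hypothetical usage: two transformations applied to the same element.
log = ProvenanceLog()
log.append(ProvenanceRecord("customer-42", "imported row", "etl-service",
                            "initial load", source_ref="crm://customers/42"))
log.append(ProvenanceRecord("customer-42", "normalized email to lowercase",
                            "cleaning-job", "deduplication"))
history = log.history("customer-42")
```

Here the log exposes no update or delete method, so a correction has to be expressed as a new record rather than an overwrite.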

Implementation Patterns

  • Event sourcing: store events, derive state
  • Versioned datasets: keep historical snapshots
  • Lineage graphs: explicit dependency tracking
  • Metadata catalogs: searchable provenance records
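
Two of these patterns can be sketched in a few lines. The example below is a minimal illustration, not a production design: the event format, the dataset names, and the replay and upstream helpers are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Event sourcing: the event list is the source of truth; state is derived.
Event = Tuple[str, str, object]  # (operation, key, value)


def replay(events: List[Event]) -> Dict[str, object]:
    """Rebuild state by applying events in order."""
    state: Dict[str, object] = {}
    for op, key, value in events:
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


events: List[Event] = [
    ("set", "price", 10),
    ("set", "price", 12),
    ("set", "currency", "EUR"),
    ("delete", "currency", None),
]
current = replay(events)           # state after all events
as_of_first = replay(events[:1])   # state after the first event only


# Lineage graph: each dataset lists the datasets it was derived from.
lineage: Dict[str, List[str]] = {
    "report": ["joined"],
    "joined": ["orders_clean", "customers"],
    "orders_clean": ["orders_raw"],
}


def upstream(node: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every transitive ancestor of a dataset."""
    seen: Set[str] = set()
    stack = list(graph.get(node, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen


sources = upstream("report", lineage)
```

Because the events are kept, any earlier state can be recomputed by replaying a prefix of the list, and the lineage traversal answers "which inputs could have affected this output" without inspecting the data itself.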

Cost-Benefit

Full provenance is expensive (storage, compute, complexity). Design decisions should match provenance granularity to actual audit and debugging needs.

AI Relevance

ML systems need training data provenance for:

  • Debugging model behavior
  • Compliance with data rights
  • Bias investigation
  • Reproducibility
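
One possible way to support these needs is to record a small manifest alongside each training run. The sketch below is an assumption-laden illustration: the manifest fields, the build_manifest helper, the example:// source reference, and the license value are hypothetical, and the fingerprint is sensitive to row order by design.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(rows: List[Dict[str, str]]) -> str:
    """SHA-256 of a canonical JSON encoding of the rows (order-sensitive)."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(name: str, rows: List[Dict[str, str]],
                   source: str, license_id: str) -> Dict[str, object]:
    """Collect what a later audit would need: origin, rights, and content hash."""
    return {
        "dataset": name,
        "source": source,          # where the data came from
        "license": license_id,     # data rights attached to it
        "row_count": len(rows),
        "sha256": fingerprint(rows),
    }


# Hypothetical training set and manifest.
rows = [
    {"text": "great product", "label": "positive"},
    {"text": "arrived broken", "label": "negative"},
]
manifest = build_manifest("reviews-v1", rows,
                          source="example://reviews/2024",
                          license_id="CC-BY-4.0")

# Any change to the data yields a different fingerprint.
changed = rows[:1] + [{"text": "arrived broken", "label": "positive"}]
same_data = fingerprint(rows) == manifest["sha256"]
different_data = fingerprint(changed) != manifest["sha256"]
```

Storing the manifest with the trained model makes it possible to check later whether a given dataset is the one the model was actually trained on.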

Related: 04-atom—data-provenance, 04-atom—data-governance