Provenance by Design
Provenance by design means building data provenance tracking into a system from the start rather than retrofitting it later. Provenance is an architectural concern, not a feature.
The Retrofit Problem
Adding provenance after the fact is expensive and incomplete. Systems not designed for provenance lack the hooks needed to capture transformation history, and history that was overwritten before capture began cannot be recovered.
Design Principles
- Immutable Logging: Record transformations, don’t overwrite
- Unique Identifiers: Every data element traceable
- Transformation Capture: What, when, who, why for each change
- Schema Evolution Tracking: How structures changed over time
- External Reference Linking: Connect to source systems
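As a rough illustration of how these principles might fit together, the sketch below models one provenance entry as a frozen record held in an append-only log. The names `ProvenanceRecord`, `ProvenanceLog`, and every field name are hypothetical and chosen only for this example; a real system would also need durable storage and access control.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record cannot be modified once created
class ProvenanceRecord:
    """One entry describing a single transformation (hypothetical schema)."""
    element_id: str                 # unique identifier of the data element
    what: str                       # transformation that was applied
    who: str                        # actor: a user or a service
    why: str                        # reason for the change
    schema_version: str             # schema in force when the change happened
    source_ref: str | None = None   # optional link to an external source system
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ProvenanceLog:
    """Append-only log: records are added, never updated or deleted."""

    def __init__(self) -> None:
        self._records: list[ProvenanceRecord] = []

    def append(self, record: ProvenanceRecord) -> None:
        self._records.append(record)

    def history(self, element_id: str) -> list[ProvenanceRecord]:
        """Return every recorded transformation of one element, oldest first."""
        return [r for r in self._records if r.element_id == element_id]
```

The log deliberately exposes no update or delete method, so the only way to express a correction is to append another record that explains it.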
Implementation Patterns
- Event sourcing: store events, derive state
- Versioned datasets: keep historical snapshots
- Lineage graphs: explicit dependency tracking
- Metadata catalogs: searchable provenance records
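Two of the patterns above can be sketched in a few lines. This is a minimal, in-memory illustration only: the event shape (`type`, `key`, `value`), the function `derive_state`, and the class `LineageGraph` are assumptions made for the example, not the API of any particular tool.

```python
from collections import defaultdict


# Event sourcing: the event list is the source of truth,
# and current state is derived by replaying it.
def derive_state(events):
    """Replay events in order to rebuild the current key/value state."""
    state = {}
    for event in events:
        if event["type"] == "set":
            state[event["key"]] = event["value"]
        elif event["type"] == "delete":
            state.pop(event["key"], None)
    return state


# Lineage graph: each dataset explicitly records what it was derived from.
class LineageGraph:
    def __init__(self):
        self._parents = defaultdict(set)  # output -> set of direct inputs

    def add_derivation(self, output, inputs):
        self._parents[output].update(inputs)

    def upstream(self, dataset):
        """Return all direct and transitive inputs of a dataset."""
        seen = set()
        stack = [dataset]
        while stack:
            node = stack.pop()
            for parent in self._parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


events = [
    {"type": "set", "key": "a", "value": 1},
    {"type": "set", "key": "b", "value": 2},
    {"type": "delete", "key": "a"},
]
print(derive_state(events))  # {'b': 2}
```

Because state is always recomputed from the events, replaying a prefix of the list reconstructs the state at any earlier point, which is what makes the history auditable.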
Cost-Benefit
Full provenance is expensive (storage, compute, complexity). Design decisions should match provenance granularity to actual audit and debugging needs.
AI Relevance
ML systems need training data provenance for:
- Debugging model behavior
- Compliance with data rights
- Bias investigation
- Reproducibility
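One lightweight way to support these needs is to store a manifest next to each training set. The sketch below is an assumption about what such a manifest could contain; the function names and the fields (`source`, `license`, `sha256`) are hypothetical, and the fingerprint is order-sensitive, so the same records in a different order produce a different hash.

```python
import hashlib
import json


def fingerprint(records):
    """Deterministic SHA-256 over records serialized with sorted keys."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")  # separator so record boundaries affect the hash
    return digest.hexdigest()


def build_manifest(name, records, source, license_id):
    """Describe a training set: where it came from, its terms, its content hash."""
    return {
        "dataset": name,
        "source": source,          # where the data was obtained (hypothetical field)
        "license": license_id,     # usage terms, for data-rights checks
        "num_records": len(records),
        "sha256": fingerprint(records),
    }


records = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
manifest = build_manifest("toy-set", records, "internal-export", "CC-BY-4.0")
```

Recording the manifest hash alongside a trained model lets a later investigation confirm exactly which data the model saw before debugging behavior or checking for bias.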