Data Provenance
The documented history of a data element: where it came from, how it was transformed, and who touched it along the way. The “chain of custody” for data.
What It Captures
Origin: Source system, original collection method, timestamp
Transformations: Every processing step, filter, aggregation, join
Actors: Systems and people who created, modified, or approved changes
Lineage: Both upstream (what fed into this) and downstream (what depends on this)
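The four categories above can be sketched as a single record type. This is a minimal illustration, not a standard schema: the class names, field names, and sample values (`ProvenanceRecord`, `Step`, `"crm"`, `"etl-service"`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Step:
    """One transformation applied to the data element."""
    operation: str  # e.g. "filter", "aggregate", "join"
    actor: str      # system or person that performed or approved it
    at: datetime


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record for one data element."""
    element_id: str
    source_system: str       # origin: where the data came from
    collection_method: str   # origin: how it was collected
    collected_at: datetime   # origin: when it was collected
    steps: list[Step] = field(default_factory=list)       # transformations
    upstream: list[str] = field(default_factory=list)     # what fed into this
    downstream: list[str] = field(default_factory=list)   # what depends on this

    def record_step(self, operation: str, actor: str) -> None:
        """Append a transformation, stamped with the current UTC time."""
        self.steps.append(Step(operation, actor, datetime.now(timezone.utc)))

    def actors(self) -> list[str]:
        """Everyone (or everything) that touched the element, de-duplicated."""
        return sorted({s.actor for s in self.steps})


# Illustrative values only.
record = ProvenanceRecord(
    element_id="orders_daily",
    source_system="crm",
    collection_method="api_export",
    collected_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    upstream=["orders_raw"],
)
record.record_step("filter", "etl-service")
record.record_step("aggregate", "analyst")
```

Keeping actors on each step, rather than in a separate list, ties every "who" to a specific "what" and "when".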
Why It Matters
Debugging: When something’s wrong, provenance lets you trace back to the source
Compliance: Regulations may require proving data handling practices
Trust: Users can assess reliability based on source and handling
Reproducibility: Understanding exactly how a result was produced
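For the debugging case, tracing back to the source amounts to walking upstream lineage until nothing further feeds in. A rough sketch, assuming lineage is stored as a plain mapping from each element to its direct inputs (the function name and dataset names are hypothetical):

```python
def trace_sources(upstream: dict[str, list[str]], element_id: str) -> list[str]:
    """Return the root sources reachable from element_id via upstream edges.

    An element with no recorded inputs is treated as a root source.
    """
    seen: set[str] = set()
    roots: set[str] = set()
    stack = [element_id]
    while stack:
        node = stack.pop()
        if node in seen:  # guards against cycles and repeated inputs
            continue
        seen.add(node)
        parents = upstream.get(node, [])
        if not parents:
            roots.add(node)
        stack.extend(parents)
    return sorted(roots)


# Illustrative lineage: a report built from cleaned sales data and FX rates.
lineage = {
    "report": ["sales_clean", "fx_rates"],
    "sales_clean": ["sales_raw"],
}
sources = trace_sources(lineage, "report")  # ["fx_rates", "sales_raw"]
```

The same walk over downstream edges would answer the opposite question: what is affected if a given source turns out to be wrong.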
AI-Specific Importance
Training data shapes model behavior, so its provenance matters. If you can’t trace where your training data came from, you can’t audit for bias, copyright issues, or data poisoning. For RAG systems, provenance enables citation and fact-checking.
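As a small illustration of the RAG point, retrieved chunks can carry their source with them so that an answer is emitted together with its citations. This is only a sketch under assumed names (`Chunk`, `with_citations`, and the sample sources are hypothetical) and does not depend on any particular retrieval library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved passage plus the provenance needed to cite it."""
    text: str
    source: str  # hypothetical document identifier or title


def with_citations(answer: str, chunks: list[Chunk]) -> str:
    """Append a numbered source list, one entry per distinct source."""
    sources: list[str] = []
    for chunk in chunks:
        if chunk.source not in sources:  # keep first-seen order, no repeats
            sources.append(chunk.source)
    lines = [answer, "", "Sources:"]
    lines += [f"[{i}] {src}" for i, src in enumerate(sources, start=1)]
    return "\n".join(lines)


# Illustrative retrieval result.
retrieved = [
    Chunk("Provenance is the documented history of data.", "glossary.md"),
    Chunk("Lineage covers upstream and downstream links.", "lineage-guide.md"),
    Chunk("Provenance supports audits.", "glossary.md"),
]
cited = with_citations("Provenance records where data came from.", retrieved)
```

Without the `source` field on each chunk, there is nothing to check the answer against, which is why provenance has to be preserved at indexing time rather than reconstructed afterward.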
The Cost
Comprehensive provenance tracking adds overhead in storage, processing, and maintenance. Organizations must decide what level of provenance is worth the investment for each type of data.
Related: 04-atom—data-governance, 02-molecule—content-provenance-principle, 05-atom—hallucination-inherent