Data Provenance

The documented history of a data element: where it came from, how it was transformed, and who touched it along the way. The “chain of custody” for data.

What It Captures

Origin: Source system, original collection method, timestamp
Transformations: Every processing step, filter, aggregation, join
Actors: Systems and people who created, modified, or approved changes
Lineage: Both upstream (what fed into this) and downstream (what depends on this)
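As a rough illustration, the four categories above could be held in a single record per data element. This is a minimal sketch, not a standard schema: the class names, field names, and example values (`orders_daily`, `etl-service`, `alice`) are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Transformation:
    step: str       # what was done, e.g. a filter, aggregation, or join
    actor: str      # system or person that performed or approved the step
    at: datetime    # when the step was recorded


@dataclass
class ProvenanceRecord:
    element_id: str
    # Origin
    source_system: str
    collection_method: str
    collected_at: datetime
    # Transformations (each entry also names its actor)
    transformations: list[Transformation] = field(default_factory=list)
    # Lineage
    upstream: list[str] = field(default_factory=list)    # what fed into this
    downstream: list[str] = field(default_factory=list)  # what depends on this

    def record_step(self, step: str, actor: str) -> None:
        """Append a processing step with the current UTC time."""
        self.transformations.append(
            Transformation(step, actor, datetime.now(timezone.utc))
        )

    def actors(self) -> list[str]:
        """Distinct actors, in order of first appearance."""
        seen: list[str] = []
        for t in self.transformations:
            if t.actor not in seen:
                seen.append(t.actor)
        return seen


# Hypothetical example values, for illustration only.
record = ProvenanceRecord(
    element_id="orders_daily",
    source_system="orders_db",
    collection_method="nightly export",
    collected_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    upstream=["orders_raw"],
)
record.record_step("filter: drop test orders", actor="etl-service")
record.record_step("aggregation: sum by day", actor="etl-service")
record.record_step("approved schema change", actor="alice")
```

Keeping the actor on each transformation, rather than in a separate list, means "who touched it" can always be answered per step.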

Why It Matters

Debugging: When something’s wrong, provenance lets you trace back to the source
Compliance: Regulations may require proving data handling practices
Trust: Users can assess reliability based on source and handling
Reproducibility: Understanding exactly how a result was produced
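The debugging case can be sketched as a walk over upstream lineage until the original sources are reached. The lineage map and dataset names below are made up for illustration, and a real system would typically read this from a metadata store rather than a dict.

```python
# Hypothetical lineage map: each dataset name -> the datasets that fed into it.
UPSTREAM = {
    "revenue_report": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}


def trace_to_sources(name: str, upstream: dict[str, list[str]]) -> set[str]:
    """Return every root source (a dataset with no upstream) reachable from name."""
    sources: set[str] = set()
    visited: set[str] = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current in visited:
            continue  # guards against shared parents and accidental cycles
        visited.add(current)
        parents = upstream.get(current, [])
        if not parents:
            sources.add(current)
        stack.extend(parents)
    return sources


roots = trace_to_sources("revenue_report", UPSTREAM)
```

If `revenue_report` looks wrong, `roots` narrows the search to the original inputs it ultimately depends on.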

AI-Specific Importance

Training data provenance affects model behavior. If you can’t trace where your training data came from, you can’t audit for bias, copyright issues, or data poisoning. For RAG systems, provenance enables citation and fact-checking.
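For the RAG case, one way to picture this is a retrieved chunk that carries its source alongside its text, so an answer can list what it drew on. This is only a sketch under assumed names: `Chunk`, `cite`, and the example source identifier are hypothetical and do not correspond to any particular retrieval library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str     # hypothetical document identifier
    collected: str  # ISO date on which the source was collected


def cite(answer: str, chunks: list[Chunk]) -> str:
    """Append a numbered source list so each claim can be checked against its origin."""
    lines = [answer, "", "Sources:"]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] {chunk.source} (collected {chunk.collected})")
    return "\n".join(lines)


# Made-up example content, for illustration only.
chunks = [Chunk("Policy X took effect in 2021.", "policy-handbook-v3", "2024-03-01")]
cited = cite("Policy X took effect in 2021 [1].", chunks)
```

Without the `source` and `collected` fields on each chunk, there would be nothing to cite and no way to fact-check the answer against its inputs.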

The Cost

Comprehensive provenance tracking adds overhead in storage, processing, and maintenance. Organizations must decide what level of provenance is worth the investment for different data types.

Related: 04-atom—data-governance, 02-molecule—content-provenance-principle, 05-atom—hallucination-inherent