Data Versioning Complexity
While software engineering has mature, well-designed tools for versioning code (git and its ecosystem), nothing comparable exists for data.
The challenge: a given dataset may mix data from several different schema regimes. A single engineer who gathers and processes the data can keep track of these unwritten details, but as projects scale, maintaining this tribal knowledge becomes a burden. The sketch below makes this concrete.
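A minimal, hypothetical illustration (the field names and values are invented): the logging schema changed mid-project, so one dataset holds rows from two regimes, and nothing in the rows themselves records which is which.

```python
# Hypothetical example of two schema regimes coexisting in one dataset.
old_regime = {"user": "alice", "ts": 1609459200}            # epoch seconds
new_regime = {"user_id": 42, "ts": "2021-01-01T00:00:00Z"}  # renamed key, ISO-8601 string

# Downstream code must "just know" which rows follow which regime.
dataset = [old_regime, new_regime]
```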
Key differences between code versioning and data versioning:
| Code versioning | Data versioning |
|---|---|
| Changes are discrete commits | Changes can be continuous streams |
| Schema is explicit in type systems | Schema often implicit or undocumented |
| Diffs are human-readable | Data diffs are hard to interpret |
| Size is manageable | Volume can be massive |
| Provenance is clear (who changed what) | Provenance requires explicit tracking |
Proposed solutions from the research community include “datasheets for datasets” (Gebru et al.), which document a dataset's motivation, composition, and collection process, and tools like Datadiff that enable comparison between dataset versions.
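A datasheet need not be heavyweight. As a rough sketch, it could be captured as structured metadata checked in alongside the dataset; the fields below mirror a few of Gebru et al.'s section headings, but the dataclass itself (and every value in it) is an illustrative assumption, not a format the paper prescribes.

```python
from dataclasses import dataclass

@dataclass
class Datasheet:
    """Minimal datasheet, loosely following Gebru et al.'s section headings."""
    motivation: str          # why the dataset was created
    composition: str         # what the instances are and how many
    collection_process: str  # how the data was acquired
    preprocessing: str       # cleaning/labeling applied to the raw data
    recommended_uses: str    # tasks the dataset is (un)suited for

# Hypothetical filled-in example:
sheet = Datasheet(
    motivation="Train a churn model for the 2021 product line.",
    composition="120k user-event rows, one row per session.",
    collection_process="Exported nightly from the events warehouse.",
    preprocessing="Dropped bot traffic; timestamps normalized to UTC.",
    recommended_uses="Churn prediction; not suited for demographic analysis.",
)
```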
Best practice: “Each model is tagged with a provenance tag that explains with which data it has been trained on and which version of the model. Each dataset is tagged with information about where it originated from and which version of the code was used to extract it.”
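As a minimal sketch of what such tags might look like in practice (the structure, field names, and values here are assumptions, not a prescribed format), a content hash of the data can serve as the dataset version, recorded alongside the git commit of the extraction code:

```python
import hashlib
import json
from dataclasses import dataclass

def fingerprint(records: list[dict]) -> str:
    """Deterministic content hash of a dataset; serves as its version ID."""
    blob = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

@dataclass
class DatasetTag:
    source: str            # where the data originated
    extractor_commit: str  # git commit of the code that extracted it
    version: str           # content fingerprint of the data itself

@dataclass
class ModelTag:
    model_version: str
    dataset_version: str   # links the model back to its training data

# Hypothetical usage: tag the dataset at extraction time, then tag the
# trained model with the dataset's version.
records = [{"user_id": 42, "label": 1}]
data_tag = DatasetTag(source="events_warehouse",
                      extractor_commit="3f9c2ab",
                      version=fingerprint(records))
model_tag = ModelTag(model_version="churn-v7",
                     dataset_version=data_tag.version)
```

With both tags in place, the chain model → dataset → extraction code can be reconstructed without relying on tribal knowledge.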
Related: 04-atom—data-provenance, 05-atom—three-ml-engineering-differences