ML Technical Debt Taxonomy
Overview
A systematic categorization of the ways machine learning systems accumulate maintenance burden over time, beyond the technical debt familiar from traditional software engineering.
The Core Insight
ML systems have all the maintenance problems of traditional code plus an additional set of ML-specific issues. These issues exist at the system level rather than the code level, making them harder to detect with traditional tooling.
Debt Categories
Boundary Erosion: ML models erode abstraction boundaries because they’re used precisely when the desired behavior can’t be expressed in software logic without depending on external data. This manifests as entanglement (the CACE principle: Changing Anything Changes Everything), correction cascades, and undeclared consumers.
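As a rough illustration of entanglement, the sketch below fits a tiny least-squares model twice: once with two correlated inputs and once with one of them removed. The data, the feature names x1 and x2, and the helper functions are invented for this example; the point is only that dropping one input changes the weight learned for the other.

```python
def fit_one(x, y):
    """Least-squares weight for a single feature (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_two(x1, x2, y):
    """Least-squares weights for two features via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2


# Toy data (an assumption for illustration): y is exactly x1 + x2,
# and x1 and x2 are strongly correlated.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 5.0]
y = [a + b for a, b in zip(x1, x2)]

w1_joint, w2_joint = fit_two(x1, x2, y)  # both features present
w1_alone = fit_one(x1, y)                # x2 removed from the model

print(f"weight of x1 with x2 present: {w1_joint:.3f}")
print(f"weight of x1 after removing x2: {w1_alone:.3f}")
```

With x2 present, x1 gets a weight of 1. Once x2 is removed, x1 absorbs most of its signal and its weight roughly doubles, so no input can be changed in isolation.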
Data Dependencies: Data dependencies cost more than code dependencies and are harder to detect. They include unstable dependencies (signals that change over time), underutilized dependencies (legacy or bundled features), and the absence of static analysis tools for data flows.
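One common way to surface underutilized dependencies is a leave-one-feature-out evaluation: retrain or re-score without each feature and see how little the metric moves. The sketch below assumes a caller-supplied evaluate function that returns a quality score for a given feature set; the feature names, the toy scoring function, and the tolerance are hypothetical.

```python
def find_underutilized(features, evaluate, tolerance=0.001):
    """Return {feature: score_drop} for features whose removal costs at most `tolerance`."""
    all_features = frozenset(features)
    baseline = evaluate(all_features)
    candidates = {}
    for feature in sorted(all_features):
        drop = baseline - evaluate(all_features - {feature})
        if drop <= tolerance:
            candidates[feature] = drop
    return candidates


# Toy stand-in for "train and evaluate a model on this feature set".
# The contribution values are made up for illustration.
CONTRIBUTION = {
    "query_len": 0.20,
    "user_country": 0.05,
    "legacy_id": 0.0,        # legacy feature that no longer carries signal
    "bundle_extra": 0.0005,  # came along with a bundle, adds almost nothing
}


def toy_evaluate(feature_set):
    return 0.5 + sum(CONTRIBUTION[name] for name in sorted(feature_set))


removable = find_underutilized(CONTRIBUTION, toy_evaluate)
print(sorted(removable))
```

In a real system each call to evaluate is a full training run, so this check is expensive and usually run periodically rather than on every change.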
Feedback Loops: Live ML systems often influence their own behavior as they are updated over time. Direct loops occur when a model affects the selection of its own future training data. Hidden loops occur when systems influence each other through real-world interactions without any direct data connection.
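A direct loop can be sketched with a deterministic toy simulation, under the assumption that the model only receives new training data for the item it chose to show. The item names, rates, and the explore_every parameter are invented for this example, and periodic exploration is one possible mitigation rather than the only one.

```python
def simulate(rounds, explore_every=0):
    """Greedy item selection where only the shown item yields new data.

    If explore_every > 0, every explore_every-th round shows the
    least-shown item instead of the item with the highest estimate.
    """
    true_rate = {"A": 0.5, "B": 0.6}
    estimate = {"A": 0.5, "B": 0.4}  # B starts out underestimated
    shown = {"A": 1, "B": 1}
    for step in range(1, rounds + 1):
        if explore_every and step % explore_every == 0:
            item = min(shown, key=shown.get)
        else:
            item = max(estimate, key=estimate.get)
        # Update the running mean with the expected outcome of the shown item.
        n = shown[item]
        estimate[item] = (estimate[item] * n + true_rate[item]) / (n + 1)
        shown[item] = n + 1
    return estimate


greedy = simulate(rounds=100)
explored = simulate(rounds=100, explore_every=5)
print("greedy estimate for B:  ", greedy["B"])
print("explored estimate for B:", round(explored["B"], 3))
```

Under the greedy policy B is never shown, so its wrong estimate is never corrected. With occasional exploration the estimate for B moves toward its true rate.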
System Anti-Patterns: Common designs that increase long-term cost: glue code (in a mature system, as much as 95% of the code can be glue connecting the roughly 5% that’s ML), pipeline jungles (organic growth in data preparation), dead experimental codepaths, and abstraction debt.
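One way to contain glue code is to hide each external package behind a small common interface, so the rest of the system depends on that interface rather than on the package's calling conventions. The following is a minimal sketch under that assumption; ThirdPartyModel, its run_inference method, and the Scorer protocol are hypothetical stand-ins, not a real library's API.

```python
from typing import Protocol, Sequence


class Scorer(Protocol):
    """The only interface the rest of the system is allowed to depend on."""

    def score(self, features: Sequence[float]) -> float: ...


class ThirdPartyModel:
    """Hypothetical external package with its own payload format."""

    def run_inference(self, payload: dict) -> dict:
        values = payload["x"]
        return {"prob": sum(values) / len(values)}


class ThirdPartyAdapter:
    """Keeps all package-specific glue in one place."""

    def __init__(self, model: ThirdPartyModel) -> None:
        self._model = model

    def score(self, features: Sequence[float]) -> float:
        return self._model.run_inference({"x": list(features)})["prob"]


def rank(items: dict, scorer: Scorer) -> list:
    """Order item names by descending score, using only the Scorer interface."""
    return sorted(items, key=lambda name: scorer.score(items[name]), reverse=True)


ranked = rank({"a": [0.1, 0.3], "b": [0.8, 0.6]}, ThirdPartyAdapter(ThirdPartyModel()))
print(ranked)
```

Swapping the underlying package then means writing a new adapter, while rank and its callers stay unchanged.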
Configuration Debt: In a mature system, lines of configuration can outnumber lines of code, and each line is a potential point of failure. Yet configuration is often treated as an afterthought rather than as first-class code.
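Treating configuration as first-class code can start with something as small as a typed, validated config object and a way to diff two configurations. The sketch below is one possible shape for that; the TrainingConfig fields, their valid ranges, and the config_diff helper are assumptions made for the example.

```python
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainingConfig:
    # Field names and valid ranges are hypothetical.
    learning_rate: float
    batch_size: int
    features: Tuple[str, ...]

    def __post_init__(self):
        # Fail at construction time instead of deep inside a training job.
        if not 0.0 < self.learning_rate < 1.0:
            raise ValueError(f"learning_rate out of range: {self.learning_rate}")
        if self.batch_size <= 0:
            raise ValueError(f"batch_size must be positive: {self.batch_size}")
        if len(set(self.features)) != len(self.features):
            raise ValueError("duplicate feature names in config")


def config_diff(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    old_d, new_d = asdict(old), asdict(new)
    return {key: (old_d[key], new_d[key]) for key in old_d if old_d[key] != new_d[key]}


baseline = TrainingConfig(0.01, 256, ("query_len", "user_country"))
candidate = TrainingConfig(0.05, 256, ("query_len", "user_country"))
changes = config_diff(baseline, candidate)
print(changes)
```

Because the object is validated on construction and easy to diff, a configuration change can be reviewed and tested like any other code change.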
External World Changes: Fixed thresholds become stale as input distributions shift. Monitoring ML behavior requires different approaches than monitoring traditional software.
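The staleness of a fixed threshold can be shown with a small sketch: a threshold chosen to flag a target fraction of scores on old data flags a different fraction once the score distribution shifts, while a threshold re-derived from recent held-out data does not. The score lists, the target rate, and the prediction_bias check are illustrative assumptions.

```python
def threshold_for_rate(scores, target_rate):
    """Score threshold that flags roughly `target_rate` of the given scores."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(target_rate * len(ranked)))
    return ranked[k - 1]


def flagged_rate(scores, threshold):
    """Fraction of scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def prediction_bias(predicted, observed):
    """Mean predicted value minus mean observed label; near zero when calibrated."""
    return sum(predicted) / len(predicted) - sum(observed) / len(observed)


# Made-up held-out scores before and after a distribution shift.
old_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
new_scores = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.92, 0.95, 0.99]

fixed = threshold_for_rate(old_scores, target_rate=0.2)
stale_rate = flagged_rate(new_scores, fixed)  # fixed threshold on shifted data
relearned = threshold_for_rate(new_scores, target_rate=0.2)
fresh_rate = flagged_rate(new_scores, relearned)

print(f"fixed threshold {fixed} flags {stale_rate:.0%} of new data")
print(f"relearned threshold {relearned} flags {fresh_rate:.0%} of new data")
```

The prediction_bias helper is included as one simple behavioral signal to monitor: a sustained gap between average predictions and average observed labels suggests the world has moved away from the training data.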
When to Use This Framework
Use this framework during system design, to anticipate categories of debt before they accrue; during architecture review, to check systematically for each debt type; and when prioritizing engineering investment, to understand which debt categories pose the greatest risk.
Limitations
This taxonomy was developed from large-scale production systems at Google circa 2015. Some debt types (like abstraction debt) may have evolved as the field developed better tooling. The framework doesn’t address newer concerns specific to foundation models and LLM-based systems.
Related: 05-molecule—ml-system-anti-patterns