ML System Anti-Patterns

Context

Production ML systems tend to drift toward recognizable design anti-patterns that increase long-term maintenance cost. These patterns are avoidable and worth naming explicitly.

The Problems

Glue Code: Using a general-purpose ML package often results in a massive amount of supporting code written just to get data into and out of it. This glue code freezes the system to one package's peculiarities, making alternatives prohibitively expensive to test.

The ratio matters: a mature system might be 5% ML code and 95% glue. At that point, building a clean native solution may cost less than maintaining the generic package integration.
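A minimal sketch of the shape this takes, using a hypothetical click-prediction function; the feature names and placeholder weights are invented, and the single matrix product stands in for the package's predict call. Note how little of the code is ML:

```python
import numpy as np

def predict_clicks(raw_events: list[dict]) -> dict[str, float]:
    # Glue: flatten our event dicts into the dense, column-ordered float
    # matrix the package insists on, imputing missing values by hand.
    FEATURE_ORDER = ["age", "dwell_ms", "prior_ctr"]
    rows = [[float(ev.get(name, 0.0)) for name in FEATURE_ORDER]
            for ev in raw_events]
    X = np.asarray(rows, dtype=np.float64)

    # The only "ML" line: a stand-in for some_pkg.Model.predict(X).
    scores = X @ np.array([0.01, 0.002, 1.5])

    # Glue: translate the package's positional output back into our IDs.
    return {ev["user_id"]: float(s) for ev, s in zip(raw_events, scores)}
```

Every input source and every consumer accretes conversions like these, and each one hard-codes the package's conventions.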

Pipeline Jungles: Data preparation pipelines evolve organically as new signals are identified and new sources are added. Without holistic design, the result is a jungle of scrapes, joins, and sampling steps, often with intermediate files in between. Detecting errors and recovering from failures becomes expensive, and testing typically requires end-to-end integration tests.

Pipeline jungles can only be fixed by thinking holistically about data collection and feature extraction, sometimes requiring a ground-up redesign.
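One way to make the holistic design concrete is to declare the pipeline as a list of small, individually testable stages instead of ad-hoc scripts writing intermediate files. A minimal sketch; the stage names and the stub join are hypothetical:

```python
from typing import Callable, Iterable

Stage = Callable[[Iterable[dict]], Iterable[dict]]

def run_pipeline(records: Iterable[dict], stages: list[Stage]) -> list[dict]:
    out = list(records)
    for stage in stages:
        # Materialize each step so a failure surfaces at a named stage,
        # not somewhere deep in a tangle of scrapes and joins.
        out = list(stage(out))
    return out

def drop_bots(rs):
    return (r for r in rs if not r.get("is_bot"))

def join_profiles(rs):
    return ({**r, "age": 30} for r in rs)  # stub for a real profile join

def sample_10pct(rs):
    return (r for i, r in enumerate(rs) if i % 10 == 0)

raw = [{"user_id": f"u{i}", "is_bot": i % 7 == 0} for i in range(100)]
features = run_pipeline(raw, [drop_bots, join_profiles, sample_10pct])
```

Each stage can be unit-tested in isolation, which is exactly what the jungle makes impossible.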

Dead Experimental Codepaths: Short-term experiments are often implemented as conditional branches within production code. Each individual change seems low-cost since no infrastructure rework is needed, but the accumulated branches interact, and maintaining backward compatibility across their combinations grows exponentially harder over time.

Knight Capital’s $465 million loss in 45 minutes was traced partly to unexpected behavior from obsolete experimental codepaths.
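One lightweight guard, sketched below with invented flag names: register every experimental branch with an owner and an expiry date, so that periodic audits are mechanical and expired flags fail loudly instead of lingering as dead code.

```python
import datetime

# Hypothetical registry: every experimental branch in production code
# must be declared here before it can be checked.
EXPERIMENTS = {
    "new_ranker_v2":  {"owner": "alice", "expires": datetime.date(2026, 6, 1)},
    "calibrated_ctr": {"owner": "bob",   "expires": datetime.date(2026, 3, 1)},
}

def experiment_enabled(name: str, today: datetime.date | None = None) -> bool:
    today = today or datetime.date.today()
    exp = EXPERIMENTS.get(name)
    if exp is None:
        raise KeyError(f"unregistered experiment: {name}")
    if today > exp["expires"]:
        # Fail loudly: delete the branch rather than let it go stale.
        raise RuntimeError(f"{name} expired on {exp['expires']}; remove it")
    return True
```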

Abstraction Debt: The field lacks strong abstractions for ML systems. Nothing approaches the success of the relational database as a basic abstraction. The widespread use of MapReduce for ML was driven by the absence of better alternatives, not by its fitness for iterative algorithms.

The Solutions

Wrap black-box packages in common APIs so that one package can be swapped for another. Design data pipelines holistically from the start. Periodically audit production code and remove dead experimental branches. Push for better abstractions even when expedience suggests otherwise.
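For the first of these, a minimal sketch of a common API, assuming scikit-learn as the wrapped package; the Scorer protocol and adapter names are illustrative:

```python
from typing import Protocol, Sequence

class Scorer(Protocol):
    """The only interface production code is allowed to see."""
    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> None: ...
    def predict(self, X: Sequence[Sequence[float]]) -> list[float]: ...

class SklearnScorer:
    """Adapter that confines scikit-learn's conventions to one file."""
    def __init__(self):
        from sklearn.linear_model import LogisticRegression
        self._model = LogisticRegression()

    def fit(self, X, y):
        self._model.fit(X, y)

    def predict(self, X):
        # Package peculiarity (predict_proba's column layout) stays here,
        # so swapping packages means writing one new adapter, not a rewrite.
        return self._model.predict_proba(X)[:, 1].tolist()
```

Testing an alternative package then costs one adapter class rather than a sweep through every call site.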

Consequences of Ignoring

Systems become increasingly difficult and risky to change. Testing costs escalate. Team velocity drops over time even as the team grows. New engineers require months to become productive.

The Root Cause

Often, these anti-patterns stem from separating “research” and “engineering” roles too sharply. When ML packages are developed in isolation, they appear as black boxes to the teams deploying them. Embedded teams where researchers and engineers work together, often as the same people, reduce this friction.

Related: 05-molecule—ml-technical-debt-taxonomy, 05-atom—deploy-maintain-dichotomy