The Benchmark-to-Production Gap in NLP

A highlight classifier achieved a ROC AUC of 0.93-0.94 on individual clips. But when applied to real documents through sampling methods, performance dropped sharply: mean ROC AUC fell to 0.50-0.67 depending on document length and h-coverage.

The problem: classifiers are trained and evaluated on pre-segmented samples. Real documents require segmentation before classification. The segmentation strategy introduces errors that don’t show up in benchmarks.
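To make the two settings concrete, here is a minimal sketch of the difference, not the original system's code: `classifier`, `segment_document`, and the gold-span overlap check are all hypothetical placeholders.

```python
# Hypothetical sketch of the two evaluation settings. All names here
# (classifier, segment_document, gold_highlights) are placeholders,
# not the original codebase's API.
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def _overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share any characters."""
    return a[0] < b[1] and b[0] < a[1]


def benchmark_eval(classifier: Callable[[str], float],
                   clips: List[str],
                   labels: List[int]) -> List[Tuple[float, int]]:
    """Setting 1: score pre-segmented clips directly.
    This is what the 0.93-0.94 benchmark number measures."""
    return [(classifier(clip), label) for clip, label in zip(clips, labels)]


def production_eval(classifier: Callable[[str], float],
                    document: str,
                    segment_document: Callable[[str], List[Span]],
                    gold_highlights: List[Span]) -> List[Tuple[float, int]]:
    """Setting 2: segment the raw document first, then score each span.
    Bad boundaries and missed spans are introduced here, before the
    classifier ever sees the text, so they never appear in Setting 1."""
    scored = []
    for start, end in segment_document(document):
        score = classifier(document[start:end])
        label = int(any(_overlaps((start, end), g) for g in gold_highlights))
        scored.append((score, label))
    return scored
```

The point of the sketch: the classifier is identical in both settings; only the second pipeline inherits the segmenter's mistakes.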

Four sampling strategies were tested (sequential sentences, h-score non-overlap, weighted h-score, positive summation). Each had different precision/recall tradeoffs. None preserved the standalone classifier’s performance.
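As a rough illustration, here is a sketch of only the simplest strategy (sequential sentence windows); the `window` and `threshold` parameters are assumptions, and the three h-score-based strategies are not reproduced because their scoring details aren't described here.

```python
# Hedged sketch of a sequential-sentence sampling strategy.
# window and threshold are illustrative assumptions, not the tested values.
from typing import Callable, List


def sequential_sentence_spans(sentences: List[str],
                              classifier: Callable[[str], float],
                              window: int = 3,
                              threshold: float = 0.5) -> List[str]:
    """Score consecutive windows of sentences and keep those above threshold."""
    selected = []
    for i in range(0, len(sentences), window):
        chunk = " ".join(sentences[i:i + window])
        if classifier(chunk) >= threshold:
            selected.append(chunk)
    return selected
```

Fixed windows keep recall reasonable but split or dilute true highlights that cross window boundaries, which is one way a strong clip-level classifier loses ground at the document level.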

This is a pattern worth watching for. Many NLP benchmarks test models on conveniently sized inputs. Production systems face messier data that requires preprocessing steps whose errors are not reflected in benchmark scores.

The model was great. The model-plus-sampling-system was mediocre. The gap is where deployment difficulties hide.

Related: [None yet]