The Benchmark-to-Production Gap in NLP
A highlight classifier achieved ROC AUC of 0.93-0.94 on individual clips. But when applied to real documents through sampling methods, performance dropped sharply: mean ROC AUC fell to 0.50-0.67, depending on document length and h-coverage.
The problem: classifiers are trained and evaluated on pre-segmented samples. Real documents require segmentation before classification. The segmentation strategy introduces errors that don’t show up in benchmarks.
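A minimal sketch may make the gap concrete. Everything here is a hypothetical stand-in: `clf` is assumed to be a scikit-learn-style text classifier pipeline, and `segmenter`/`labeler` are placeholder functions for whatever segmentation and gold-label lookup a real system would use; only `roc_auc_score` is a real dependency.

```python
from sklearn.metrics import roc_auc_score

def benchmark_eval(clf, clips, labels):
    """Benchmark setting: clips arrive pre-segmented and pre-labeled."""
    scores = [clf.predict_proba([c])[0][1] for c in clips]
    return roc_auc_score(labels, scores)

def production_eval(clf, documents, segmenter, labeler):
    """Production setting: each document must be segmented first, so any
    segmentation error is folded into the measured AUC."""
    scores, labels = [], []
    for doc in documents:
        for segment in segmenter(doc):            # e.g. sliding sentence windows
            scores.append(clf.predict_proba([segment])[0][1])
            labels.append(labeler(doc, segment))  # gold label for the produced span
    return roc_auc_score(labels, scores)
```

The two functions call the same classifier; only the second one measures the segmentation step along with it, which is where the 0.93 vs 0.50-0.67 gap lives.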
Four sampling strategies were tested (sequential sentences, h-score non-overlap, weighted h-score, positive summation). Each had different precision/recall tradeoffs. None preserved the standalone classifier’s performance.
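The strategies aren't spelled out here, but the simplest can be sketched under an assumption: that "sequential sentences" means scoring contiguous windows of sentences and greedily keeping the top-scoring non-overlapping ones. `clf` is again a hypothetical classifier; `sent_tokenize` is NLTK's real sentence splitter (requires the punkt model to be downloaded).

```python
from nltk.tokenize import sent_tokenize

def sequential_sentence_candidates(document, clf, k=3):
    """Score every contiguous window of k sentences, then greedily keep
    the highest-scoring windows that do not share sentences."""
    sentences = sent_tokenize(document)
    windows = [" ".join(sentences[i:i + k])
               for i in range(max(len(sentences) - k + 1, 1))]
    scored = sorted(
        ((clf.predict_proba([w])[0][1], i, w) for i, w in enumerate(windows)),
        reverse=True,
    )
    picked, used = [], set()
    for score, i, w in scored:
        span = set(range(i, i + k))
        if used.isdisjoint(span):   # no sentence reused across picks
            picked.append((score, w))
            used |= span
    return picked
```

Even in this sketch the tradeoff is visible: the window size and the non-overlap rule decide which spans the classifier ever gets to see, before classification quality matters at all.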
This is a pattern worth watching for. Many NLP benchmarks test models on conveniently sized inputs. Production systems face messier data that requires preprocessing steps whose errors aren't reflected in benchmark performance.
The model was great. The model-plus-sampling-system was mediocre. The gap is where deployment difficulties hide.
Related: [None yet]