Punctuation Dramatically Improves Sequential NLP Models

Including punctuation in input sequences was the single biggest performance driver for LSTM-based highlight classification. Even the worst-performing models trained with punctuation outperformed the best models trained without it.

This makes sense in retrospect: stripping punctuation removes much of the grammatical structure from text, such as sentence and clause boundaries or the difference between a statement and a question. For models that process sequential input (RNNs, LSTMs, Transformers), that structure provides crucial context.
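
As a rough sketch (the tokenizer actually used in these experiments isn't shown here), one way to keep punctuation available to a sequential model is to emit each mark as its own token:

```python
import re

def tokenize(text: str, keep_punctuation: bool = True) -> list[str]:
    """Hypothetical tokenizer: words stay intact and, optionally,
    each punctuation mark becomes its own token for the LSTM to see."""
    if keep_punctuation:
        return re.findall(r"\w+|[^\w\s]", text)
    # Stripped baseline: words only.
    return re.findall(r"\w+", text)

print(tokenize("Wait, he actually scored?!"))
# ['Wait', ',', 'he', 'actually', 'scored', '?', '!']
print(tokenize("Wait, he actually scored?!", keep_punctuation=False))
# ['Wait', 'he', 'actually', 'scored']
```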

Notably, this effect was specific to sequential models. GBM (gradient boosting) models showed no improvement from punctuation, because bag-of-words representations lose sequence information anyway.
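
A toy illustration of why (a made-up example, not data from the experiments): once tokens are reduced to counts, a punctuation mark is just one more feature, and where it occurred in the sequence is lost.

```python
from collections import Counter

# Two token sequences with very different structure...
a = ["no", "way", ",", "he", "scored", "!"]
b = ["he", "!", "no", "scored", ",", "way"]  # scrambled variant

# ...collapse to identical bag-of-words counts, so a GBM trained on
# these features cannot tell them apart.
print(Counter(a) == Counter(b))  # True
```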

The practical takeaway: preprocessing pipelines that strip punctuation may be discarding signal that matters. The decision to remove punctuation should be deliberate, not a default.
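
For instance (an observation about common tooling, not part of the original experiments), the legacy Keras text Tokenizer silently strips most punctuation through its default `filters` argument; keeping punctuation is an explicit opt-out:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Punctuation is space-separated here because the Tokenizer splits on spaces.
texts = ["Wait , he scored ? !"]

default_tok = Tokenizer()            # default filters remove punctuation
default_tok.fit_on_texts(texts)
print(list(default_tok.word_index))  # ['wait', 'he', 'scored']

keep_tok = Tokenizer(filters="")     # deliberate choice: keep punctuation tokens
keep_tok.fit_on_texts(texts)
print(list(keep_tok.word_index))     # ['wait', ',', 'he', 'scored', '?', '!']
```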

Related: [None yet]