Common Words Matter More Than You’d Expect

TF-IDF weighting didn’t outperform simple count vectorization for highlight classification. This is counterintuitive: TF-IDF is supposed to downweight common words and elevate distinctive ones.
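
A minimal sketch of what this comparison might look like with scikit-learn; the toy corpus, labels, and pipeline below are placeholders for illustration, not the original experiment:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus: 1 = highlight-worthy moment, 0 = routine talk.
texts = [
    "I just felt so lost when the signup flow reset on me",
    "honestly I was thrilled, I didn't expect it to work for me",
    "I kept thinking I must be doing something wrong",
    "that moment really changed how I think about the product",
    "the dashboard shows the weekly report",
    "you click export and it downloads a csv",
    "the settings page has three tabs",
    "pricing is listed on the website",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Same classifier, two vectorizers: raw counts vs. TF-IDF weights.
for name, vec in [("counts", CountVectorizer()), ("tf-idf", TfidfVectorizer())]:
    clf = make_pipeline(vec, LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, texts, labels, cv=4, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```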

The implication: common words, the ones that appear in multiple documents, are at least as important for distinguishing highlights from non-highlights as rare, “distinctive” words.

One explanation connects to psychology research (Pennebaker, 2011): people’s use of function words, especially first-person singular pronouns like “I,” shifts with their psychological state. In market research transcripts, the moments where people change how they express themselves may be exactly what makes content highlight-worthy.
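
If that’s right, the rate of function words is itself a candidate feature. A rough sketch, assuming a hand-picked pronoun list (not Pennebaker’s actual instrument):

```python
import re

# First-person singular pronouns -- an assumed, hand-picked list.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}

def fps_rate(text: str) -> float:
    """Fraction of tokens that are first-person singular pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in FIRST_PERSON_SINGULAR for t in tokens) / len(tokens)

print(fps_rate("I told them I felt my feedback was ignored"))  # ~0.33
print(fps_rate("The export feature works as documented"))       # 0.0
```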

Another explanation: TF-IDF is designed to measure how similar one document is to the rest of a corpus, but classification tries to draw boundaries between clusters. Smoothing that helps with similarity may blur exactly the boundaries you’re trying to find.
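
The weighting itself shows why. Assuming scikit-learn’s default smoothed IDF, idf(t) = ln((1 + n) / (1 + df)) + 1, a word appearing in every document keeps a weight near 1 instead of dropping to 0, while a rare word is scaled up several-fold; it’s this rescaling, not removal, that the classifier inherits:

```python
import math

n = 1000  # corpus size (made up for illustration)
for df in (1000, 500, 100, 10, 1):  # number of documents containing the term
    idf = math.log((1 + n) / (1 + df)) + 1  # scikit-learn's smoothed IDF
    print(f"df={df:4d}  idf={idf:.2f}")
# df=1000 -> ~1.00, df=1 -> ~7.22: common words are compressed toward
# a floor of 1 and rare words amplified -- helpful for retrieval, but it
# can shrink exactly the common-word signal a decision boundary relies on.
```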

This suggests being cautious about blindly applying standard NLP preprocessing. The “best practice” may not fit the problem.

Related: [None yet]