The 12B Parameter Threshold

For annotation tasks, model size correlates with reliability, but with diminishing returns.

Models above ~12 billion parameters show consistent behavior:

  • Higher intercoder reliability (Krippendorff's α > 0.5) among themselves (see the reliability sketch after this list)
  • Less sensitivity to minor prompt variations
  • More stable annotation patterns across format changes
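
A minimal sketch of that reliability check, using the `krippendorff` package on nominal labels (the label encoding and model roster below are illustrative, not from the original measurements):

```python
import numpy as np
import krippendorff

# Rows are annotators (here, models); columns are annotated items.
# Nominal labels are encoded as integers; np.nan marks a missing annotation.
annotations = np.array([
    [0, 1, 2, 2, 0, 1, np.nan, 2],  # hypothetical 12B model
    [0, 1, 2, 1, 0, 1, 2,      2],  # hypothetical 70B model
    [0, 1, 2, 1, 0, 2, 2,      2],  # hypothetical proprietary model
])

alpha = krippendorff.alpha(
    reliability_data=annotations,
    level_of_measurement="nominal",
)
print(f"Krippendorff's alpha = {alpha:.3f}")  # threshold of interest: > 0.5
```

Krippendorff's α tolerates missing annotations and any number of coders, which makes it convenient for comparing several models at once.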

Smaller models (< 8B) exhibit:

  • Lower agreement with all other annotators (including other small models)
  • Greater sensitivity to prompt formatting (see the sensitivity check after this list)
  • More outlier predictions
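
One way to quantify that formatting sensitivity is to annotate the same items under trivially different prompt templates and count how often the label flips. A sketch under assumptions: `annotate()` is a hypothetical wrapper around whatever model API you use, and the templates differ only in whitespace.

```python
from itertools import combinations

def annotate(prompt: str) -> str:
    """Hypothetical wrapper: send `prompt` to the model, return its label."""
    raise NotImplementedError

# Templates that differ only in whitespace; a robust annotator should
# produce the same label for all of them.
PROMPT_VARIANTS = [
    "Label the sentiment of: {text}\nAnswer:",
    "Label the sentiment of: {text}\nAnswer: ",   # trailing space
    "Label the sentiment of:\n{text}\nAnswer:",   # extra newline
]

def flip_rate(items: list[str]) -> float:
    """Fraction of items whose label changes under any pair of variants."""
    flips = 0
    for text in items:
        labels = [annotate(v.format(text=text)) for v in PROMPT_VARIANTS]
        if any(a != b for a, b in combinations(labels, 2)):
            flips += 1
    return flips / len(items)
```

By the pattern above, models past the threshold should show a flip rate near zero on whitespace-level variations, while smaller models often will not.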

Above 70B parameters, the differences between proprietary and open-weight models largely disappear. Reasoning-enhanced models show no notable advantage on straightforward annotation tasks, and in some cases follow instructions slightly worse.

The practical guidance: treat 12B as the minimum for research-grade annotation; use 70B+ when available, but don’t pay a premium for reasoning capabilities you won’t use.

Related: [None yet]