The 12B Parameter Threshold
For annotation tasks, model size correlates with reliability, but with diminishing returns.
Models above ~12 billion parameters show consistent behavior:
- Higher intercoder reliability (α > 0.5) among themselves (see the agreement sketch after this list)
- Less sensitivity to minor prompt variations
- More stable annotation patterns across format changes
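A minimal sketch of the agreement computation, assuming α here refers to Krippendorff's alpha (the usual chance-corrected agreement coefficient in annotation studies). The model names and labels are made up, and the implementation only handles complete data (every model labels every item):

```python
import numpy as np

def krippendorff_alpha_nominal(ratings: np.ndarray) -> float:
    """Krippendorff's alpha for nominal labels with complete data.

    ratings: (n_annotators, n_items) matrix of integer-coded labels,
    one row per model acting as an annotator.
    """
    n_annotators, _ = ratings.shape
    categories = np.unique(ratings)
    # Per-item category counts: shape (n_items, n_categories).
    counts = np.stack([(ratings == c).sum(axis=0) for c in categories], axis=1)
    # Coincidence matrix of within-item label pairs, normalised by (m - 1).
    o = (counts.T @ counts - np.diag(counts.sum(axis=0))) / (n_annotators - 1)
    n_c = o.sum(axis=1)   # marginal totals per category
    n = n_c.sum()         # total number of pairable values
    observed_disagreement = n - np.trace(o)
    expected_disagreement = (n ** 2 - (n_c ** 2).sum()) / (n - 1)
    return 1.0 - observed_disagreement / expected_disagreement

# Hypothetical labels from three models on the same ten items.
labels = {
    "model_a_12b": [1, 0, 2, 1, 1, 0, 2, 2, 1, 0],
    "model_b_14b": [1, 0, 2, 1, 0, 0, 2, 2, 1, 0],
    "model_c_70b": [1, 0, 2, 1, 1, 0, 2, 1, 1, 0],
}
alpha = krippendorff_alpha_nominal(np.array(list(labels.values())))
print(f"alpha = {alpha:.2f}")  # > 0.5 would meet the threshold above
```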
Smaller models (< 8B) exhibit:
- Lower agreement with all other annotators (including other small models)
- Greater sensitivity to prompt formatting (see the stability check after this list)
- More outlier predictions
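One way to make the prompt-sensitivity claim measurable: annotate the same items under superficially different prompt templates and compare the resulting label sets. A sketch using plain percent agreement, with a hypothetical `annotate(model, template, items)` call standing in for whatever inference wrapper is actually in use:

```python
from itertools import combinations

# Hypothetical prompt variants that differ only in formatting, not meaning.
PROMPT_VARIANTS = [
    "Label the following text as positive, negative, or neutral:\n{text}",
    "Text: {text}\nSentiment (positive/negative/neutral):",
    "### Task\nClassify the sentiment of this text.\n### Text\n{text}",
]

def prompt_stability(annotate, model: str, items: list[str]) -> float:
    """Mean pairwise agreement of one model with itself across prompt variants.

    `annotate(model, template, items)` is assumed to return one label per item.
    """
    runs = [annotate(model, template, items) for template in PROMPT_VARIANTS]
    scores = []
    for run_a, run_b in combinations(runs, 2):
        matches = sum(a == b for a, b in zip(run_a, run_b))
        scores.append(matches / len(items))
    return sum(scores) / len(scores)

# Expectation under the claims above: stability near 1.0 for 12B+ models,
# noticeably lower for sub-8B models.
```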
Above 70B parameters, the differences between proprietary and open-weight models largely disappear. Reasoning-enhanced models show no notable advantage on straightforward annotation tasks and, in some cases, slightly worse instruction-following.
The practical guidance: treat ~12B as the minimum for research-grade annotation; use 70B+ if available, but don't pay a premium for reasoning capabilities you won't use.
Related: [None yet]