LLM Annotation Reliability Gap
The systematic difference between LLM annotation performance on benchmarks versus real-world annotation tasks. LLMs often appear to match or exceed human annotators on standard tasks but fail in subtle ways on novel or domain-specific annotation.
The Gap
Benchmark Performance: LLMs achieve high agreement with human labels on established datasets.
Production Reality: Performance degrades on:
- Domain-specific terminology
- Edge cases not in training data
- Tasks requiring world knowledge beyond the model's training cutoff
- Subjective judgments requiring calibration
Why It Happens
- Training Data Overlap: Benchmarks may be in training data
- Distribution Shift: Real annotation tasks differ from benchmark distributions
- No Calibration: LLMs can't be calibrated to your specific annotation guidelines the way human annotators are through training rounds
- Confidence Without Competence: LLMs sound equally confident on easy and hard items, masking genuine uncertainty
Implications
- Don’t assume benchmark performance transfers
- Always validate on held-out examples from your specific task (see the sketch after this list)
- Treat LLM annotation as a first pass, not the final label
- Monitor for drift over time
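A minimal sketch of the "validate on held-out examples" step, assuming scikit-learn is available. The `llm_annotate` callable and the 0.7 kappa threshold are illustrative placeholders, not part of the original note.

```python
# Sketch: compare LLM labels to a human-labeled holdout from your own task
# before trusting the LLM as a first-pass annotator.
from sklearn.metrics import cohen_kappa_score


def agreement_report(human_labels, llm_labels):
    """Agreement between human gold labels and LLM labels on the same items."""
    assert len(human_labels) == len(llm_labels)
    matches = sum(h == m for h, m in zip(human_labels, llm_labels))
    return {
        "percent_agreement": matches / len(human_labels),
        # Kappa corrects for chance agreement, which matters for skewed label sets.
        "cohens_kappa": cohen_kappa_score(human_labels, llm_labels),
    }


def validate_llm_annotator(holdout_items, holdout_gold, llm_annotate, kappa_threshold=0.7):
    """Run the LLM over a held-out sample and decide if it is usable as a first pass."""
    llm_labels = [llm_annotate(item) for item in holdout_items]
    report = agreement_report(holdout_gold, llm_labels)
    report["usable_as_first_pass"] = report["cohens_kappa"] >= kappa_threshold
    return report
```

Re-running this on a fresh human-labeled sample at regular intervals (or whenever the underlying model changes) doubles as a simple drift monitor: a falling kappa signals that the first-pass labels need heavier human review.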
Related: 05-atom—uniform-confidence-problem, 03-molecule—annotation-task-suitability-framework