LLM Annotation Suitability Framework
Overview
A decision framework for determining whether LLM annotation is appropriate for a given task, based on empirical patterns from large-scale evaluation studies.
Components
Signal Type Assessment
Use LLMs when:
- Target categories map to explicit textual signals (keywords, phrases, patterns)
- Annotation can be determined from the text alone
- Little external context or world knowledge is required
- The task is closer to extraction than interpretation
Avoid LLMs when:
- Annotation requires inference about unstated meaning
- Cultural, historical, or domain context is essential
- The concept is “latent” (not directly observable in the text)
- Human judgment typically varies even among experts (a checklist sketch of these criteria follows)
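The criteria above are qualitative, but they can be encoded as a rough pre-screening checklist. The sketch below is a hypothetical encoding; the field names and the all-or-nothing decision rule are illustrative assumptions, not part of the framework itself.

```python
# Hypothetical checklist encoding of the signal-type criteria; field names
# and the decision rule are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    explicit_textual_signals: bool   # categories map to keywords/phrases/patterns
    decidable_from_text_alone: bool  # no external context or world knowledge needed
    requires_inference: bool         # unstated meaning must be inferred
    context_dependent: bool          # cultural/historical/domain context essential
    latent_concept: bool             # concept not directly observable in the text
    expert_disagreement: bool        # human judgment varies even among experts

def llm_annotation_advisable(task: TaskProfile) -> bool:
    """True only when both 'use' criteria hold and no 'avoid' criterion applies."""
    use_ok = task.explicit_textual_signals and task.decidable_from_text_alone
    avoid_hit = (task.requires_inference or task.context_dependent
                 or task.latent_concept or task.expert_disagreement)
    return use_ok and not avoid_hit
```

A task that fails this screen should at minimum go through the inter-LLM agreement test below before any large-scale annotation run.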
Inter-LLM Agreement Test
Before committing to LLM annotation at scale:
- Run 3–5 different LLMs on a sample (n=100–500)
- Calculate pairwise intercoder reliability (Krippendorff’s alpha, not simple agreement)
- Interpret:
- α > 0.67: LLM annotation likely appropriate
- α 0.4–0.67: Proceed with caution, validate against expert sample
- α < 0.4: Task probably requires human annotation
High LLM-to-LLM agreement predicts high LLM-to-human agreement. Low LLM-to-LLM agreement signals fundamental ambiguity in the task.
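A minimal sketch of this test, assuming nominal categories coded as integers and the third-party `krippendorff` package; the model names and labels are hypothetical placeholders for real annotation output.

```python
# Inter-LLM agreement sketch: overall and pairwise Krippendorff's alpha.
# Assumes the `krippendorff` package (pip install krippendorff); model names
# and labels below are placeholders.
from itertools import combinations

import numpy as np
import krippendorff

# One row per LLM, one column per sampled item; use np.nan where a model
# failed to return a label. Values are nominal category codes.
annotations = {
    "model_a": [0, 1, 1, 2, 0, 1],
    "model_b": [0, 1, 1, 2, 0, 2],
    "model_c": [0, 1, 0, 2, 0, 1],
}

matrix = np.array(list(annotations.values()), dtype=float)
overall = krippendorff.alpha(reliability_data=matrix,
                             level_of_measurement="nominal")
print(f"overall alpha: {overall:.3f}")  # compare against the 0.67 / 0.4 thresholds

# Pairwise alphas help spot a single outlier model dragging agreement down.
for (name_i, row_i), (name_j, row_j) in combinations(annotations.items(), 2):
    pair_alpha = krippendorff.alpha(
        reliability_data=np.array([row_i, row_j], dtype=float),
        level_of_measurement="nominal",
    )
    print(f"{name_i} vs {name_j}: alpha = {pair_alpha:.3f}")
```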
Model Selection Criteria
| Factor | Recommendation |
|---|---|
| Minimum size | 12B parameters |
| Preferred size | 70B+ when available |
| Open vs. proprietary | Open-weight for reproducibility |
| Reasoning models | No advantage for standard annotation |
Validation Requirements
Even when LLMs appear suitable:
- Validate against an expert-coded sample (not a crowd-sourced one)
- Use chance-corrected reliability metrics
- Examine the confusion matrix for systematic category errors (see the sketch after this list)
- If using bias correction, budget for 600–1000 ground-truth samples
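A minimal validation sketch, assuming scikit-learn; the category names, expert labels, and LLM labels are hypothetical placeholders for an expert-coded sample.

```python
# Validation against an expert-coded sample: chance-corrected agreement plus a
# confusion matrix. Assumes scikit-learn; all labels below are placeholders.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

labels = ["policy", "campaign", "other"]  # hypothetical category set
expert = ["policy", "policy", "other", "campaign", "other", "policy"]
llm    = ["policy", "other",  "other", "campaign", "other", "policy"]

# Chance-corrected agreement (Cohen's kappa), not raw percent agreement.
print(f"Cohen's kappa: {cohen_kappa_score(expert, llm, labels=labels):.3f}")

# Rows = expert categories, columns = LLM categories; off-diagonal mass
# concentrated in one cell points to a systematic category error.
print(confusion_matrix(expert, llm, labels=labels))
```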
When to Use
- Scoping phase: deciding whether LLM annotation is viable
- Model selection: choosing among available LLMs
- Quality assurance: interpreting validation results
- Reporting: justifying annotation methodology
Limitations
This framework assumes annotation tasks with discrete categories. Continuous annotation (e.g., probability scores, ratings) introduces additional concerns about LLM calibration not addressed here.
The inter-LLM agreement test adds cost and complexity. For low-stakes or exploratory analysis, it may be acceptable to skip this step while acknowledging the limitation.
Related: [None yet]