Yang, Wang, Zhou & Xu (2025) — LLM Annotation Evaluation
Summary
Large-scale empirical evaluation of using LLMs for data annotation in political science research. The authors re-annotated 14 published studies using 15 different LLMs (300+ million annotations) to assess reliability, downstream effects, and mitigation strategies.
Core Framing
The paper treats all annotators (human, supervised model, or LLM) as entities with distinct biases, each subject to measurement error. This shifts the framing from "LLM accuracy" to "annotator reliability."
Key Findings
- Low intercoder reliability: LLM-to-human/supervised Krippendorff’s alpha ranges 0.12–0.41 (recommended threshold is ≥0.8)
- Moderate LLM-to-LLM agreement: Pairwise alpha 0.16–0.69, suggesting distinct biases across models
- Downstream consequences: Different LLM annotations change original study conclusions 37% of the time
- Estimate variability: Standard deviation of LLM-derived estimates is 1.9× the mean estimate
- Positive correlation: When LLMs agree with each other, they also tend to agree with human annotators
- Task characteristics matter: High agreement on explicit textual evidence; low on inferential tasks
- Model size threshold: models with 12b+ parameters produce more reliable annotations
- In-context learning: Marginal improvement, plateaus at 5-shot
- Bias correction: DSL method works but requires 600–1000 ground truth samples
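A minimal sketch of the design-based bias-correction idea behind DSL, for the simple case of estimating a proportion from LLM labels plus a random expert-coded subsample (800 documents here, within the 600–1000 range noted above); the simulated numbers, error rate, and variable names are illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: N documents, LLM labels for all, expert labels for a random subsample.
N = 5_000
true_labels = rng.binomial(1, 0.30, size=N)            # unobserved ground truth
llm_labels = np.where(rng.random(N) < 0.85,            # assume the LLM matches truth 85% of the time
                      true_labels, 1 - true_labels)

pi = 800 / N                                           # known sampling probability for expert coding
expert_coded = rng.random(N) < pi                      # indicator R_i of the expert-coded subsample

# Naive estimate uses LLM labels alone; the design-based correction adds an
# inverse-probability-weighted adjustment on the expert-coded units.
naive = llm_labels.mean()
pseudo = llm_labels + (expert_coded / pi) * (true_labels - llm_labels)
corrected = pseudo.mean()

print(f"truth     : {true_labels.mean():.3f}")
print(f"LLM-only  : {naive:.3f}")   # biased upward (~0.36) because LLM errors are asymmetric in aggregate
print(f"corrected : {corrected:.3f}")  # close to the true 0.30
```

The correction stays (approximately) unbiased regardless of how poor the LLM labels are; the quality of the LLM only affects the variance of the corrected estimate.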
Methodology
- 14 studies from APSR, AJPS, JOP, BJPS, PSRM (2018–2025)
- 15 LLMs: 4b–120b parameters, proprietary and open-weight
- Standardized prompt design with markdown structure
- Deterministic inference using the vLLM engine
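A minimal sketch of what deterministic, markdown-structured annotation with vLLM could look like; the model name, prompt template, and label set are placeholders, not the paper's actual protocol:

```python
from vllm import LLM, SamplingParams

# Fixed seed plus greedy decoding (temperature=0) makes repeated runs reproducible.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", seed=0)
greedy = SamplingParams(temperature=0.0, max_tokens=8)

# Markdown-structured prompt template (illustrative).
prompt = (
    "## Task\nLabel the statement as POSITIVE or NEGATIVE.\n\n"
    "## Statement\n{text}\n\n"
    "## Label\n"
)

texts = ["The reform passed with broad support.", "The bill was widely criticized."]
outputs = llm.generate([prompt.format(text=t) for t in texts], greedy)
labels = [o.outputs[0].text.strip() for o in outputs]
print(labels)
```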
Recommendations
- Use LLMs selectively (explicit evidence tasks, not latent concepts)
- Prefer large open-weight models (≥12b parameters)
- Ensure reproducibility through deterministic inference
- Validate against expert-coded samples
- Account for measurement error in downstream analysis
Extracted Content
- 05-atom—llm-annotation-reliability-gap
- 05-atom—llm-agreement-predicts-human-agreement
- 03-molecule—annotation-task-suitability-framework
- 03-atom—researcher-degrees-of-freedom-llm-choice
Notes
The discrepancy between simple agreement rate and intercoder reliability (high agreement but low alpha) is explained by class imbalance in annotation datasets: Krippendorff's alpha corrects for chance agreement, revealing systematic disagreement that raw agreement masks.
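A small worked example of that point, using a hand-rolled two-coder nominal Krippendorff's alpha on a hypothetical imbalanced task (a rare positive class): raw agreement is 88% while alpha lands near 0.34, inside the paper's reported range.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal labels, no missing data."""
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1   # each unit contributes both ordered pairs
        coincidences[(b, a)] += 1
    totals = Counter()              # marginal totals per category
    for (c, _), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())        # total pairable values = 2 * number of units
    # Observed vs. chance-expected disagreement over the coincidence matrix.
    d_obs = sum(cnt for (c, k), cnt in coincidences.items() if c != k) / n
    d_exp = sum(totals[c] * totals[k] for c, k in permutations(totals, 2)) / (n * (n - 1))
    return 1 - d_obs / d_exp

# Hypothetical imbalanced task: 100 items, positive class is rare.
coder_a = [1] * 10 + [0] * 90                       # flags items 1-10 as positive
coder_b = [1] * 4 + [0] * 6 + [1] * 6 + [0] * 84    # agrees on 4 of those, flags 6 others

raw_agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
print(f"raw agreement: {raw_agreement:.2f}")                          # 0.88
print(f"alpha        : {krippendorff_alpha_nominal(coder_a, coder_b):.2f}")  # ~0.34
```

Because most items share the majority class, the two coders agree 88% of the time by default, but alpha discounts that chance agreement and exposes how little they agree on the rare class that the study actually cares about.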