Yang, Wang, Zhou & Xu (2025) — LLM Annotation Evaluation
Summary
Large-scale empirical evaluation of using LLMs for data annotation in political science research. The authors re-annotated 14 published studies using 15 different LLMs (300+ million annotations) to assess reliability, downstream effects, and mitigation strategies.
Core Framing
The paper treats all annotators (human, supervised model, or LLM) as entities with distinct biases, each subject to measurement error. This shifts the framing from "LLM accuracy" to "annotator reliability."
Key Findings
- Low intercoder reliability: LLM-to-human/supervised Krippendorff’s alpha ranges 0.12–0.41 (recommended threshold is ≥0.8)
- Moderate LLM-to-LLM agreement: Pairwise alpha 0.16–0.69, suggesting distinct biases across models
- Downstream consequences: Different LLM annotations change original study conclusions 37% of the time
- Estimate variability: Standard deviation of LLM-derived estimates is 1.9× the mean estimate
- Positive correlation: When LLMs agree with each other, they also tend to agree with human annotators
- Task characteristics matter: High agreement on explicit textual evidence; low on inferential tasks
- Model size threshold: models with 12b+ parameters produce more reliable annotations
- In-context learning: Marginal improvement, plateaus at 5-shot
- Bias correction: DSL method works but requires 600–1000 ground truth samples
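A minimal sketch of the design-based bias-correction idea behind DSL, for the simple case of estimating a proportion from LLM labels plus a random expert-coded subsample (800 documents here, within the 600–1000 range noted above); the simulated numbers, error rate, and variable names are illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: N documents, LLM labels for all, expert labels for a random subsample.
N = 5_000
true_labels = rng.binomial(1, 0.30, size=N)            # unobserved ground truth
llm_labels = np.where(rng.random(N) < 0.85,            # assume the LLM matches truth 85% of the time
                      true_labels, 1 - true_labels)

pi = 800 / N                                           # known sampling probability for expert coding
expert_coded = rng.random(N) < pi                      # indicator R_i of the expert-coded subsample

# Naive estimate uses LLM labels alone; the design-based correction adds an
# inverse-probability-weighted adjustment on the expert-coded units.
naive = llm_labels.mean()
pseudo = llm_labels + (expert_coded / pi) * (true_labels - llm_labels)
corrected = pseudo.mean()

print(f"truth     : {true_labels.mean():.3f}")
print(f"LLM-only  : {naive:.3f}")   # biased upward (~0.36) because LLM errors are asymmetric in aggregate
print(f"corrected : {corrected:.3f}")  # close to the true 0.30
```

The correction stays (approximately) unbiased regardless of how poor the LLM labels are; the quality of the LLM only affects the variance of the corrected estimate.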
Methodology
- 14 studies from APSR, AJPS, JOP, BJPS, PSRM (2018–2025)
- 15 LLMs: 4b–120b parameters, proprietary and open-weight
- Standardized prompt design with markdown structure
- Deterministic inference using the vLLM engine
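A minimal sketch of what deterministic, markdown-structured annotation with vLLM could look like; the model name, prompt template, and label set are placeholders, not the paper's actual protocol:

```python
from vllm import LLM, SamplingParams

# Fixed seed plus greedy decoding (temperature=0) makes repeated runs reproducible.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", seed=0)
greedy = SamplingParams(temperature=0.0, max_tokens=8)

# Markdown-structured prompt template (illustrative).
prompt = (
    "## Task\nLabel the statement as POSITIVE or NEGATIVE.\n\n"
    "## Statement\n{text}\n\n"
    "## Label\n"
)

texts = ["The reform passed with broad support.", "The bill was widely criticized."]
outputs = llm.generate([prompt.format(text=t) for t in texts], greedy)
labels = [o.outputs[0].text.strip() for o in outputs]
print(labels)
```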
Recommendations
- Use LLMs selectively (explicit evidence tasks, not latent concepts)
- Prefer large open-weight models (≥12b parameters)
- Ensure reproducibility through deterministic inference
- Validate against expert-coded samples
- Account for measurement error in downstream analysis
Extracted Content
- 05-atom—llm-annotation-reliability-gap
- 05-atom—llm-agreement-predicts-human-agreement
- 03-molecule—annotation-task-suitability-framework
- 03-atom—researcher-degrees-of-freedom-llm-choice
Notes
The discrepancy between simple agreement rate and intercoder reliability (high agreement but low alpha) is explained by class imbalance in annotation datasets: Krippendorff's alpha corrects for chance agreement, revealing systematic disagreement that raw agreement masks.
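A small worked example of that point, using a hand-rolled two-coder nominal Krippendorff's alpha on a hypothetical imbalanced task (a rare positive class): raw agreement is 88% while alpha lands near 0.34, inside the paper's reported range.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal labels, no missing data."""
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1   # each unit contributes both ordered pairs
        coincidences[(b, a)] += 1
    totals = Counter()              # marginal totals per category
    for (c, _), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())        # total pairable values = 2 * number of units
    # Observed vs. chance-expected disagreement over the coincidence matrix.
    d_obs = sum(cnt for (c, k), cnt in coincidences.items() if c != k) / n
    d_exp = sum(totals[c] * totals[k] for c, k in permutations(totals, 2)) / (n * (n - 1))
    return 1 - d_obs / d_exp

# Hypothetical imbalanced task: 100 items, positive class is rare.
coder_a = [1] * 10 + [0] * 90                       # flags items 1-10 as positive
coder_b = [1] * 4 + [0] * 6 + [1] * 6 + [0] * 84    # agrees on 4 of those, flags 6 others

raw_agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)
print(f"raw agreement: {raw_agreement:.2f}")                          # 0.88
print(f"alpha        : {krippendorff_alpha_nominal(coder_a, coder_b):.2f}")  # ~0.34
```

Because most items share the majority class, the two coders agree 88% of the time by default, but alpha discounts that chance agreement and exposes how little they agree on the rare class that the study actually cares about.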