LLM Agreement Predicts Human Agreement
When multiple LLMs agree on an answer, humans are more likely to agree too. LLM disagreement often signals genuine ambiguity rather than model error.
The Finding
Inter-LLM agreement correlates with:
- Human annotator agreement (see the measurement sketch after this list)
- Task difficulty
- Ambiguity in the underlying question
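A minimal sketch of how the first correlation could be measured, assuming each item has answers from several LLMs and labels from several human annotators. The majority-share agreement metric, the Spearman rank correlation, and the toy data are illustrative assumptions, not details from this note.

```python
from collections import Counter

from scipy.stats import spearmanr


def majority_share(labels):
    """Fraction of labels on one item that match its majority label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


# Hypothetical toy data: 4 items, each with 3 LLM answers and 5 human labels.
llm_answers = [["A", "A", "A"], ["A", "B", "A"], ["A", "B", "C"], ["B", "B", "B"]]
human_labels = [["A"] * 5, ["A", "A", "B", "A", "B"], ["A", "B", "C", "B", "A"], ["B"] * 5]

llm_agreement = [majority_share(item) for item in llm_answers]
human_agreement = [majority_share(item) for item in human_labels]

# Rank correlation between per-item LLM agreement and human agreement.
rho, p_value = spearmanr(llm_agreement, human_agreement)
print(f"Spearman rho={rho:.2f}, p={p_value:.2f}")
```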
Implications
- For Evaluation: Treat LLM disagreement as signal, not just noise
- For Production: High-confidence outputs (where models agree) may be safer for automation
- For Annotation: LLM disagreement can flag items needing expert review (see the sketch below)
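One way to operationalize the annotation point: flag an item for expert review when the LLMs' majority answer falls below a share threshold. A minimal sketch; the function name and the 0.7 cutoff are illustrative assumptions.

```python
from collections import Counter


def needs_expert_review(llm_answers, threshold=0.7):
    """Flag an item when the majority answer's share falls below the threshold."""
    top_count = Counter(llm_answers).most_common(1)[0][1]
    return top_count / len(llm_answers) < threshold


# Three models split 2-to-1 -> majority share 0.67 < 0.7 -> route to a human.
print(needs_expert_review(["contract is valid", "contract is void", "contract is valid"]))  # True
```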
Caveats
- Correlated errors exist (shared training data, similar architectures)
- Agreement doesn’t guarantee correctness
- The agreement signal is stronger for some task types than others
Application
Use self-consistency and multi-model voting not just for accuracy, but as an uncertainty indicator for routing decisions.
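A minimal routing sketch under these assumptions: the answers come from self-consistency samples or multiple models, and `escalate` stands in for whatever fallback the system uses (a stronger model, retrieval, a human reviewer). The names and the 0.8 cutoff are illustrative, not prescribed by this note.

```python
from collections import Counter
from typing import Callable, Sequence


def vote_and_route(
    answers: Sequence[str],
    escalate: Callable[[], str],
    min_agreement: float = 0.8,
) -> str:
    """Majority-vote over sampled answers; escalate when agreement is low."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / len(answers) >= min_agreement:
        return top_answer  # high agreement: safer to automate
    return escalate()      # low agreement: treat as uncertain and hand off


# Example: five self-consistency samples, four agree -> automated answer "42".
samples = ["42", "42", "42", "42", "41"]
print(vote_and_route(samples, escalate=lambda: "sent to human reviewer"))
```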
Related: 05-molecule—self-consistency-through-diverse-sampling, 05-atom—uniform-confidence-problem