LLM Agreement Predicts Human Agreement

When multiple LLMs agree on an answer, humans are more likely to agree too. LLM disagreement often signals genuine ambiguity rather than model error.

The Finding

Inter-LLM agreement correlates with:

  • Human annotator agreement
  • Task difficulty
  • Ambiguity in the underlying question

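As a rough illustration of how this correlation can be measured, the sketch below computes a per-item inter-LLM agreement score (the fraction of model pairs giving the same label) and correlates it with human annotator agreement on the same items. The item data, labels, and model counts are hypothetical; only `itertools` and `scipy.stats.pearsonr` are assumed to be available.

```python
from itertools import combinations
from scipy.stats import pearsonr  # assumes scipy is installed

def pairwise_agreement(labels):
    """Fraction of label pairs (from different models or annotators) that match for one item."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical per-item labels from several LLMs and several human annotators.
items = [
    {"llm_labels": ["A", "A", "A"], "human_labels": ["A", "A", "A", "A"]},
    {"llm_labels": ["A", "B", "A"], "human_labels": ["A", "B", "B", "A"]},
    {"llm_labels": ["B", "C", "A"], "human_labels": ["C", "A", "B", "C"]},
    {"llm_labels": ["B", "B", "B"], "human_labels": ["B", "B", "A", "B"]},
]

llm_agree = [pairwise_agreement(it["llm_labels"]) for it in items]
human_agree = [pairwise_agreement(it["human_labels"]) for it in items]

# The finding above: these two series tend to correlate across a dataset.
r, p = pearsonr(llm_agree, human_agree)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```
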
Implications

  • For Evaluation: Treat LLM disagreement as signal, not just noise.
  • For Production: High-confidence outputs (where models agree) may be safer for automation.
  • For Annotation: LLM disagreement can flag items needing expert review.

Caveats

  • Correlated errors exist (shared training data, similar architectures)
  • Agreement doesn’t guarantee correctness
  • Works better for some task types than others

Application

Use self-consistency and multi-model voting not just to improve accuracy, but as an uncertainty indicator for routing decisions, as in the sketch below.
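
A minimal sketch of that routing pattern, assuming a hypothetical `query_model(model_name, prompt)` helper wired to your own LLM client. The model names, the 2/3 agreement threshold, and the route labels are illustrative choices, not a prescribed setup.

```python
from collections import Counter

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical helper: call one model and return its answer as a string."""
    raise NotImplementedError("Wire this up to your LLM client of choice.")

def vote_and_route(prompt: str, models: list[str], agree_threshold: float = 0.67) -> dict:
    """Majority-vote across models and use the agreement rate as an uncertainty signal."""
    answers = [query_model(m, prompt) for m in models]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)

    # High agreement -> safer to automate; low agreement -> flag for human review.
    route = "automate" if agreement >= agree_threshold else "human_review"
    return {"answer": top_answer, "agreement": agreement, "route": route}

# Example usage (illustrative model names, commented out since query_model is a stub):
# result = vote_and_route("Is this review positive or negative?",
#                         ["model-a", "model-b", "model-c"])
# if result["route"] == "human_review":
#     send_to_annotation_queue(result)  # hypothetical downstream step
```

The same threshold can also drive the annotation workflow mentioned under Implications: items routed to human review are exactly the ones where LLM disagreement suggests genuine ambiguity.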

Related: 05-molecule—self-consistency-through-diverse-sampling, 05-atom—uniform-confidence-problem