LLM Consensus as Quality Proxy
When multiple LLMs agree with each other on an annotation task, they also tend to agree with human annotators.
This correlation suggests an underlying structure to annotation tasks: some have clearer “correct” answers that both humans and LLMs converge toward; others involve ambiguity where disagreement proliferates.
The practical implication: inter-LLM agreement can serve as a signal for task suitability. High LLM-to-LLM agreement → the LLM annotations are more likely to align with human judgment. Low LLM-to-LLM agreement → the task may require human judgment, contextual knowledge, or clearer operationalization of the labeling scheme.
This pattern enables a screening approach: run multiple LLMs on a sample, measure their agreement, and use that to decide whether LLM annotation is appropriate for the full dataset.
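A minimal sketch of that screening step, assuming each LLM's labels for the sample are already collected as parallel lists. The function name, the mean pairwise Cohen's kappa metric, and the 0.7 threshold are illustrative choices, not prescribed by the note.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score  # pairwise chance-corrected agreement

def screen_llm_agreement(annotations: dict[str, list[str]], threshold: float = 0.7) -> bool:
    """Return True if mean pairwise Cohen's kappa across LLMs meets the threshold."""
    pairs = list(combinations(annotations, 2))
    kappas = [cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs]
    mean_kappa = sum(kappas) / len(kappas)
    print(f"mean pairwise kappa = {mean_kappa:.2f}")
    return mean_kappa >= threshold

# Hypothetical example: three LLMs label the same six sampled items.
sample_labels = {
    "llm_a": ["pos", "neg", "pos", "neu", "neg", "pos"],
    "llm_b": ["pos", "neg", "pos", "neu", "pos", "pos"],
    "llm_c": ["pos", "neg", "neu", "neu", "neg", "pos"],
}
if screen_llm_agreement(sample_labels):
    print("High inter-LLM agreement: LLM annotation looks suitable for the full dataset.")
else:
    print("Low inter-LLM agreement: the task may need human judgment or a clearer codebook.")
```

Chance-corrected measures like kappa (or Krippendorff's alpha for more than two annotators or missing labels) are preferable to raw percent agreement here, since skewed label distributions can make raw agreement look high even when the LLMs are not discriminating well.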
Related: [None yet]