Small Fine-Tuned Models Beat LLM Judges
A 400M-parameter DeBERTa model fine-tuned on RAG evaluation data outperforms few-shot, billion-parameter LLM judges at detecting hallucinations and assessing RAG quality.
This isn’t an anomaly. The pattern appears across multiple evaluation tasks: LLMs are better at performing tasks than at evaluating how well a task was performed. Generation and evaluation are fundamentally different cognitive operations. A model optimized for fluent generation may be biased toward fluent-sounding outputs when judging, regardless of factual grounding.
The practical implication: throwing your largest model at evaluation is not automatically the best strategy. A purpose-built evaluation model can be smaller, cheaper, faster, and more accurate.
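To make this concrete, here is a minimal sketch of what a purpose-built judge looks like in practice: a small fine-tuned cross-encoder scoring whether a RAG answer is grounded in its retrieved context. The checkpoint name is a placeholder (not a real published model), and the binary grounded/hallucinated label mapping is an assumption; substitute whatever fine-tuned DeBERTa-class model and label scheme you actually use.

```python
# Sketch: scoring a RAG answer with a small fine-tuned cross-encoder judge.
# Assumes a DeBERTa-sized model fine-tuned as a binary grounded/hallucinated
# classifier over (context, answer) pairs.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "your-org/deberta-v3-rag-judge"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def grounding_score(context: str, answer: str) -> float:
    """Return P(answer is grounded in context)."""
    # Cross-encoder input: context and answer encoded as a single pair.
    inputs = tokenizer(context, answer, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes label index 1 == "grounded"; check your model's config.
    return torch.softmax(logits, dim=-1)[0, 1].item()

context = "The Eiffel Tower was completed in 1889 for the World's Fair."
answer = "The Eiffel Tower was finished in 1925."
print(f"grounding score: {grounding_score(context, answer):.3f}")  # low = likely hallucinated
```

Because the whole judgment is one forward pass through a ~400M-parameter encoder, it batches cheaply across an evaluation set and runs on modest hardware, which is exactly the cost profile the note argues for.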
Related: [None yet]