The LLM-as-Evaluator Paradigm

A fundamental shift in AI evaluation: using LLMs to evaluate other AI systems rather than relying solely on traditional metrics.

Two main approaches have emerged:

Output-based methods prompt LLMs to explicitly judge text outputs. Systems like RAGAS instruct GPT to check whether responses are supported by retrieved context. This works with both open and closed models but depends heavily on prompt design.
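Below is a minimal sketch of the output-based pattern, not the actual RAGAS implementation: a judge model is prompted to decide whether a response is supported by retrieved context. The prompt wording, the `gpt-4o-mini` model name, and the `judge_faithfulness` helper are illustrative assumptions.

```python
# Minimal sketch of an output-based LLM judge for faithfulness.
# NOT the RAGAS implementation; prompt wording and the model name
# ("gpt-4o-mini") are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an evaluator. Given the retrieved context and a
response, answer "yes" if every claim in the response is supported by the
context; otherwise answer "no" and name the unsupported claim.

Context:
{context}

Response:
{response}
"""

def judge_faithfulness(context: str, response: str) -> str:
    """Ask the judge model whether the response is grounded in the context."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable chat model works
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context,
                                                  response=response)}],
        temperature=0,  # deterministic judgments are easier to compare across runs
    )
    return completion.choices[0].message.content

# Example:
# verdict = judge_faithfulness("Paris is the capital of France.",
#                              "The capital of France is Paris.")
```

Because the verdict hinges entirely on the instructions in `JUDGE_PROMPT`, small prompt changes can shift judgments, which is the prompt-design dependence noted above.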

Representation-based methods analyze vector representations in intermediate or final layers. These capture semantic patterns that surface-level text comparison misses, but they sacrifice interpretability: a similarity score doesn’t explain which factual detail is wrong.
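A minimal sketch of the representation-based idea, assuming the sentence-transformers library: embed both texts and compare them by cosine similarity. The model name (`all-MiniLM-L6-v2`) and the 0.8 threshold are illustrative assumptions, not fixed choices.

```python
# Representation-based check: compare sentence embeddings by cosine similarity.
# Model name and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantically_close(reference: str, candidate: str,
                       threshold: float = 0.8) -> bool:
    """Return True if the two texts land near each other in embedding space."""
    ref_vec, cand_vec = model.encode([reference, candidate])
    return float(cos_sim(ref_vec, cand_vec)) >= threshold

# The score says "similar" or "not similar" but cannot point at which
# factual detail diverges -- the interpretability loss noted above.
```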

The paradigm shift matters because traditional metrics (BLEU, ROUGE, exact match) struggle with semantic equivalence. LLM judges can recognize that “The capital of France is Paris” and “Paris serves as France’s capital city” express the same fact, while n-gram metrics penalize the surface-level differences and assign a low overlap score.
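The blind spot is easy to demonstrate with a toy overlap measure (a pure-Python sketch, not an official BLEU or ROUGE implementation): the two Paris sentences share no bigrams at all, so any overlap-based score bottoms out even though the meaning is identical.

```python
# Illustration of the n-gram blind spot: paraphrases of the same fact
# can share zero bigrams. Toy sketch, not an official BLEU/ROUGE metric.
def bigram_overlap(reference: str, candidate: str) -> float:
    """Fraction of candidate bigrams that also appear in the reference."""
    def bigrams(text: str) -> set:
        tokens = text.lower().replace(".", "").split()
        return set(zip(tokens, tokens[1:]))
    ref, cand = bigrams(reference), bigrams(candidate)
    return len(ref & cand) / len(cand) if cand else 0.0

print(bigram_overlap("The capital of France is Paris",
                     "Paris serves as France's capital city"))  # 0.0
```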

The tradeoff: LLM evaluation reduces the human annotation burden but introduces new biases (position, verbosity, and self-preference effects have all been documented) along with inference cost. The evaluator becomes another system that itself needs evaluation.

Related: [None yet]