Conventional vs LLM-Based RAG Evaluation
Overview
Two paradigms exist for evaluating RAG systems, each with different strengths, costs, and failure modes. The choice isn't either/or: robust evaluation typically combines both.
Conventional Methods
IR-derived metrics (retrieval focus; see the sketch after this list):
- Rank-unaware: Accuracy, Recall@K, Precision@K, F1
- Rank-aware: MRR, NDCG, MAP
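These retrieval metrics are cheap to compute directly. A minimal sketch, assuming binary relevance judgments (a set of relevant doc IDs per query) and a ranked list of retrieved IDs; the function and variable names are illustrative:

```python
# Minimal retrieval-metric sketch under binary relevance.
import math


def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)


def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k == 0:
        return 0.0
    return len(relevant & set(retrieved[:k])) / k


def mrr(relevant: set, retrieved: list) -> float:
    """Reciprocal rank of the first relevant doc (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Binary-relevance NDCG@k: discounted gain normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)  # i + 2 because ranks are 1-based
              for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```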
NLG-derived metrics (generation focus; surface-matching sketch after this list):
- Surface matching: EM (Exact Match), BLEU, ROUGE
- Semantic: BERTScore, METEOR
- Distribution: Perplexity
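For the surface-matching end of this family, a sketch of SQuAD-style Exact Match and token-level F1. The normalization rules here are common practice but still an assumption; BLEU, ROUGE, and BERTScore would come from their own libraries.

```python
# Sketch of two surface-matching metrics: EM and token-level F1.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())  # overlapping tokens
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)
```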
Strengths:
- Reproducible and deterministic
- Computationally cheap
- Well-understood failure modes
- No dependency on external APIs
Weaknesses:
- Surface-level: struggle with paraphrase and semantic equivalence
- Require ground truth labels
- Miss nuanced correctness (a factually wrong but lexically similar answer can still pass)
LLM-Based Methods
Output-based (judge-prompt sketch after this list):
- Prompted LLM judges (RAGAS, GPTScore)
- Atomic fact verification (FactScore)
- Key point extraction (KPR metric)
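A sketch of an output-based judge in the spirit of RAGAS-style faithfulness scoring. `call_llm` is a placeholder for whatever chat-completion client is in use; the prompt wording and the 1-5 scale are assumptions, not a fixed standard.

```python
# Sketch of an LLM-as-judge faithfulness check. `call_llm` is a placeholder
# that takes a prompt string and returns the model's text reply.
from typing import Callable

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.
Context:
{context}

Question: {question}
Answer: {answer}

On a scale of 1-5, how well is every claim in the answer supported by the
context? Reply with a single integer and nothing else."""


def judge_faithfulness(question: str, context: str, answer: str,
                       call_llm: Callable[[str], str]) -> int:
    """Ask an LLM to rate answer faithfulness against the retrieved context."""
    reply = call_llm(JUDGE_PROMPT.format(context=context,
                                         question=question,
                                         answer=answer))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else 0  # 0 signals an unparseable verdict
```

Sampling several verdicts at low temperature (or across prompt variants) dampens, but does not remove, the non-determinism listed under weaknesses below.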
Representation-based (embedding sketch after this list):
- Embedding similarity with LLM encoders
- Hidden state analysis (Thrust metric)
- Information bottleneck approaches
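A sketch of the simplest representation-based check: cosine similarity between embeddings of the answer and a reference (or the retrieved context). `embed` stands in for any sentence-embedding model; the 0.8 threshold is purely illustrative.

```python
# Sketch of an embedding-similarity check. `embed` is a placeholder for any
# sentence-embedding model that maps text to a vector.
from typing import Callable, Sequence
import numpy as np


def semantic_similarity(answer: str, reference: str,
                        embed: Callable[[str], Sequence[float]]) -> float:
    """Cosine similarity between the two texts' embeddings."""
    a = np.asarray(embed(answer), dtype=float)
    b = np.asarray(embed(reference), dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_semantically_faithful(answer: str, reference: str,
                             embed: Callable[[str], Sequence[float]],
                             threshold: float = 0.8) -> bool:
    """Flag the answer as faithful if similarity clears an (arbitrary) threshold."""
    return semantic_similarity(answer, reference, embed) >= threshold
```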
Strengths:
- Capture semantic equivalence
- Can evaluate open-ended responses
- Align better with human judgment
- Don’t require exact match labels
Weaknesses:
- Non-deterministic (temperature, prompt variation)
- Expensive (API costs, compute)
- Circular: LLM evaluating LLM output
- Inherit evaluator biases
When to Use Each
| Scenario | Recommended Approach |
|---|---|
| High-stakes factuality | Both + human review |
| Development iteration | LLM-based (fast feedback without gold labels) |
| Benchmark comparison | Conventional (reproducibility) |
| Open-ended generation | LLM-based |
| Retrieval ranking | Conventional IR metrics |
| Semantic faithfulness | LLM-based |
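For the high-stakes row, a sketch of how the two signals might be combined: escalate to human review whenever a cheap lexical score and the LLM judge disagree. The scores would come from sketches like those above; the cut points are arbitrary.

```python
# Sketch of a "both + human review" policy: combine a conventional score and
# an LLM-judge score, escalating disagreements. Cut points are illustrative.
def evaluate_example(lexical_f1: float, judge_score: int,
                     f1_cut: float = 0.5, judge_cut: int = 4) -> dict:
    """Return both signals plus a flag routing disagreements to human review."""
    lexical_pass = lexical_f1 >= f1_cut
    judge_pass = judge_score >= judge_cut
    return {
        "token_f1": lexical_f1,
        "judge_score": judge_score,
        "needs_human_review": lexical_pass != judge_pass,  # disagreement -> escalate
    }
```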
The Meta-Problem
LLM-based evaluation introduces a recursive challenge: the evaluator itself needs evaluation. If GPT-4 judges responses, who judges GPT-4’s judgments? This argues for evaluation ensembles rather than single-source authority.
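One practical response is to meta-evaluate the judge: compare its verdicts against a small human-labeled sample before trusting it at scale. A minimal sketch for binary faithful/unfaithful labels, reporting raw agreement and Cohen's kappa; the label encoding is an assumption.

```python
# Sketch of "judging the judge": agreement between binary LLM-judge verdicts
# and human labels on the same examples (True = judged faithful).
from typing import Sequence


def judge_agreement(judge: Sequence[bool], human: Sequence[bool]) -> tuple:
    """Return (raw agreement, Cohen's kappa) for paired binary verdicts."""
    n = len(judge)
    assert n == len(human) and n > 0, "need paired, non-empty label lists"
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's marginal rate of "faithful" verdicts.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa
```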
Related: 05-atom—uniform-confidence-problem