Conventional vs LLM-Based RAG Evaluation

Overview

Two paradigms for evaluating RAG systems, each with different strengths, costs, and failure modes. The choice isn’t either/or; robust evaluation typically combines both.

Conventional Methods

IR-derived metrics (retrieval focus):

  • Non-rank: Accuracy, Recall@K, Precision@K, F1
  • Rank-aware: MRR, NDCG, MAP
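To make the retrieval metrics above concrete, here is a minimal sketch of Precision@K, Recall@K, reciprocal rank (averaged over queries to get MRR), and NDCG@K. The document IDs and relevance labels are invented for illustration, and the NDCG variant shown uses linear gains with a log2 discount, which is one common formulation rather than the only one.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """NDCG@k with graded gains (doc_id -> gain) and a log2 rank discount."""
    dcg = sum(gains.get(d, 0) / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical retrieval result for a single query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 1, "d2": 1}

p2 = precision_at_k(ranked, relevant, 2)
r2 = recall_at_k(ranked, relevant, 2)
rr = reciprocal_rank(ranked, relevant)
ndcg4 = ndcg_at_k(ranked, gains, 4)
```

Precision@K and Recall@K ignore the order inside the top K, while reciprocal rank and NDCG reward placing relevant documents earlier, which is the non-rank versus rank-aware split in the list above.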

NLG-derived metrics (generation focus):

  • Surface matching: EM (Exact Match), BLEU, ROUGE
  • Semantic: BERTScore, METEOR
  • Distribution: Perplexity

Strengths:

  • Reproducible and deterministic
  • Computationally cheap
  • Well-understood failure modes
  • No dependency on external APIs

Weaknesses:

  • Surface-level: struggle with paraphrase and semantic equivalence
  • Require ground truth labels
  • Miss nuanced correctness: a factually wrong but lexically similar answer can still score well
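The surface-level weaknesses can be shown with a small sketch. It implements Exact Match and a unigram-overlap F1 (similar in spirit to ROUGE-1, but not the official implementation) with a simplified normalization step. The sentences are made up for illustration.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def exact_match(pred, ref):
    """True only if the normalized token sequences are identical."""
    return normalize(pred) == normalize(ref)

def token_f1(pred, ref):
    """Harmonic mean of unigram precision and recall against the reference."""
    p, r = normalize(pred), normalize(ref)
    if not p or not r:
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)

ref = "The Eiffel Tower was completed in 1889."
wrong = "The Eiffel Tower was completed in 1899."   # wrong year, near-identical wording
paraphrase = "Construction finished in 1889."       # correct, different wording

wrong_f1 = token_f1(wrong, ref)        # 6 of 7 tokens overlap
para_f1 = token_f1(paraphrase, ref)    # only "in" and "1889" overlap
```

Here the factually wrong answer gets the higher overlap score (6/7 versus 4/11), and Exact Match rejects both answers, including the correct paraphrase.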

LLM-Based Methods

Output-based:

  • Prompted LLM judges (RAGAS, GPTScore)
  • Atomic fact verification (FActScore)
  • Key point extraction (KPR metric)
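As a rough illustration of the atomic-fact idea, the sketch below scores an answer as the fraction of its atomic claims that a judge marks as supported by the retrieved context. This is not the FActScore implementation. The function names are hypothetical, the claims are assumed to be already extracted, and `toy_judge` is a naive substring check standing in for a real LLM call.

```python
from typing import Callable, List

def atomic_fact_precision(claims: List[str], context: str,
                          is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of atomic claims that the judge marks as supported by the context."""
    if not claims:
        return 0.0
    return sum(is_supported(c, context) for c in claims) / len(claims)

def toy_judge(claim: str, context: str) -> bool:
    """Placeholder for an LLM judge: case-insensitive substring match (assumption)."""
    return claim.lower() in context.lower()

context = ("Marie Curie won the Nobel Prize in Physics in 1903. "
           "She was born in Warsaw.")
claims = [
    "Marie Curie won the Nobel Prize in Physics in 1903",
    "She was born in Warsaw",
    "She was born in Paris",   # not supported by the context
]

score = atomic_fact_precision(claims, context, toy_judge)
```

In a real pipeline both the claim extraction and the support check would typically be prompted LLM calls. Keeping the judge as a plain callable makes it easy to swap implementations or to stub it out in tests.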

Representation-based:

  • Embedding similarity with LLM encoders
  • Hidden state analysis (Thrust metric)
  • Information bottleneck approaches
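The simplest representation-based check is cosine similarity between the embedding of a generated answer and that of a reference. The sketch below assumes the embeddings have already been produced by some encoder. The three-dimensional vectors are toy values chosen for illustration, not real encoder outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy embeddings (assumption): a real encoder would return hundreds of dimensions.
answer_vec = [0.2, 0.7, 0.1]
reference_vec = [0.25, 0.65, 0.05]
unrelated_vec = [0.9, -0.1, 0.3]

sim_reference = cosine_similarity(answer_vec, reference_vec)
sim_unrelated = cosine_similarity(answer_vec, unrelated_vec)
```

A score near 1 indicates that the two texts sit close together in the encoder's space. Where to set the acceptance threshold depends on the encoder and should be calibrated against labeled examples.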

Strengths:

  • Capture semantic equivalence
  • Can evaluate open-ended responses
  • Align better with human judgment
  • Don’t require exact match labels

Weaknesses:

  • Non-deterministic (temperature, prompt variation)
  • Expensive (API costs, compute)
  • Circular: LLM evaluating LLM output
  • Inherit evaluator biases

When to Use Each

Scenario                  Recommended Approach
High-stakes factuality    Both + human review
Development iteration     LLM-based (faster feedback)
Benchmark comparison      Conventional (reproducibility)
Open-ended generation     LLM-based
Retrieval ranking         Conventional IR metrics
Semantic faithfulness     LLM-based

The Meta-Problem

LLM-based evaluation introduces a recursive challenge: the evaluator itself needs evaluation. If GPT-4 judges responses, who judges GPT-4’s judgments? This argues for evaluation ensembles rather than single-source authority.
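One way to act on this is sketched below: combine several judges by majority vote, then check the combined verdicts against a small set of human labels using Cohen's kappa. The judge verdicts and human labels are invented for illustration, and using kappa against human labels as the meta-evaluation is an assumption of this sketch rather than something the note prescribes.

```python
from collections import Counter

def majority_vote(verdicts):
    """Most common verdict among the judges (use an odd number of judges to avoid ties)."""
    return Counter(verdicts).most_common(1)[0][0]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail verdicts (1 = pass) from three judges on six responses.
judge_a = [1, 1, 0, 1, 0, 0]
judge_b = [1, 0, 0, 1, 0, 1]
judge_c = [1, 1, 0, 0, 0, 0]
human   = [1, 1, 0, 1, 1, 0]   # hypothetical human review labels

ensemble = [majority_vote(v) for v in zip(judge_a, judge_b, judge_c)]
kappa_ensemble = cohens_kappa(ensemble, human)
kappa_single = cohens_kappa(judge_b, human)
```

On this toy data the ensemble agrees with the human labels more often than judge_b does on its own. Real data may not behave this way, which is why the agreement has to be measured rather than assumed.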

Related: 05-atom—uniform-confidence-problem