Conventional vs LLM-Based RAG Evaluation
Overview
Two paradigms exist for evaluating RAG systems, each with different strengths, costs, and failure modes. The choice isn't either/or: robust evaluation typically combines both.
Conventional Methods
IR-derived metrics (retrieval focus; see the sketch after this list):
- Rank-unaware: Accuracy, Recall@K, Precision@K, F1
- Rank-aware: MRR, NDCG, MAP
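These retrieval metrics are cheap to compute directly. A minimal sketch, assuming binary relevance judgments (a set of relevant doc IDs per query) and a ranked list of retrieved IDs; the function and variable names are illustrative:

```python
# Minimal retrieval-metric sketch under binary relevance.
import math


def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)


def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k == 0:
        return 0.0
    return len(relevant & set(retrieved[:k])) / k


def mrr(relevant: set, retrieved: list) -> float:
    """Reciprocal rank of the first relevant doc (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Binary-relevance NDCG@k: discounted gain normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)  # i + 2 because ranks are 1-based
              for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```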
NLG-derived metrics (generation focus; surface-matching sketch after this list):
- Surface matching: EM (Exact Match), BLEU, ROUGE
- Semantic: BERTScore, METEOR
- Distribution: Perplexity
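For the surface-matching end of this family, a sketch of SQuAD-style Exact Match and token-level F1. The normalization rules here are common practice but still an assumption; BLEU, ROUGE, and BERTScore would come from their own libraries.

```python
# Sketch of two surface-matching metrics: EM and token-level F1.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())  # overlapping tokens
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)
```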
Strengths:
- Reproducible and deterministic
- Computationally cheap
- Well-understood failure modes
- No dependency on external APIs
Weaknesses:
- Surface-level: struggle with paraphrase and semantic equivalence
- Require ground truth labels
- Miss nuanced correctness (a factually wrong but lexically similar answer can still pass)
LLM-Based Methods
Output-based (judge-prompt sketch after this list):
- Prompted LLM judges (RAGAS, GPTScore)
- Atomic fact verification (FactScore)
- Key point extraction (KPR metric)
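A sketch of an output-based judge in the spirit of RAGAS-style faithfulness scoring. `call_llm` is a placeholder for whatever chat-completion client is in use; the prompt wording and the 1-5 scale are assumptions, not a fixed standard.

```python
# Sketch of an LLM-as-judge faithfulness check. `call_llm` is a placeholder
# that takes a prompt string and returns the model's text reply.
from typing import Callable

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.
Context:
{context}

Question: {question}
Answer: {answer}

On a scale of 1-5, how well is every claim in the answer supported by the
context? Reply with a single integer and nothing else."""


def judge_faithfulness(question: str, context: str, answer: str,
                       call_llm: Callable[[str], str]) -> int:
    """Ask an LLM to rate answer faithfulness against the retrieved context."""
    reply = call_llm(JUDGE_PROMPT.format(context=context,
                                         question=question,
                                         answer=answer))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else 0  # 0 signals an unparseable verdict
```

Sampling several verdicts at low temperature (or across prompt variants) dampens, but does not remove, the non-determinism listed under weaknesses below.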
Representation-based (embedding sketch after this list):
- Embedding similarity with LLM encoders
- Hidden state analysis (Thrust metric)
- Information bottleneck approaches
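A sketch of the simplest representation-based check: cosine similarity between embeddings of the answer and a reference (or the retrieved context). `embed` stands in for any sentence-embedding model; the 0.8 threshold is purely illustrative.

```python
# Sketch of an embedding-similarity check. `embed` is a placeholder for any
# sentence-embedding model that maps text to a vector.
from typing import Callable, Sequence
import numpy as np


def semantic_similarity(answer: str, reference: str,
                        embed: Callable[[str], Sequence[float]]) -> float:
    """Cosine similarity between the two texts' embeddings."""
    a = np.asarray(embed(answer), dtype=float)
    b = np.asarray(embed(reference), dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_semantically_faithful(answer: str, reference: str,
                             embed: Callable[[str], Sequence[float]],
                             threshold: float = 0.8) -> bool:
    """Flag the answer as faithful if similarity clears an (arbitrary) threshold."""
    return semantic_similarity(answer, reference, embed) >= threshold
```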
Strengths:
- Capture semantic equivalence
- Can evaluate open-ended responses
- Align better with human judgment
- Don’t require exact match labels
Weaknesses:
- Non-deterministic (temperature, prompt variation)
- Expensive (API costs, compute)
- Circular: LLM evaluating LLM output
- Inherit evaluator biases
When to Use Each
| Scenario | Recommended Approach |
|---|---|
| High-stakes factuality | Both + human review |
| Development iteration | LLM-based (fast feedback without gold labels) |
| Benchmark comparison | Conventional (reproducibility) |
| Open-ended generation | LLM-based |
| Retrieval ranking | Conventional IR metrics |
| Semantic faithfulness | LLM-based |
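For the high-stakes row, a sketch of how the two signals might be combined: escalate to human review whenever a cheap lexical score and the LLM judge disagree. The scores would come from sketches like those above; the cut points are arbitrary.

```python
# Sketch of a "both + human review" policy: combine a conventional score and
# an LLM-judge score, escalating disagreements. Cut points are illustrative.
def evaluate_example(lexical_f1: float, judge_score: int,
                     f1_cut: float = 0.5, judge_cut: int = 4) -> dict:
    """Return both signals plus a flag routing disagreements to human review."""
    lexical_pass = lexical_f1 >= f1_cut
    judge_pass = judge_score >= judge_cut
    return {
        "token_f1": lexical_f1,
        "judge_score": judge_score,
        "needs_human_review": lexical_pass != judge_pass,  # disagreement -> escalate
    }
```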
The Meta-Problem
LLM-based evaluation introduces a recursive challenge: the evaluator itself needs evaluation. If GPT-4 judges responses, who judges GPT-4’s judgments? This argues for evaluation ensembles rather than single-source authority.
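One practical response is to meta-evaluate the judge: compare its verdicts against a small human-labeled sample before trusting it at scale. A minimal sketch for binary faithful/unfaithful labels, reporting raw agreement and Cohen's kappa; the label encoding is an assumption.

```python
# Sketch of "judging the judge": agreement between binary LLM-judge verdicts
# and human labels on the same examples (True = judged faithful).
from typing import Sequence


def judge_agreement(judge: Sequence[bool], human: Sequence[bool]) -> tuple:
    """Return (raw agreement, Cohen's kappa) for paired binary verdicts."""
    n = len(judge)
    assert n == len(human) and n > 0, "need paired, non-empty label lists"
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's marginal rate of "faithful" verdicts.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa
```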
Related: 05-atom—uniform-confidence-problem