RAG Evaluation Targets Framework

Overview

A structured approach to RAG evaluation built on pairwise relationships between system components and their outputs. The framework distinguishes six evaluation targets across retrieval and generation.

Retrieval Targets

Three relationships define what “good retrieval” means:

Target             Relationship             What It Measures
Relevance          Documents ↔ Query        Do retrieved docs match the information need?
Comprehensiveness  Documents ↔ Documents    Do retrieved docs provide diverse, complete coverage?
Correctness        Documents ↔ Candidates   Are the right docs ranked above wrong docs?
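
One way to make the three retrieval relationships concrete is sketched below. This is an illustration under stated assumptions, not the framework's prescribed metrics: it assumes human relevance labels, a hypothetical per-document set of "aspects" covered, and a hypothetical set of aspects the query requires. The function names are invented for this example.

```python
from typing import Mapping, Sequence, Set


def relevance_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Documents <-> Query: fraction of retrieved docs labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)


def comprehensiveness_coverage(
    retrieved: Sequence[str],
    doc_aspects: Mapping[str, Set[str]],  # assumption: aspects annotated per doc
    required_aspects: Set[str],           # assumption: aspects the query needs
) -> float:
    """Documents <-> Documents: share of required aspects the set covers jointly."""
    if not required_aspects:
        return 1.0
    covered: Set[str] = set()
    for doc in retrieved:
        covered |= doc_aspects.get(doc, set())
    return len(covered & required_aspects) / len(required_aspects)


def ranking_correctness(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Documents <-> Candidates: share of (relevant, non-relevant) pairs
    in which the relevant doc is ranked above the non-relevant one."""
    pairs = 0
    correct = 0
    for i, higher in enumerate(ranked):
        for lower in ranked[i + 1:]:
            if (higher in relevant) != (lower in relevant):
                pairs += 1
                if higher in relevant:
                    correct += 1
    return correct / pairs if pairs else 1.0


# Toy data, invented for illustration only.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
doc_aspects = {"d1": {"cause"}, "d2": {"history"}, "d3": {"cause", "treatment"}}
required = {"cause", "treatment", "prognosis"}

precision = relevance_precision(ranked, relevant)
coverage = comprehensiveness_coverage(ranked, doc_aspects, required)
order_score = ranking_correctness(ranked, relevant)
```

The three scores are deliberately independent: a result list can be fully relevant yet cover only one aspect of the question, or cover everything while ranking weaker documents first.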

Generation Targets

Three parallel relationships define “good generation”:

Target             Relationship             What It Measures
Relevance          Response ↔ Query         Does the response address what was asked?
Faithfulness       Response ↔ Documents     Does the response accurately reflect retrieved content?
Correctness        Response ↔ Ground Truth  Is the response factually accurate?

Application

The framework is diagnostic: when a system underperforms, the relationships pinpoint where.

  • Low retrieval relevance → query understanding or embedding mismatch
  • Low retrieval comprehensiveness → biased retrieval or insufficient corpus
  • Low faithfulness but high correctness → model ignoring context, using parametric knowledge
  • High faithfulness but low correctness → accurate summarization of wrong sources
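
The diagnostic mapping above can be sketched as a small rule table. This is a minimal illustration, assuming each target has already been scored on a 0 to 1 scale; the `TargetScores` class, the `diagnose` function, and the 0.5 threshold are hypothetical and would need calibration for any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TargetScores:
    """Scores in [0, 1] for the targets used by the diagnostic rules."""
    retrieval_relevance: float
    retrieval_comprehensiveness: float
    faithfulness: float
    generation_correctness: float


def diagnose(scores: TargetScores, threshold: float = 0.5) -> List[str]:
    """Return the likely failure causes suggested by low or mismatched scores.

    The threshold is an arbitrary placeholder, not a recommended value.
    """
    def low(value: float) -> bool:
        return value < threshold

    findings: List[str] = []
    if low(scores.retrieval_relevance):
        findings.append("query understanding or embedding mismatch")
    if low(scores.retrieval_comprehensiveness):
        findings.append("biased retrieval or insufficient corpus")
    if low(scores.faithfulness) and not low(scores.generation_correctness):
        findings.append("model ignoring context, using parametric knowledge")
    if not low(scores.faithfulness) and low(scores.generation_correctness):
        findings.append("accurate summarization of wrong sources")
    return findings


# Example: retrieval looks fine, the answer is faithful to the sources, but it is wrong.
example = TargetScores(
    retrieval_relevance=0.8,
    retrieval_comprehensiveness=0.7,
    faithfulness=0.9,
    generation_correctness=0.2,
)
example_findings = diagnose(example)
```

Several rules can fire at once, which matches how the framework is meant to be read: each relationship is checked separately, and the combination narrows down where the pipeline is failing.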

Limitations

The framework assumes that ground truth exists for correctness evaluation. That assumption breaks down for exploratory or creative tasks, where “correct” is undefined.

Related: 05-atom—internal-external-evaluation-distinction, 05-atom—uniform-confidence-problem