RAG Evaluation Targets Framework
Overview
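A structured approach to evaluating retrieval-augmented generation (RAG) systems, built on pairwise relationships between the query, the retrieved documents, the response, and the reference data they are checked against. The framework distinguishes six evaluation targets: three for retrieval and three for generation.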
Retrieval Targets
Three relationships define what “good retrieval” means:
| Target | Relationship | What It Measures |
|---|---|---|
| Relevance | Documents ↔ Query | Do retrieved docs match the information need? |
| Comprehensiveness | Documents ↔ Documents | Do retrieved docs provide diverse, complete coverage? |
| Correctness | Documents ↔ Candidates | Are the right docs ranked above wrong docs? |
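
As a rough illustration, each retrieval target can be approximated with a simple set- or rank-based score. The sketch below is one possible operationalization, not something the framework prescribes: the function names are hypothetical, relevance judgments are assumed to be available as a set of document ids, and term-set diversity is only a proxy for comprehensiveness (it captures redundancy, not whether coverage is complete).

```python
from itertools import combinations


def precision_at_k(ranked, relevant, k):
    """Relevance (Documents <-> Query): share of the top-k ids judged relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def mean_pairwise_distance(doc_terms):
    """Comprehensiveness proxy (Documents <-> Documents):
    mean Jaccard distance between the term sets of the retrieved docs."""
    pairs = list(combinations(doc_terms, 2))
    if not pairs:
        return 0.0

    def jaccard_distance(a, b):
        union = a | b
        if not union:
            return 0.0  # two empty docs are treated as identical
        return 1.0 - len(a & b) / len(union)

    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def pairwise_ranking_accuracy(ranked, relevant):
    """Correctness (Documents <-> Candidates): fraction of
    (relevant, non-relevant) pairs where the relevant doc is ranked higher."""
    rel_pos = [i for i, doc in enumerate(ranked) if doc in relevant]
    non_pos = [i for i, doc in enumerate(ranked) if doc not in relevant]
    total = len(rel_pos) * len(non_pos)
    if total == 0:
        return 0.0
    correct = sum(1 for r in rel_pos for n in non_pos if r < n)
    return correct / total


# Hypothetical ranking: d1 and d3 are the relevant documents.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, k=2))        # 0.5
print(pairwise_ranking_accuracy(ranked, relevant))  # 0.75
```

Keeping the three scores separate is what preserves the diagnostic value: a ranking can score well on relevance while still returning near-duplicate documents.
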
Generation Targets
Three parallel relationships define “good generation”:
| Target | Relationship | What It Measures |
|---|---|---|
| Relevance | Response ↔ Query | Does the response address what was asked? |
| Faithfulness | Response ↔ Documents | Does the response accurately reflect retrieved content? |
| Correctness | Response ↔ Ground Truth | Is the response factually accurate? |
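
The three generation targets differ mainly in what the response is compared against. The sketch below makes that concrete with a deliberately crude token-overlap score. This is an assumption made for illustration only: real evaluations typically rely on human judgment, entailment models, or LLM judges, and the names `support` and `generation_scores` are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support(claim_text, reference_text):
    """Fraction of the tokens in claim_text that also appear in reference_text."""
    claim = tokens(claim_text)
    if not claim:
        return 0.0
    return len(claim & tokens(reference_text)) / len(claim)


def generation_scores(query, response, documents, ground_truth):
    """Score one response against each of its three comparison partners."""
    return {
        # Relevance (Response <-> Query): how much of the query the response covers.
        "relevance": support(query, response),
        # Faithfulness (Response <-> Documents): how much of the response
        # is backed by the retrieved documents.
        "faithfulness": support(response, " ".join(documents)),
        # Correctness (Response <-> Ground Truth): how much of the reference
        # answer the response contains.
        "correctness": support(ground_truth, response),
    }


scores = generation_scores(
    query="capital of France",
    response="Lyon is the capital of France",
    documents=["Paris is the capital and largest city of France"],
    ground_truth="Paris",
)
print(scores)
```

In this made-up example the response is fully on topic, mostly supported token-wise by the document, and still wrong. The high faithfulness score also shows the limit of token overlap: a proper faithfulness check would flag "Lyon" as unsupported.
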
Application
The framework is diagnostic: when a system underperforms, the relationships pinpoint where.
- Low retrieval relevance → query understanding or embedding mismatch
- Low retrieval comprehensiveness → biased retrieval or insufficient corpus
- Low faithfulness but high correctness → model ignoring context, using parametric knowledge
- High faithfulness but low correctness → accurate summarization of wrong sources
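
The diagnostic rules above can be sketched as a small lookup over per-target scores. The score keys, the 0 to 1 scale, and the single `threshold` of 0.5 are assumptions for illustration; a real system would likely calibrate a threshold per target.

```python
def diagnose(scores, threshold=0.5):
    """Map low-scoring targets to the likely causes listed in the note.

    `scores` is a dict with the (hypothetical) keys used below, each in [0, 1].
    """

    def low(name):
        return scores[name] < threshold

    findings = []
    if low("retrieval_relevance"):
        findings.append("query understanding or embedding mismatch")
    if low("retrieval_comprehensiveness"):
        findings.append("biased retrieval or insufficient corpus")
    if low("faithfulness") and not low("correctness"):
        findings.append("model ignoring context, using parametric knowledge")
    if not low("faithfulness") and low("correctness"):
        findings.append("accurate summarization of wrong sources")
    return findings


example = {
    "retrieval_relevance": 0.9,
    "retrieval_comprehensiveness": 0.8,
    "faithfulness": 0.2,
    "correctness": 0.9,
}
print(diagnose(example))
# ['model ignoring context, using parametric knowledge']
```

The last two rules compare faithfulness and correctness together, since neither score on its own distinguishes a retrieval problem from a generation problem.
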
Limitations
The framework assumes that ground truth exists for correctness evaluation. This assumption is problematic for exploratory or creative tasks, where “correct” is undefined.
Related: 05-atom—internal-external-evaluation-distinction, 05-atom—uniform-confidence-problem