RAG Evaluation Dimensions
Overview
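A framework for evaluating retrieval-augmented generation (RAG) systems across three interdependent dimensions: context relevance, answer faithfulness, and answer relevance. Poor performance in one dimension can cascade into the others, so the three should be evaluated together rather than in isolation.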
Components
1. Context Relevance
What it measures: How pertinent are the retrieved documents to the input query?
Key signals:
- Semantic match between query and retrieved content
- Coverage of information needed to answer
- Absence of distracting or irrelevant material
Failure modes: Retrieved documents are off-topic, partially relevant, or correct but insufficient.
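A minimal sketch of a context relevance check, assuming plain lexical overlap as a crude stand-in for semantic match (a real evaluation would more likely use embeddings or a judge model). The names `tokenize` and `context_relevance` are hypothetical and not part of any library.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def context_relevance(query: str, docs: list[str]) -> list[float]:
    """Per document: the fraction of query tokens that also appear in it.

    Assumption: token overlap approximates relevance. It will miss
    paraphrases and reward documents that merely repeat query words.
    """
    query_tokens = tokenize(query)
    if not query_tokens:
        return [0.0 for _ in docs]
    return [len(query_tokens & tokenize(doc)) / len(query_tokens) for doc in docs]

scores = context_relevance(
    "capital of france",
    ["Paris is the capital of France.", "Bananas are rich in potassium."],
)
print(scores)  # [1.0, 0.0]
```

A low score for every retrieved document points at off-topic retrieval; a mix of high and low scores points at distracting material in the context.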
2. Answer Faithfulness
What it measures: Does the generated output remain grounded in the retrieved evidence?
Key signals:
- Claims traceable to specific retrieved passages
- Absence of hallucinated facts
- Accurate representation of source material (no distortion)
Failure modes: Model ignores evidence, misinterprets evidence, or generates plausible-sounding content not supported by retrieval.
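The "claims traceable to specific retrieved passages" signal can be sketched as a per-sentence support check. This is a rough illustration under stated assumptions: each sentence is treated as one claim, and a claim counts as supported when most of its tokens appear in a single passage. The function name `faithfulness` and the 0.8 threshold are hypothetical; token overlap cannot detect distortion or negation, so it is not a substitute for human or model-based judgment.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def faithfulness(
    answer: str, passages: list[str], threshold: float = 0.8
) -> tuple[float, list[str]]:
    """Return (fraction of supported claims, list of unsupported claims)."""
    # Assumption: one sentence == one claim.
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    unsupported = []
    for claim in claims:
        claim_tokens = tokens(claim)
        best = 0.0
        if claim_tokens:
            # Best overlap with any single passage, so the claim is traceable.
            best = max(
                (len(claim_tokens & tokens(p)) / len(claim_tokens) for p in passages),
                default=0.0,
            )
        if best < threshold:
            unsupported.append(claim)
    score = 1.0 - len(unsupported) / len(claims) if claims else 0.0
    return score, unsupported

score, unsupported = faithfulness(
    "Paris is the capital of France. It has 90 million residents.",
    ["Paris is the capital of France and its largest city."],
)
print(score, unsupported)  # 0.5 ['It has 90 million residents.']
```

Returning the unsupported claims alongside the score keeps the result inspectable, which matters because the score alone does not say which statement was hallucinated.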
3. Answer Relevance
What it measures: Does the output adequately address the user query?
Key signals:
- Directly answers the question asked
- Appropriate level of detail
- Actionable where expected
Failure modes: Technically accurate but doesn’t answer the question; too vague; tangential.
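As a hedged illustration of answer relevance, the sketch below compares the query and the answer with cosine similarity over raw term counts. This is an assumption chosen for simplicity: it only detects tangential answers that share no vocabulary with the question, and it says nothing about level of detail or actionability. The names `term_counts` and `answer_relevance` are hypothetical.

```python
import math
import re
from collections import Counter

def term_counts(text: str) -> Counter:
    """Count lowercase alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def answer_relevance(query: str, answer: str) -> float:
    """Cosine similarity between query and answer term-count vectors."""
    q, a = term_counts(query), term_counts(answer)
    dot = sum(q[t] * a[t] for t in q)  # Counter returns 0 for missing terms
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in a.values())
    )
    return dot / norm if norm else 0.0

question = "How do I reset my password?"
on_topic = answer_relevance(
    question, "To reset your password, open Settings and choose Reset password."
)
off_topic = answer_relevance(question, "Our office is closed on public holidays.")
print(on_topic > off_topic)  # True
```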
Interdependencies
These dimensions cascade:
- Poor context relevance → generator has wrong inputs → likely poor faithfulness and relevance
- Good retrieval + poor faithfulness → accurate sources wasted
- Good faithfulness + poor relevance → accurate but unhelpful
This means single-metric evaluation can be misleading: a system can score well on one dimension while failing on another.
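One way to act on the cascade is to check the dimensions in pipeline order and report the first one that falls below a threshold, since a retrieval failure makes the downstream scores hard to interpret. The sketch below assumes each dimension has already been scored in [0, 1]; `RagScores`, `diagnose`, and the 0.7 threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass

@dataclass
class RagScores:
    context_relevance: float
    faithfulness: float
    answer_relevance: float

def diagnose(scores: RagScores, threshold: float = 0.7) -> str:
    """Report the earliest failing dimension, following the cascade order."""
    if scores.context_relevance < threshold:
        return "context relevance: generator has wrong inputs, downstream scores are suspect"
    if scores.faithfulness < threshold:
        return "faithfulness: accurate sources wasted"
    if scores.answer_relevance < threshold:
        return "answer relevance: accurate but unhelpful"
    return "ok"

print(diagnose(RagScores(context_relevance=0.9, faithfulness=0.4, answer_relevance=0.9)))
# faithfulness: accurate sources wasted
```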
When to Use
- Debugging RAG failures: Identify which dimension is breaking down
- Comparing systems: Ensure comparison captures all three dimensions, not just one
- Setting evaluation strategy: Decide which dimension matters most for your use case
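When comparing systems, a per-dimension summary avoids hiding a weak dimension behind an average. The sketch below uses invented scores purely for illustration (they are not measurements of any real system); `summarize`, `system_a`, and `system_b` are hypothetical names.

```python
from statistics import mean

# Illustrative, made-up scores in [0, 1]; not results from a real system.
system_a = {"context_relevance": 0.9, "faithfulness": 0.9, "answer_relevance": 0.3}
system_b = {"context_relevance": 0.7, "faithfulness": 0.7, "answer_relevance": 0.7}

def summarize(scores: dict[str, float]) -> dict[str, object]:
    """Report the mean alongside the weakest dimension and its score."""
    weakest = min(scores, key=scores.get)
    return {
        "mean": round(mean(scores.values()), 3),
        "weakest": weakest,
        "min": scores[weakest],
    }

summary_a = summarize(system_a)
summary_b = summarize(system_b)
# Both means round to 0.7, but system_a's weakest dimension scores only 0.3.
print(summary_a)
print(summary_b)
```

Reporting the minimum next to the mean is one simple design choice; which dimension matters most still depends on the use case.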
Limitations
- Doesn’t capture efficiency, latency, or cost
- Doesn’t address robustness under adversarial conditions
- Faithfulness is hard to measure automatically and typically requires careful ground-truth annotation or human evaluation
Related: 05-atom—rag-core-equation, 05-atom—rag-seven-failure-points, 05-molecule—rag-architecture-taxonomy