RAG Evaluation Dimensions

Overview

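A framework for evaluating retrieval-augmented generation (RAG) systems across three interdependent dimensions: context relevance, answer faithfulness, and answer relevance. Poor performance in one dimension can cascade into the others, so the three need to be evaluated together.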

Components

1. Context Relevance

What it measures: How pertinent are the retrieved documents to the input query?

Key signals:

  • Semantic match between query and retrieved content
  • Coverage of information needed to answer
  • Absence of distracting or irrelevant material

Failure modes: Retrieved documents are off-topic, only partially relevant, or on-topic but insufficient to answer the query.
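
Two of the signals above (absence of irrelevant material, coverage of needed information) can be approximated with set-based precision and recall, assuming someone has labelled which documents are relevant to each query. This is a minimal sketch under that assumption; the function names and document IDs are hypothetical, and labelled relevance is only one of several ways to score this dimension.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents labelled relevant (low = distracting material)."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of labelled-relevant documents that were retrieved (low = poor coverage)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # assumption: nothing was needed, so coverage is trivially complete
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Hypothetical labels for a single query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]

print(context_precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
```

Low precision with high recall points to distracting material; high precision with low recall matches the "insufficient" failure mode.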

2. Answer Faithfulness

What it measures: Does the generated output remain grounded in the retrieved evidence?

Key signals:

  • Claims traceable to specific retrieved passages
  • Absence of hallucinated facts
  • Accurate representation of source material (no distortion)

Failure modes: Model ignores evidence, misinterprets evidence, or generates plausible-sounding content not supported by retrieval.
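
As a rough illustration of "claims traceable to specific retrieved passages", the sketch below splits an answer into sentences and counts how many share most of their words with at least one passage. It is a crude lexical proxy, not a real faithfulness measure: word overlap cannot detect distortion or paraphrase, and the function name and the 0.6 threshold are assumptions chosen for illustration only.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def supported_fraction(answer, passages, threshold=0.6):
    """Share of answer sentences whose words mostly appear in a single passage."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    passage_tokens = [_tokens(p) for p in passages]
    supported = 0
    for sentence in sentences:
        toks = _tokens(sentence)
        if not toks:
            continue  # punctuation-only sentence counts as unsupported
        # Best overlap ratio against any one passage.
        best = max((len(toks & p) / len(toks) for p in passage_tokens), default=0.0)
        if best >= threshold:
            supported += 1
    return supported / len(sentences)


passages = ["The Eiffel Tower is in Paris and was completed in 1889."]
answer = "The Eiffel Tower is in Paris. It was designed by aliens."
print(supported_fraction(answer, passages))  # first sentence supported, second not -> 0.5
```

The second sentence is the "plausible-sounding content not supported by retrieval" case: only one of its five words appears in the passage.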

3. Answer Relevance

What it measures: Does the output adequately address the user query?

Key signals:

  • Directly answers the question asked
  • Appropriate level of detail
  • Actionable where expected

Failure modes: Technically accurate but doesn’t answer the question; too vague; tangential.

Interdependencies

These dimensions cascade:

  • Poor context relevance → generator has wrong inputs → likely poor faithfulness and relevance
  • Good context relevance + poor faithfulness → relevant sources are retrieved but wasted
  • Good faithfulness + poor relevance → output is grounded but unhelpful

Evaluating on a single metric is therefore misleading: a system can score well on one dimension while failing on another.
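
The cascade suggests checking the dimensions in pipeline order when debugging. The sketch below assumes each dimension has already been scored on a 0 to 1 scale by some means; the `RagScores` type, the `diagnose` function, and the 0.7 threshold are hypothetical names and values used only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class RagScores:
    context_relevance: float
    faithfulness: float
    answer_relevance: float


def diagnose(scores, threshold=0.7):
    """Return the first dimension, in pipeline order, that falls below threshold."""
    # Retrieval is checked first because its failures propagate downstream.
    if scores.context_relevance < threshold:
        return "context_relevance"
    if scores.faithfulness < threshold:
        return "faithfulness"
    if scores.answer_relevance < threshold:
        return "answer_relevance"
    return None


scores = RagScores(context_relevance=0.95, faithfulness=0.95, answer_relevance=0.3)
mean = (scores.context_relevance + scores.faithfulness + scores.answer_relevance) / 3

print(mean > 0.7)        # True: the average alone looks acceptable
print(diagnose(scores))  # answer_relevance
```

Here the averaged score clears the threshold even though one dimension is failing badly, which is the sense in which a single aggregate number hides the problem.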

When to Use

  • Debugging RAG failures: Identify which dimension is breaking down
  • Comparing systems: Ensure comparison captures all three dimensions, not just one
  • Setting evaluation strategy: Decide which dimension matters most for your use case

Limitations

  • Doesn’t capture efficiency, latency, or cost
  • Doesn’t address robustness under adversarial conditions
  • Faithfulness is hard to measure automatically; it requires careful ground-truth annotation or human evaluation

Related: 05-atom—rag-core-equation, 05-atom—rag-seven-failure-points, 05-molecule—rag-architecture-taxonomy