RAG Evaluation Dimensions

Overview

A framework for evaluating retrieval-augmented generation (RAG) systems across three interdependent dimensions. Poor performance in one dimension can cascade into the dimensions downstream of it, making holistic evaluation essential.

Components

1. Context Relevance

What it measures: How pertinent are the retrieved documents to the input query?

Key signals:

Semantic match between query and retrieved content
Coverage of information needed to answer
Absence of distracting or irrelevant material

Failure modes: Retrieved documents are off-topic, only partially relevant, or relevant but insufficient to answer the query.
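The signals above can be approximated with a crude lexical proxy. The sketch below is illustrative only: it assumes word overlap stands in for semantic match, and the names `context_relevance`, `precision`, `coverage`, and the 0.5 threshold are hypothetical choices, not part of any standard metric. A real evaluation would more likely use embeddings or a judge model.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_relevance(query: str, passages: list[str], threshold: float = 0.5) -> dict:
    """Score retrieved passages against the query using term overlap.

    per_passage: fraction of query terms found in each passage
    precision:   share of passages at or above the threshold (few distractors)
    coverage:    fraction of query terms found in at least one passage
    """
    q = tokens(query)
    if not q or not passages:
        return {"per_passage": [], "precision": 0.0, "coverage": 0.0}

    overlaps = [q & tokens(p) for p in passages]
    per_passage = [len(o) / len(q) for o in overlaps]
    covered = set().union(*overlaps)
    return {
        "per_passage": per_passage,
        "precision": sum(s >= threshold for s in per_passage) / len(passages),
        "coverage": len(covered) / len(q),
    }


result = context_relevance(
    "capital of France",
    ["Paris is the capital of France.", "Bananas are yellow."],
)
```

Here one of the two passages is a distractor, so `precision` drops while `coverage` stays high. Reporting the two separately helps distinguish "off-topic" from "insufficient" retrieval.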

2. Answer Faithfulness

What it measures: Does the generated output remain grounded in the retrieved evidence?

Key signals:

Claims traceable to specific retrieved passages
Absence of hallucinated facts
Accurate representation of source material (no distortion)

Failure modes: Model ignores evidence, misinterprets evidence, or generates plausible-sounding content not supported by retrieval.
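One way to make "claims traceable to retrieved passages" concrete is to split the answer into sentences and check each against the evidence. The sketch below is a rough approximation under stated assumptions: each sentence is treated as one claim, support is judged by word overlap, and the `faithfulness` function and its 0.7 threshold are hypothetical. Lexical overlap cannot detect distortion such as a flipped negation, so this is a first-pass filter rather than a substitute for human or model-based judgment.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_claims(answer: str) -> list[str]:
    """Treat each sentence as one claim (an assumption; real claims may span or split sentences)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def faithfulness(answer: str, passages: list[str], threshold: float = 0.7):
    """Return (score, unsupported_claims).

    A claim counts as supported when at least `threshold` of its tokens
    appear in a single retrieved passage.
    """
    claims = split_claims(answer)
    if not claims:
        return 1.0, []  # nothing asserted, so nothing unsupported

    passage_tokens = [tokens(p) for p in passages]
    unsupported = []
    for claim in claims:
        c = tokens(claim)
        best = max((len(c & p) / len(c) for p in passage_tokens), default=0.0) if c else 0.0
        if best < threshold:
            unsupported.append(claim)
    return 1 - len(unsupported) / len(claims), unsupported


score, unsupported = faithfulness(
    "The Eiffel Tower is in Paris. It was designed by aliens.",
    ["The Eiffel Tower is in Paris and was completed in 1889."],
)
```

Returning the unsupported claims alongside the score makes it easier to review candidate hallucinations by hand.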

3. Answer Relevance

What it measures: Does the output adequately address the user's query?

Key signals:

Directly answers the question asked
Appropriate level of detail
Actionable where expected

Failure modes: The answer is technically accurate but does not address the question, is too vague, or is tangential.
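A simple proxy for "directly answers the question asked" is the similarity between the question and the answer. The sketch below uses cosine similarity over raw term counts purely for illustration. The function names are hypothetical, and term counts are an assumed stand-in for the embedding-based or judge-based comparisons a production evaluation would more plausibly use. It says nothing about level of detail or actionability.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Bag-of-words term frequencies over lowercased word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def answer_relevance(question: str, answer: str) -> float:
    """Higher when the answer shares vocabulary with the question; 0.0 when disjoint."""
    return cosine(term_counts(question), term_counts(answer))


question = "How do I reset my password?"
on_topic = answer_relevance(question, "To reset your password, open Settings and choose Reset password.")
tangential = answer_relevance(question, "Our company was founded in 2010.")
```

The tangential answer may be perfectly accurate, yet it scores zero here because it shares no terms with the question, which is the "accurate but unhelpful" failure this dimension is meant to catch.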

Interdependencies

These dimensions cascade:

Poor context relevance → generator has wrong inputs → likely poor faithfulness and relevance
Good context relevance + poor faithfulness → accurate sources wasted
Good faithfulness + poor relevance → accurate but unhelpful

This means single-metric evaluation can be misleading: a system can score well on one dimension while failing on another.
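Because failures propagate downstream, it can help to check the dimensions in pipeline order and report the first one that falls short. The following is a minimal sketch of that idea. The `RagScores` container, the `diagnose` function, and the 0.7 threshold are all hypothetical, and scores are assumed to be normalized to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class RagScores:
    """Per-dimension scores, assumed to lie in [0, 1]."""
    context_relevance: float
    faithfulness: float
    answer_relevance: float


def diagnose(scores: RagScores, threshold: float = 0.7) -> str:
    """Return the most upstream dimension below threshold, or "ok".

    Upstream failures are reported first because downstream scores
    are hard to interpret when the generator had the wrong inputs.
    """
    if scores.context_relevance < threshold:
        return "context_relevance"
    if scores.faithfulness < threshold:
        return "faithfulness"
    if scores.answer_relevance < threshold:
        return "answer_relevance"
    return "ok"


# Strong retrieval, but the generator drifts from the evidence.
verdict = diagnose(RagScores(context_relevance=0.9, faithfulness=0.4, answer_relevance=0.8))
```

In this example a single headline number such as the mean (0.7) would pass the threshold even though faithfulness is clearly failing.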

When to Use

Debugging RAG failures: Identify which dimension is breaking down
Comparing systems: Ensure comparison captures all three dimensions, not just one
Setting evaluation strategy: Decide which dimension matters most for your use case
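For system comparison, one option is to report per-dimension differences alongside a weighted aggregate, so that the use-case priorities are explicit. The sketch below assumes scores are already available as dictionaries keyed by dimension. The function names, example scores, and weights are hypothetical and chosen only to illustrate the arithmetic.

```python
DIMENSIONS = ("context_relevance", "faithfulness", "answer_relevance")


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of the three dimension scores."""
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total


def compare(a: dict[str, float], b: dict[str, float], weights: dict[str, float]) -> dict:
    """Positive deltas favor system A; negative deltas favor system B."""
    return {
        "per_dimension_delta": {d: a[d] - b[d] for d in DIMENSIONS},
        "weighted_delta": weighted_score(a, weights) - weighted_score(b, weights),
    }


# Illustrative numbers only, not measured results.
system_a = {"context_relevance": 0.9, "faithfulness": 0.5, "answer_relevance": 0.8}
system_b = {"context_relevance": 0.7, "faithfulness": 0.8, "answer_relevance": 0.7}

equal = compare(system_a, system_b, {d: 1.0 for d in DIMENSIONS})
faithfulness_first = compare(
    system_a,
    system_b,
    {"context_relevance": 1.0, "faithfulness": 3.0, "answer_relevance": 1.0},
)
```

With equal weights the two systems tie on the aggregate, while weighting faithfulness more heavily favors system B. The per-dimension deltas show why, which the aggregate alone would hide.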

Limitations

Doesn’t capture efficiency, latency, or cost
Doesn’t address robustness under adversarial conditions
Faithfulness is hard to measure automatically; it requires careful ground-truth annotation or human evaluation

>heyMHK
