RAG Evaluation as Diagnostic, Not Scorecard

The Principle

Effective RAG evaluation decomposes system performance into component-specific metrics, each of which points to a concrete intervention, rather than a single quality score that tells you something is wrong without telling you where.

Why This Matters

A RAG system can fail in multiple ways:

  • Retriever returns irrelevant documents → Relevance is low
  • Retriever works but generator ignores the context → Utilization is low
  • Generator uses context but adds unsupported claims → Adherence is low
  • Generator is faithful but answers incompletely → Completeness is low

An end-to-end quality score obscures which component needs work. If your response is wrong, is it because you retrieved the wrong documents or because you synthesized the right documents poorly? Without decomposed metrics, you’re debugging blind.
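
To make the contrast concrete, here is a minimal sketch of what a decomposed result looks like. The RagEvalResult type and its metric names are hypothetical and simply mirror the failure modes listed above; the point is the shape of the output: four component-level scores that localize the failure, instead of one number that merely signals it.

```python
# A minimal sketch of a decomposed evaluation result. The metric names and the
# RagEvalResult type are illustrative assumptions, not an established API.
from dataclasses import dataclass


@dataclass
class RagEvalResult:
    relevance: float     # did the retriever return the right documents?
    utilization: float   # did the generator actually use the retrieved context?
    adherence: float     # is every claim in the answer supported by the context?
    completeness: float  # does the answer cover what the context supports?

    def weakest_component(self) -> str:
        """Name the metric (and hence the component) most in need of work."""
        scores = {
            "relevance": self.relevance,
            "utilization": self.utilization,
            "adherence": self.adherence,
            "completeness": self.completeness,
        }
        return min(scores, key=scores.get)


# A single end-to-end score of, say, 0.55 only says "something is wrong";
# this result says "the generator is ignoring the context it was given".
result = RagEvalResult(relevance=0.9, utilization=0.4, adherence=0.8, completeness=0.7)
print(result.weakest_component())  # -> "utilization"
```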

How to Apply

  1. Evaluate the retriever and the generator separately: Different interventions apply to each (see the first sketch after this list)
  2. Use utilization alongside adherence: A model can be faithful (no hallucination) while still underutilizing the information it was given
  3. Track metrics over time, per component: Identify which component is currently your bottleneck
  4. Match metric failures to interventions (see the second sketch after this list):
    • Low relevance → Improve retrieval (better embeddings, chunking, reranking)
    • Low utilization → Improve prompting or the generator
    • Low adherence → Add grounding guardrails or reduce temperature
    • Low completeness → Retrieve more context or improve synthesis
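
The first sketch below illustrates steps 1 and 2: one metric scored against the retriever's output and two scored against the generator's output, on the same example. The function names and the token-overlap heuristics are assumptions for illustration; in practice the generator-side metrics would typically come from an LLM judge or an NLI model rather than word overlap.

```python
# A minimal sketch of evaluating the retriever and the generator separately.
# retrieval_recall_at_k is standard recall@k against labeled relevant documents;
# the generator-side metrics below use crude token-overlap heuristics purely as
# placeholders for an LLM judge or NLI model.
import re


def retrieval_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of labeled-relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def context_utilization(answer: str, context: str) -> float:
    """Placeholder: share of the context vocabulary that shows up in the answer."""
    ctx = _tokens(context)
    return len(ctx & _tokens(answer)) / len(ctx) if ctx else 0.0


def context_adherence(answer: str, context: str) -> float:
    """Placeholder: share of the answer vocabulary that is grounded in the context."""
    ans = _tokens(answer)
    return len(ans & _tokens(context)) / len(ans) if ans else 1.0


# Usage: score each component on the same example, then aggregate over a dataset.
retrieved = ["doc-3", "doc-7", "doc-1"]
relevant = {"doc-3", "doc-9"}
context = "The warranty covers parts for two years."
answer = "Parts are covered for two years under the warranty."

print(retrieval_recall_at_k(retrieved, relevant))  # retriever-side metric
print(context_utilization(answer, context))        # generator-side metric
print(context_adherence(answer, context))          # generator-side metric
```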
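The second sketch shows step 4 as a simple lookup, so a failing metric immediately surfaces the action it calls for. The threshold and the intervention strings are illustrative assumptions, not calibrated values.

```python
# A minimal sketch of routing low metrics to the interventions they suggest.
# Thresholds and intervention text are illustrative, not calibrated.
INTERVENTIONS = {
    "relevance": "Improve retrieval: better embeddings, chunking, or reranking.",
    "utilization": "Improve prompting or the generator.",
    "adherence": "Add grounding guardrails or reduce temperature.",
    "completeness": "Retrieve more context or improve synthesis.",
}


def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return the intervention suggested by every metric below threshold."""
    return [
        f"{metric} = {value:.2f}: {INTERVENTIONS[metric]}"
        for metric, value in scores.items()
        if metric in INTERVENTIONS and value < threshold
    ]


# Usage with the example scores from the first sketch:
for suggestion in diagnose(
    {"relevance": 0.9, "utilization": 0.4, "adherence": 0.8, "completeness": 0.7}
):
    print(suggestion)
# -> utilization = 0.40: Improve prompting or the generator.
```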

When This Especially Matters

Any production RAG system where you need to improve quality systematically rather than guess at what's broken. Any system where the retrieval and generation components are owned by different teams.

Related: 05-atom—uniform-confidence-problem, 07-molecule—ui-as-ultimate-guardrail