RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems
Citation: Friel, R., Belyi, M., & Sanyal, A. (2024). RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. arXiv:2407.11005.
Core Contribution
The first comprehensive large-scale RAG benchmark (100k examples), paired with the TRACe evaluation framework: four metrics that provide explainable, actionable feedback on both the retriever and generator components.
Framing Analysis
The authors position evaluation as the bottleneck for RAG improvement. Despite RAG becoming a “standard architectural pattern” for incorporating domain-specific knowledge, there is no unified way to evaluate these systems. The framing is notable: they’re not proposing a better RAG architecture, but rather arguing that you can’t improve what you can’t measure properly.
The transferable insight in this framing: evaluation frameworks often lag behind system development. When a pattern becomes “standard,” the lack of evaluation infrastructure becomes the limiting factor, not the architecture itself.
Key Concepts
TRACe Framework
Four evaluation metrics that decompose RAG quality:
- uTilization: Did the generator actually use the retrieved information?
- Relevance: Did the retriever return context pertinent to the query?
- Adherence: Does the response stay grounded in the context (no hallucination)?
- Completeness: Does the response fully address what was asked?
Utilization, Adherence, and Completeness evaluate the generator. Relevance evaluates the retriever. This decomposition enables targeted debugging.
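As a concrete illustration, here is a minimal sketch of one way to represent TRACe scores for a single RAG example. It assumes span-level annotations reduced to token index sets; these are simplified stand-ins, not the paper’s exact definitions.

```python
from dataclasses import dataclass

@dataclass
class TRACeScores:
    relevance: float     # retriever: share of the context pertinent to the query
    utilization: float   # generator: share of the context actually drawn on
    adherence: bool      # generator: is every response claim grounded in context?
    completeness: float  # generator: share of the relevant context reflected in the response

def score_example(context_tokens: set[int],
                  relevant_tokens: set[int],
                  utilized_tokens: set[int],
                  all_claims_supported: bool) -> TRACeScores:
    """Toy span-level scoring: token index sets stand in for annotated spans."""
    relevance = len(relevant_tokens) / max(len(context_tokens), 1)
    utilization = len(utilized_tokens) / max(len(context_tokens), 1)
    completeness = (len(relevant_tokens & utilized_tokens)
                    / max(len(relevant_tokens), 1))
    return TRACeScores(relevance, utilization, all_claims_supported, completeness)
```

The point of the decomposition is visible in the structure: a low relevance score points at the retriever, while low utilization, completeness, or failed adherence point at the generator.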
Benchmark Design
- 100k examples across 5 industry domains (biomedical, legal, customer support, finance, general knowledge)
- 12 component datasets combined
- Context lengths from 100 to 11k tokens
- Real-world sourcing (user manuals, industry corpora)
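If the benchmark is distributed via the Hugging Face `datasets` library, a component dataset can be pulled in a few lines. The hub id `rungalileo/ragbench` and the `covidqa` config name below are assumptions about the public release, not details stated in this note.

```python
from datasets import load_dataset

# Load one of the 12 component datasets (assumed hub id and config name).
ragbench = load_dataset("rungalileo/ragbench", "covidqa", split="test")

for row in ragbench.select(range(3)):
    # Expected fields (assumed): question, retrieved documents, response,
    # plus TRACe-style annotations such as relevance/utilization/adherence.
    print(row.keys())
```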
Key Findings
- Fine-tuned small models beat LLM judges: A 400M-parameter DeBERTa model outperforms few-shot, billion-parameter LLM judges on RAG evaluation tasks, with AUROC scores of 0.64-0.86 for hallucination detection (a toy AUROC check is sketched after this list).
- LLMs struggle at meta-evaluation: LLMs are better at performing tasks than at evaluating task performance; the evaluation task is fundamentally different from the generation task.
- Domain transfer is possible: A model trained on general-knowledge data shows reasonable out-of-domain generalization, though performance degrades.
- A gap remains: Even the best-performing evaluator falls well short of ground truth, indicating room for improvement.
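On the hallucination-detection result: AUROC summarizes how well a judge ranks grounded responses above hallucinated ones. A toy sketch, with made-up labels and scores rather than RAGBench numbers:

```python
from sklearn.metrics import roc_auc_score

# Ground-truth adherence flags: 1 = response grounded in context, 0 = hallucinated.
adherent = [1, 1, 0, 1, 0, 0, 1, 0]
# A judge's predicted probability that each response is grounded (illustrative values).
judge_p_grounded = [0.9, 0.7, 0.4, 0.8, 0.75, 0.2, 0.65, 0.3]

print(f"AUROC: {roc_auc_score(adherent, judge_p_grounded):.2f}")
```

The same computation applies whether the judge is a fine-tuned encoder like DeBERTa or a prompted LLM, which is what makes the two directly comparable in the paper’s setup.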
Extracted Content
→ 05-atom—trace-framework-decomposition
→ 05-atom—small-models-beat-llm-judges
→ 05-atom—evaluation-lags-architecture
→ 05-molecule—rag-evaluation-as-diagnostic
→ 07-molecule—actionable-metrics-pattern