RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems
Citation: Friel, R., Belyi, M., & Sanyal, A. (2024). RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems. arXiv:2407.11005.
Core Contribution
The first comprehensive large-scale RAG benchmark (100k examples), paired with the TRACe evaluation framework: four metrics that provide explainable, actionable feedback on both the retriever and generator components.
Framing Analysis
The authors position evaluation as the bottleneck for RAG improvement. Despite RAG becoming a “standard architectural pattern” for incorporating domain-specific knowledge, there is no unified way to evaluate these systems. The framing is notable: they’re not proposing a better RAG architecture, but rather arguing that you can’t improve what you can’t measure properly.
The transferable insight in this framing: evaluation frameworks often lag behind system development. When a pattern becomes “standard,” the lack of evaluation infrastructure becomes the limiting factor, not the architecture itself.
Key Concepts
TRACe Framework
Four evaluation metrics that decompose RAG quality:
- uTilization: Did the generator actually use the retrieved information?
- Relevance: Did the retriever return context pertinent to the query?
- Adherence: Does the response stay grounded in the context (no hallucination)?
- Completeness: Does the response fully address what was asked?
Utilization, Adherence, and Completeness evaluate the generator. Relevance evaluates the retriever. This decomposition enables targeted debugging.
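As a concrete illustration, here is a minimal sketch of one way to represent TRACe scores for a single RAG example. It assumes span-level annotations reduced to token index sets; these are simplified stand-ins, not the paper’s exact definitions.

```python
from dataclasses import dataclass

@dataclass
class TRACeScores:
    relevance: float     # retriever: share of the context pertinent to the query
    utilization: float   # generator: share of the context actually drawn on
    adherence: bool      # generator: is every response claim grounded in context?
    completeness: float  # generator: share of the relevant context reflected in the response

def score_example(context_tokens: set[int],
                  relevant_tokens: set[int],
                  utilized_tokens: set[int],
                  all_claims_supported: bool) -> TRACeScores:
    """Toy span-level scoring: token index sets stand in for annotated spans."""
    relevance = len(relevant_tokens) / max(len(context_tokens), 1)
    utilization = len(utilized_tokens) / max(len(context_tokens), 1)
    completeness = (len(relevant_tokens & utilized_tokens)
                    / max(len(relevant_tokens), 1))
    return TRACeScores(relevance, utilization, all_claims_supported, completeness)
```

The point of the decomposition is visible in the structure: a low relevance score points at the retriever, while low utilization, completeness, or failed adherence point at the generator.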
Benchmark Design
- 100k examples across 5 industry domains (biomedical, legal, customer support, finance, general knowledge)
- 12 component datasets combined
- Context lengths from 100 to 11k tokens
- Real-world sourcing (user manuals, industry corpora)
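If the benchmark is distributed via the Hugging Face `datasets` library, a component dataset can be pulled in a few lines. The hub id `rungalileo/ragbench` and the `covidqa` config name below are assumptions about the public release, not details stated in this note.

```python
from datasets import load_dataset

# Load one of the 12 component datasets (assumed hub id and config name).
ragbench = load_dataset("rungalileo/ragbench", "covidqa", split="test")

for row in ragbench.select(range(3)):
    # Expected fields (assumed): question, retrieved documents, response,
    # plus TRACe-style annotations such as relevance/utilization/adherence.
    print(row.keys())
```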
Key Findings
- Fine-tuned small models beat LLM judges: A 400M-parameter DeBERTa model outperforms few-shot, billion-parameter LLM judges on RAG evaluation tasks, with AUROC scores of 0.64-0.86 for hallucination detection (a toy AUROC check is sketched after this list).
- LLMs struggle at meta-evaluation: LLMs are better at performing tasks than at evaluating task performance; the evaluation task is fundamentally different from the generation task.
- Domain transfer is possible: A model trained on general-knowledge data shows reasonable out-of-domain generalization, though performance degrades.
- A gap remains: Even the best-performing evaluator falls well short of ground truth, indicating room for improvement.
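On the hallucination-detection result: AUROC summarizes how well a judge ranks grounded responses above hallucinated ones. A toy sketch, with made-up labels and scores rather than RAGBench numbers:

```python
from sklearn.metrics import roc_auc_score

# Ground-truth adherence flags: 1 = response grounded in context, 0 = hallucinated.
adherent = [1, 1, 0, 1, 0, 0, 1, 0]
# A judge's predicted probability that each response is grounded (illustrative values).
judge_p_grounded = [0.9, 0.7, 0.4, 0.8, 0.75, 0.2, 0.65, 0.3]

print(f"AUROC: {roc_auc_score(adherent, judge_p_grounded):.2f}")
```

The same computation applies whether the judge is a fine-tuned encoder like DeBERTa or a prompted LLM, which is what makes the two directly comparable in the paper’s setup.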
Extracted Content
→ 05-atom—trace-framework-decomposition
→ 05-atom—small-models-beat-llm-judges
→ 05-atom—evaluation-lags-architecture
→ 05-molecule—rag-evaluation-as-diagnostic
→ 07-molecule—actionable-metrics-pattern