RAG Safety Evaluation Taxonomy

Overview

A framework for evaluating the safety of RAG systems across six dimensions. Unlike standalone LLM safety evaluation, RAG safety evaluation must also account for vulnerabilities introduced by the retrieval component and by corpus dynamics (the corpus changing over time).

The Six Dimensions

1. Robustness

How the system behaves when retrieval returns noisy or misleading information.

Key metrics (see the sketch after this list):

  • Resilience Rate: % of accurate responses maintained despite noisy context
  • Boost Rate: % of initially wrong answers corrected by retrieval
  • Misleading Rate: frequency of being misled by counterfactual documents
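
These three rates fall out of per-example correctness judgments. A minimal sketch, assuming boolean records of whether the model answered correctly without retrieval (base_correct) and with the perturbed context injected (rag_correct); the field names and record format are assumptions, not a fixed benchmark schema:

```python
# Robustness rates from per-example records. Each record notes whether the
# model was correct without retrieval (base_correct) and with the noisy or
# counterfactual context injected (rag_correct). Field names are illustrative.
def robustness_metrics(records):
    base_right = [r for r in records if r["base_correct"]]
    base_wrong = [r for r in records if not r["base_correct"]]

    # Resilience Rate: correct answers that survived the noisy context.
    resilience = sum(r["rag_correct"] for r in base_right) / max(len(base_right), 1)
    # Boost Rate: initially wrong answers corrected by retrieval.
    boost = sum(r["rag_correct"] for r in base_wrong) / max(len(base_wrong), 1)
    # Misleading Rate: correct answers flipped by the injected documents.
    misleading = sum(not r["rag_correct"] for r in base_right) / max(len(base_right), 1)

    return {"resilience": resilience, "boost": boost, "misleading": misleading}
```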

2. Factuality

Whether outputs are accurate and avoid hallucination.

Key metrics (citation scoring sketched after the list):

  • Hallucination Rate (often via LLM-as-judge)
  • Citation Accuracy (precision and recall)
  • Faithfulness to retrieved sources
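
Citation Accuracy, in particular, reduces to set precision and recall against gold supporting documents. A minimal sketch; the document ID scheme and gold labels are assumptions:

```python
# Citation precision/recall: cited document IDs vs. the gold support set.
def citation_accuracy(cited_ids, gold_ids):
    cited, gold = set(cited_ids), set(gold_ids)
    hits = len(cited & gold)
    precision = hits / len(cited) if cited else 0.0  # cited docs that are correct
    recall = hits / len(gold) if gold else 0.0       # gold docs that got cited
    return precision, recall
```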

3. Adversarial Resistance

Defense against targeted attacks on RAG components.

Attack vectors to evaluate:

  • Knowledge poisoning (malicious corpus injection)
  • Retrieval hijacking (ranking manipulation)
  • Phantom attacks (trigger-activated documents)
  • Jamming (forcing the system to refuse to respond)

Key metric: Attack Success Rate (ASR)
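
A minimal ASR sketch, assuming each attack attempt has already been judged a success or failure (the judging method, e.g. string match or an LLM judge, is itself an evaluation choice):

```python
# Attack Success Rate: fraction of attack attempts that achieved their goal.
def attack_success_rate(attack_results):
    """attack_results: list of booleans, True if the attack succeeded
    (e.g., the poisoned answer was returned or the response was jammed)."""
    if not attack_results:
        return 0.0
    return sum(attack_results) / len(attack_results)
```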

4. Privacy

Protection against information leakage.

Key metrics (a PII-leakage sketch follows the list):

  • Extraction Success Rate
  • PII Leakage Rate
  • Membership Inference Attack Success
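
Of these, PII Leakage Rate is the most mechanical to approximate. A sketch using pattern matching; the two regexes below are illustrative and would miss many real PII formats:

```python
import re

# Illustrative patterns only (email, US-style phone). A real evaluation
# would use a dedicated PII detector with much broader coverage.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
]

def pii_leakage_rate(responses):
    # Share of responses containing at least one PII-pattern match.
    leaked = sum(1 for r in responses if any(p.search(r) for p in PII_PATTERNS))
    return leaked / len(responses) if responses else 0.0
```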

5. Fairness

Whether the system exhibits or amplifies biases.

Key metrics (a counterfactual probe is sketched after the list):

  • Bias metrics across demographic groups
  • Stereotype detection frequency
  • Counterfactual fairness (do outputs change when sensitive attributes change?)
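
The counterfactual check can be run as a flip-rate probe: issue paired queries that differ only in a sensitive attribute and count diverging outputs. A sketch; the pairing and the answer-comparison function are assumptions:

```python
# Counterfactual fairness probe: how often does the answer change when
# only a sensitive attribute in the query changes?
def counterfactual_flip_rate(answer_fn, probes, same_fn):
    """probes: (query, swapped_query) pairs differing only in a sensitive
    attribute. same_fn compares two answers (exact match or an LLM judge;
    the comparison method is an assumption)."""
    flips = sum(
        1 for q, q_swapped in probes
        if not same_fn(answer_fn(q), answer_fn(q_swapped))
    )
    return flips / len(probes) if probes else 0.0
```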

6. Transparency

Ability to trace and verify reasoning.

Key metrics (a traceability sketch follows the list):

  • Explanation quality (human ratings)
  • Traceability to source documents
  • Citation accuracy
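
Traceability can be roughly scored as the share of answer sentences that match back to some retrieved passage. A sketch using token overlap as a crude stand-in for a proper entailment check; the threshold is arbitrary:

```python
# Crude traceability: fraction of answer sentences whose tokens
# substantially overlap at least one retrieved passage.
def traceability_score(answer_sentences, passages, threshold=0.6):
    def overlap(sentence, passage):
        s = set(sentence.lower().split())
        p = set(passage.lower().split())
        return len(s & p) / len(s) if s else 0.0

    traced = sum(
        1 for sent in answer_sentences
        if any(overlap(sent, p) >= threshold for p in passages)
    )
    return traced / len(answer_sentences) if answer_sentences else 0.0
```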

Application

Safety evaluation should occur at both component and system levels. A retriever that surfaces biased documents creates fairness issues even if the generator is unbiased. A generator that ignores retrieved context may be robust to poisoning but unfaithful.
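
A minimal sketch of such two-level scoring, with every function argument standing in for the system under test and whatever safety scorers (bias, poisoning, faithfulness) are being applied; none of these names come from a specific framework:

```python
# Two-level harness: score retrieved documents and end-to-end answers
# separately, so a biased retriever is caught even when the generator
# looks clean, and vice versa.
def two_level_eval(queries, retrieve, generate, doc_scorer, answer_scorer):
    doc_scores, answer_scores = [], []
    for q in queries:
        docs = retrieve(q)
        doc_scores.append(doc_scorer(docs))                    # e.g., bias in sources
        answer_scores.append(answer_scorer(generate(q, docs))) # e.g., bias in answer
    n = max(len(queries), 1)
    return {"component": sum(doc_scores) / n, "system": sum(answer_scores) / n}
```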

Current State

Research indicates current defenses are insufficient against sophisticated attacks. Safety evaluation remains less mature than performance evaluation, a gap that matters more as RAG systems handle sensitive applications.

Related: 04-atom—provenance-design