Chain-of-Verification
Overview
A four-step framework for reducing hallucinations in LLM outputs by having the model verify its own factual claims through self-generated questions.
Components
Step 1: Generate Baseline Response. Query the model directly, without special prompting. This produces an initial response that may contain hallucinations; it is the baseline to be improved.
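A minimal sketch of this step, assuming a hypothetical `llm(prompt)` helper that wraps whatever completion API is in use (the helper name and all prompt wording in these sketches are illustrative, not from the CoVe paper):

```python
def llm(prompt: str) -> str:
    """Placeholder for a completion call; swap in a real model client."""
    raise NotImplementedError("wire up your model provider here")

def generate_baseline(query: str) -> str:
    # Step 1: plain query, no special prompting. The output may hallucinate.
    return llm(query)
```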
Step 2: Plan Verification Questions. Given the original query and baseline response, prompt the model to generate questions that would test the factual claims made. These are not templated; the model phrases them naturally.
Example: If the baseline states “The Mexican-American War was from 1846 to 1848,” a verification question might be “When did the Mexican-American War start and end?”
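Continuing the sketch (`llm` is the hypothetical helper from Step 1), one way to elicit verification questions:

```python
def plan_verifications(query: str, baseline: str) -> list[str]:
    # Step 2: ask the model to probe the draft's claims with open-ended
    # questions, one per line (prompt wording is illustrative).
    prompt = (
        f"Question: {query}\n"
        f"Draft answer: {baseline}\n"
        "Write one verification question per line that would check each "
        "factual claim in the draft answer. Prefer open-ended questions "
        "over yes/no questions."
    )
    return [q.strip() for q in llm(prompt).splitlines() if q.strip()]
```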
Step 3: Execute Verification. Answer each verification question independently (critically, without the baseline response in context). This isolation prevents the model from simply confirming its prior assertions; answers should be factual, not influenced by what was claimed before.
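The isolation requirement maps directly onto code: each verification call sees only the question, never the baseline. A sketch using the same hypothetical `llm` helper:

```python
def execute_verifications(questions: list[str]) -> list[tuple[str, str]]:
    # Step 3: answer each question in a fresh context. The baseline is
    # deliberately absent, so the model cannot simply confirm itself.
    return [(question, llm(question)) for question in questions]
```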
Step 4: Generate Final Verified Response. Cross-check baseline claims against the verification answers, discard inconsistent facts, and regenerate the response using only verified information.
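The cross-check can itself be sketched as one more model call that sees the baseline alongside the independent answers and is told to keep only consistent facts; the pipeline then reads end to end (prompt wording again assumed):

```python
def generate_verified_response(
    query: str, baseline: str, verifications: list[tuple[str, str]]
) -> str:
    # Step 4: regenerate, keeping only claims the independent answers support.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    prompt = (
        f"Original question: {query}\n"
        f"Draft answer: {baseline}\n"
        f"Independent verification Q&A:\n{evidence}\n"
        "Rewrite the draft answer, dropping any claim that the verification "
        "answers contradict or fail to support."
    )
    return llm(prompt)

# End-to-end: baseline -> questions -> independent answers -> verified answer.
def chain_of_verification(query: str) -> str:
    baseline = generate_baseline(query)
    questions = plan_verifications(query, baseline)
    answers = execute_verifications(questions)
    return generate_verified_response(query, baseline, answers)
```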
When to Use
Chain-of-Verification is particularly effective for:
- List-based questions (naming items in a category)
- Closed-book QA (factual questions without retrieval)
- Long-form generation (biographies, explanations)
- Any task where factual accuracy matters more than fluency
Limitations
- Multiple LLM calls increase latency and cost
- Effectiveness depends on the model’s ability to generate good verification questions
- Addresses only factual inaccuracies; other failure modes (e.g., logical errors) are untouched
- Yes/no verification questions perform worse than open-ended ones (models tend to agree with stated facts regardless of accuracy)
Performance
CoVe improves F1 scores by up to 23% on closed-book QA. For long-form generation, CoVe-based Llama outperformed InstructGPT, ChatGPT, and Perplexity AI on factual accuracy, and it delivered at least 10% gains over Chain-of-Thought on context-free QA tasks.
Related: 05-atom—shortform-accuracy-advantage, 05-molecule—metacognitive-prompting, 05-molecule—chain-of-thought-prompting