The Capability-Alignment Gap
The Principle
Supervised fine-tuning creates a dangerous mismatch: models learn to respond confidently to queries that exceed their actual knowledge, because the training process rewards completing responses rather than expressing appropriate uncertainty.
Why This Matters
Pre-training establishes what a model knows. SFT teaches the model how to interact. The problem emerges when SFT examples demand knowledge the model doesn’t have.
A model that absorbed the fact that Paris is the capital of France during pre-training, and then sees the corresponding Q&A pair in SFT, learns a useful pattern: retrieve and format what it already knows. A model that never encountered specific medical knowledge during pre-training but is trained on medical Q&A examples learns a dangerous one: generate confident, medical-sounding responses regardless of actual knowledge.
The training signal doesn’t distinguish between “correctly recalling knowledge” and “successfully completing the response format.” Both look like low loss.
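A minimal toy sketch of that point, with a three-token answer vocabulary and hypothetical logits as stand-ins: the SFT objective scores only agreement with the labeled tokens, so on an out-of-scope question the gradient pushes the model toward the confident answer and away from expressing uncertainty.

```python
import torch
import torch.nn.functional as F

# Hypothetical three-token answer vocabulary.
VOCAB = {"Paris": 0, "Lyon": 1, "I_dont_know": 2}

# Next-token logits from a base model that genuinely doesn't know:
# near-uniform over candidate answers, with some mass on uncertainty.
base_logits = torch.tensor([[0.1, 0.1, 1.0]])

# The SFT label is the confident answer, whether or not the model knows it.
target = torch.tensor([VOCAB["Paris"]])

loss = F.cross_entropy(base_logits, target)
print(loss.item())  # high loss: the gradient shifts mass toward "Paris"

# Nothing in this objective asks whether "Paris" is grounded in
# pre-training; minimizing it on out-of-scope Q&A trains confident
# format completion, not recall.
```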
The Compounding Problem
This interacts with sycophancy from RLHF. Not only does the model lack knowledge boundaries, it is also actively rewarded for avoiding expressions of uncertainty: human raters tend to prefer confident, complete-sounding answers, so “I don’t know” earns lower reward than a confident (and potentially wrong) answer.
The result: Models that will confidently answer almost anything, with no internal signal distinguishing known from unknown territory.
How to Apply
In fine-tuning:
- Audit SFT data for queries that exceed pre-training knowledge scope
- Include explicit “I don’t know” examples in training data (see the sketch after this list)
- Train refusal behaviors for recognized knowledge gaps
- Consider knowledge boundaries when selecting instruction domains
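A minimal sketch of the second point; the records and the rough audit helper below are hypothetical illustrations, not a recommended heuristic:

```python
# Hypothetical SFT records: in-scope Q&A mixed with explicit refusal
# targets, so "I don't know" is a trained behavior rather than an accident.
sft_examples = [
    {"prompt": "What is the capital of France?",
     "response": "Paris."},
    {"prompt": "What did the 2031 WHO trial of drug X conclude?",  # made-up, out-of-scope query
     "response": "I don't have reliable information about that trial, "
                 "so I can't answer."},
]

def refusal_fraction(examples) -> float:
    """Rough audit: what share of the data trains uncertainty expression?"""
    refusals = sum("I don't" in ex["response"] for ex in examples)
    return refusals / len(examples)

print(refusal_fraction(sft_examples))  # 0.5 in this toy set
```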
In deployment:
- Don’t assume model confidence correlates with accuracy
- Design interfaces that surface uncertainty where possible
- Consider retrieval augmentation for knowledge-intensive queries
- Build verification loops for high-stakes outputs (sketched below)
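A minimal sketch of such a loop, with `generate` and `verify` as hypothetical stand-ins for the model call and a retrieval-backed fact check:

```python
from typing import Callable

def answer_with_verification(
    query: str,
    generate: Callable[[str], str],      # model call (hypothetical stand-in)
    verify: Callable[[str, str], bool],  # retrieval-backed check (hypothetical)
    fallback: str = "I can't verify an answer to that.",
    max_attempts: int = 2,
) -> str:
    """Return a draft only if it survives verification; otherwise
    surface uncertainty instead of a confident guess."""
    for _ in range(max_attempts):
        draft = generate(query)
        if verify(query, draft):
            return draft
    return fallback

# Toy usage with stub callables:
print(answer_with_verification(
    "What is the capital of France?",
    generate=lambda q: "Paris",
    verify=lambda q, d: d == "Paris",  # stands in for checking sources
))
```

The key design choice is that failure routes to an explicit uncertainty response rather than returning the unverified draft.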
In evaluation:
- Test specifically at knowledge boundaries
- Measure calibration, not just accuracy (see the ECE sketch after this list)
- Check for appropriate refusal behavior on out-of-scope queries
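One common calibration metric is expected calibration error (ECE): bin predictions by stated confidence and take the weighted average gap between confidence and observed accuracy. A minimal NumPy sketch, with equal-width binning as an assumption:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between stated confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges so that 0.0 and 1.0 both land in a bin.
    bin_idx = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Two systems with identical accuracy can differ sharply here if one
# states high confidence on the answers it gets wrong.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.60], [1, 0, 1, 1]))
```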
When This Especially Matters
- Domain-specific deployments (medical, legal, financial) where pre-training gaps are predictable
- User-facing systems where confident wrong answers cause harm
- Fine-tuning projects where instruction data scope isn’t carefully matched to base model knowledge
- Any application where “I don’t know” is a valid and preferable response
Exceptions and Nuances
The gap isn’t always problematic:
- Creative tasks may benefit from confident generation beyond “knowledge”
- Reasoning tasks can involve valid extrapolation from known facts
- Some SFT domains genuinely activate and structure existing knowledge
The danger is specifically with factual claims presented confidently when the underlying knowledge isn’t there.
Related: 05-atom—knowledge-boundary-problem, 05-atom—sycophancy-behavior, 05-molecule—hallucination-causes-lifecycle, 05-molecule—llm-hallucination-taxonomy