The Capability-Alignment Gap
The Principle
Supervised fine-tuning creates a dangerous mismatch: models learn to respond confidently to queries that exceed their actual knowledge, because the training process rewards completing responses rather than expressing appropriate uncertainty.
Why This Matters
Pre-training establishes what a model knows. SFT teaches the model how to interact. The problem emerges when SFT examples demand knowledge the model doesn’t have.
A model that absorbed the fact that Paris is the capital of France during pre-training, and then sees the corresponding Q&A pair in SFT, learns a useful pattern: retrieve and format what it already knows. A model that never encountered specific medical knowledge during pre-training but is trained on medical Q&A examples learns a dangerous one: generate confident, medical-sounding responses regardless of actual knowledge.
The training signal doesn’t distinguish between “correctly recalling knowledge” and “successfully completing the response format.” Both look like low loss.
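A minimal toy sketch of that point, with a three-token answer vocabulary and hypothetical logits as stand-ins: the SFT objective scores only agreement with the labeled tokens, so on an out-of-scope question the gradient pushes the model toward the confident answer and away from expressing uncertainty.

```python
import torch
import torch.nn.functional as F

# Hypothetical three-token answer vocabulary.
VOCAB = {"Paris": 0, "Lyon": 1, "I_dont_know": 2}

# Next-token logits from a base model that genuinely doesn't know:
# near-uniform over candidate answers, with some mass on uncertainty.
base_logits = torch.tensor([[0.1, 0.1, 1.0]])

# The SFT label is the confident answer, whether or not the model knows it.
target = torch.tensor([VOCAB["Paris"]])

loss = F.cross_entropy(base_logits, target)
print(loss.item())  # high loss: the gradient shifts mass toward "Paris"

# Nothing in this objective asks whether "Paris" is grounded in
# pre-training; minimizing it on out-of-scope Q&A trains confident
# format completion, not recall.
```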
The Compounding Problem
This interacts with sycophancy from RLHF. Not only does the model lack knowledge boundaries, it is also actively rewarded for avoiding expressions of uncertainty: human raters tend to prefer confident, complete-sounding answers, so “I don’t know” earns lower reward than a confident (and potentially wrong) answer.
The result: Models that will confidently answer almost anything, with no internal signal distinguishing known from unknown territory.
How to Apply
In fine-tuning:
- Audit SFT data for queries that exceed pre-training knowledge scope
- Include explicit “I don’t know” examples in training data (see the sketch after this list)
- Train refusal behaviors for recognized knowledge gaps
- Consider knowledge boundaries when selecting instruction domains
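A minimal sketch of the second point; the records and the rough audit helper below are hypothetical illustrations, not a recommended heuristic:

```python
# Hypothetical SFT records: in-scope Q&A mixed with explicit refusal
# targets, so "I don't know" is a trained behavior rather than an accident.
sft_examples = [
    {"prompt": "What is the capital of France?",
     "response": "Paris."},
    {"prompt": "What did the 2031 WHO trial of drug X conclude?",  # made-up, out-of-scope query
     "response": "I don't have reliable information about that trial, "
                 "so I can't answer."},
]

def refusal_fraction(examples) -> float:
    """Rough audit: what share of the data trains uncertainty expression?"""
    refusals = sum("I don't" in ex["response"] for ex in examples)
    return refusals / len(examples)

print(refusal_fraction(sft_examples))  # 0.5 in this toy set
```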
In deployment:
- Don’t assume model confidence correlates with accuracy
- Design interfaces that surface uncertainty where possible
- Consider retrieval augmentation for knowledge-intensive queries
- Build verification loops for high-stakes outputs (sketched below)
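A minimal sketch of such a loop, with `generate` and `verify` as hypothetical stand-ins for the model call and a retrieval-backed fact check:

```python
from typing import Callable

def answer_with_verification(
    query: str,
    generate: Callable[[str], str],      # model call (hypothetical stand-in)
    verify: Callable[[str, str], bool],  # retrieval-backed check (hypothetical)
    fallback: str = "I can't verify an answer to that.",
    max_attempts: int = 2,
) -> str:
    """Return a draft only if it survives verification; otherwise
    surface uncertainty instead of a confident guess."""
    for _ in range(max_attempts):
        draft = generate(query)
        if verify(query, draft):
            return draft
    return fallback

# Toy usage with stub callables:
print(answer_with_verification(
    "What is the capital of France?",
    generate=lambda q: "Paris",
    verify=lambda q, d: d == "Paris",  # stands in for checking sources
))
```

The key design choice is that failure routes to an explicit uncertainty response rather than returning the unverified draft.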
In evaluation:
- Test specifically at knowledge boundaries
- Measure calibration, not just accuracy (see the ECE sketch after this list)
- Check for appropriate refusal behavior on out-of-scope queries
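One common calibration metric is expected calibration error (ECE): bin predictions by stated confidence and take the weighted average gap between confidence and observed accuracy. A minimal NumPy sketch, with equal-width binning as an assumption:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between stated confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges so that 0.0 and 1.0 both land in a bin.
    bin_idx = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Two systems with identical accuracy can differ sharply here if one
# states high confidence on the answers it gets wrong.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.60], [1, 0, 1, 1]))
```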
When This Especially Matters
- Domain-specific deployments (medical, legal, financial) where pre-training gaps are predictable
- User-facing systems where confident wrong answers cause harm
- Fine-tuning projects where instruction data scope isn’t carefully matched to base model knowledge
- Any application where “I don’t know” is a valid and preferable response
Exceptions and Nuances
The gap isn’t always problematic:
- Creative tasks may benefit from confident generation beyond “knowledge”
- Reasoning tasks can involve valid extrapolation from known facts
- Some SFT domains genuinely activate and structure existing knowledge
The danger is specifically with factual claims presented confidently when the underlying knowledge isn’t there.
Related: 05-atom—knowledge-boundary-problem, 05-atom—sycophancy-behavior, 05-molecule—hallucination-causes-lifecycle, 05-molecule—llm-hallucination-taxonomy