Performance Feedback Spectrum
Overview
A framework for understanding the different levels of performance information that AI systems can communicate to users, and the distinct effects each level produces.
The Spectrum
| Level | What’s Communicated | Effect on Performance | Effect on Trust Calibration |
|---|---|---|---|
| None | Just the prediction | Baseline | Poor (users can’t assess reliability) |
| Overall Accuracy | “This model is 80% accurate” | Modest improvement | Minimal; too abstract for case-specific judgments |
| Confidence Score | “87% confident in this prediction” | Significant improvement | Moderate; helps distinguish high/low certainty cases |
| Contextual Awareness | “For similar cases, accuracy is 92%” | Significant improvement | Best; provides a case-relevant calibration signal |
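The four levels above differ only in what is attached to the prediction. A minimal sketch of rendering each one, with all names and example values hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str                     # the model's prediction itself
    confidence: float              # model confidence for this case
    overall_accuracy: float        # accuracy on a held-out set
    similar_case_accuracy: float   # accuracy on cases similar to this one

def feedback_message(pred: Prediction, level: str) -> str:
    """Render performance feedback at one level of the spectrum."""
    if level == "none":
        return pred.label
    if level == "overall":
        return f"{pred.label} (model is {pred.overall_accuracy:.0%} accurate overall)"
    if level == "confidence":
        return f"{pred.label} ({pred.confidence:.0%} confident in this prediction)"
    if level == "contextual":
        return f"{pred.label} (for similar cases, accuracy is {pred.similar_case_accuracy:.0%})"
    raise ValueError(f"unknown level: {level}")

p = Prediction("high income", confidence=0.87,
               overall_accuracy=0.80, similar_case_accuracy=0.92)
print(feedback_message(p, "contextual"))
# high income (for similar cases, accuracy is 92%)
```

Note that only the last two levels vary per case; the first two are constant across all predictions, which is why they calibrate trust poorly.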
Why the Levels Differ
Overall accuracy is too coarse. Knowing a model is “80% accurate” doesn’t help you decide whether this specific case falls in the 80% or the 20%.
Confidence scores are mathematically derived but not grounded in interpretable reasoning. They help users differentiate but don’t explain why a case might be uncertain.
Contextual awareness provides what confidence scores lack: a reference class. “The model struggles with cases like yours” is qualitatively different information from “the model is 73% confident.”
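One way to ground a reference class is to estimate accuracy over the validation cases nearest to the current case. A minimal sketch, assuming numeric feature vectors and Euclidean distance (the function name and interface are illustrative, not the study's method):

```python
import math

def similar_case_accuracy(case, val_cases, val_correct, k=5):
    """Estimate accuracy on the reference class: the k validation
    cases nearest to `case` in feature space.

    case       -- feature vector for the current case
    val_cases  -- list of validation feature vectors
    val_correct-- 1/0 per validation case: was the model right?
    """
    dists = [math.dist(case, v) for v in val_cases]
    nearest = sorted(range(len(val_cases)), key=lambda i: dists[i])[:k]
    return sum(val_correct[i] for i in nearest) / k
```

The returned fraction is the “for similar cases, accuracy is X%” signal; a production system would need a more careful similarity metric and calibration checks.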
When to Use Which Level
| Scenario | Recommended Level | Rationale |
|---|---|---|
| Low-stakes, high-volume decisions | Confidence Score | Fast, good enough |
| High-stakes, human-override expected | Contextual Awareness | Calibration matters more than speed |
| Regulatory/audit requirements | Contextual Awareness + Overall | Explainability requirements |
| User population struggles with probabilities | Contextual Awareness | Reference classes more intuitive than percentages |
Limitations
- The study tested these levels in an income prediction task; generalization to other domains is assumed but not proven
- Contextual awareness requires additional infrastructure (flaw detection model)
- The differences between confidence and awareness, while statistically significant, were modest in magnitude
Design Implications
If your AI system reports only confidence scores, you’re leaving calibration on the table. The marginal cost of adding contextual awareness may be worth the marginal improvement in human-AI teaming outcomes, especially in domains where confidently wrong predictions are costly.
Related: 05-molecule—self-assessing-ai-pattern, 07-molecule—ui-as-ultimate-guardrail, 05-atom—uniform-confidence-problem