Performance Feedback Spectrum

Overview

A framework for understanding the different levels of performance information that AI systems can communicate to users, and the distinct effects each level produces.

The Spectrum

| Level | What's Communicated | Effect on Performance | Effect on Trust Calibration |
|---|---|---|---|
| None | Just the prediction | Baseline | Poor (users can't assess reliability) |
| Overall Accuracy | "This model is 80% accurate" | Modest improvement | Minimal; too abstract for case-specific judgments |
| Confidence Score | "87% confident in this prediction" | Significant improvement | Moderate; helps distinguish high- from low-certainty cases |
| Contextual Awareness | "For similar cases, accuracy is 92%" | Significant improvement | Best; provides a case-relevant calibration signal |
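The four levels can be sketched as message templates. This is an illustrative sketch only; the `Prediction` record and its fields are hypothetical names, not from any existing API:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    value: str                 # the model's prediction
    confidence: float          # per-case confidence, 0..1
    overall_accuracy: float    # model-wide accuracy, 0..1
    reference_accuracy: float  # accuracy on similar past cases, 0..1

def feedback_message(p: Prediction, level: str) -> str:
    """Render a prediction at one of the four feedback levels."""
    if level == "none":
        return f"Prediction: {p.value}"
    if level == "overall":
        return f"Prediction: {p.value} (this model is {p.overall_accuracy:.0%} accurate)"
    if level == "confidence":
        return f"Prediction: {p.value} ({p.confidence:.0%} confident in this prediction)"
    if level == "contextual":
        return f"Prediction: {p.value} (for similar cases, accuracy is {p.reference_accuracy:.0%})"
    raise ValueError(f"unknown level: {level}")
```

Note that the first three levels are static or model-derived, while the contextual message requires a per-case reference-class estimate, which is where the extra infrastructure cost comes in.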

Why the Levels Differ

Overall accuracy is too coarse. Knowing a model is “80% accurate” doesn’t help you decide whether this specific case falls in the 80% or the 20%.

Confidence scores are mathematically derived but not grounded in interpretable reasoning. They help users differentiate but don’t explain why a case might be uncertain.
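A common way such scores are derived is the top softmax probability over the model's raw outputs; a minimal sketch (using only the standard library, and noting that raw softmax confidence is itself often miscalibrated):

```python
import math

def softmax(logits):
    """Convert raw model scores into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_score(logits):
    """Per-case confidence as the top softmax probability.

    Caveat: this number is mathematically derived, not interpretable
    reasoning, and usually needs post-hoc calibration (e.g. temperature
    scaling) before being shown to users.
    """
    return max(softmax(logits))
```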

Contextual awareness provides what confidence scores lack: a reference class. “The model struggles with cases like yours” is qualitatively different information than “the model is 73% confident.”
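One simple way to build such a reference class is to measure the model's past accuracy on the validation cases nearest the current one. This is a sketch under assumed inputs (feature tuples plus a correctness flag, Euclidean distance); a production system might instead use a learned similarity or a dedicated flaw-detection model, as noted in the Limitations section:

```python
def reference_class_accuracy(case, validation_cases, k=25):
    """Accuracy of the model on the k validation cases most similar
    to `case` -- one way to produce the 'contextual awareness' signal.

    `validation_cases` is a list of (features, was_correct) pairs,
    where features are numeric tuples and was_correct is a bool.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    nearest = sorted(validation_cases, key=lambda vc: dist(case, vc[0]))[:k]
    return sum(was_correct for _, was_correct in nearest) / len(nearest)
```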

When to Use Which Level

| Scenario | Recommended Level | Rationale |
|---|---|---|
| Low-stakes, high-volume decisions | Confidence Score | Fast and good enough |
| High-stakes, human override expected | Contextual Awareness | Calibration matters more than speed |
| Regulatory/audit requirements | Contextual Awareness + Overall Accuracy | Meets explainability requirements |
| User population struggles with probabilities | Contextual Awareness | Reference classes are more intuitive than percentages |
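The scenario table above can be expressed as a small routing function. The boolean flags are illustrative names for the scenario properties, not part of any standard interface:

```python
def recommended_level(high_stakes: bool = False,
                      needs_audit: bool = False,
                      users_prefer_reference_classes: bool = False) -> str:
    """Map scenario properties to a feedback level, following the
    recommendation table above. Audit requirements dominate; then
    stakes and user population; confidence scores are the default."""
    if needs_audit:
        return "contextual + overall"
    if high_stakes or users_prefer_reference_classes:
        return "contextual"
    return "confidence"
```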

Limitations

  • The study tested these levels in an income prediction task; generalization to other domains is assumed but not proven
  • Contextual awareness requires additional infrastructure (flaw detection model)
  • The differences between confidence and awareness, while statistically significant, were modest in magnitude

Design Implications

If your AI system reports only confidence scores, you're leaving calibration on the table. The marginal cost of adding contextual awareness may be worth the marginal improvement in human-AI teaming outcomes, especially in domains where overconfident wrong predictions are costly.

Related: 05-molecule—self-assessing-ai-pattern, 07-molecule—ui-as-ultimate-guardrail, 05-atom—uniform-confidence-problem