Li et al. 2024 - Understanding the Effects of Miscalibrated AI Confidence

Citation

Li, J., Yang, Y., Zhang, R., Liao, Q. V., Song, T., Xu, Z., & Lee, Y. (2024). Understanding the Effects of Miscalibrated AI Confidence on User Trust, Reliance, and Decision Efficacy. arXiv:2402.07632.

Core Question

When AI confidence scores don’t accurately reflect correctness likelihood, what happens to user trust, reliance, and decision quality, and can transparency about miscalibration help?
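
For reference, a confidence score is well-calibrated when it matches the empirical accuracy at that level: an AI that says "70%" should be right about 70% of the time. A minimal sketch, not taken from the paper, of how this gap is commonly quantified via Expected Calibration Error:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Mean |stated confidence - observed accuracy| gap, weighted by bin size.

    confidences: stated confidence scores in [0, 1]
    correct:     0/1 flags, 1 if the AI's answer was right on that trial
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of trials in the bin
    return ece
```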

Framing Analysis

The authors position this as addressing a gap in HCI research: prior studies assume AI confidence is well-calibrated, but real-world systems are often miscalibrated. The framing itself reveals how human factors research often builds on idealized technical assumptions that don’t hold in deployment.

Key Findings

Experiment 1 (N=126)

  • Users cannot detect miscalibration: Most participants rated both overconfident and underconfident AI as “well-calibrated”
  • Overconfident AI → over-reliance: Users switched to the AI's advice more often, even when that advice was incorrect (see the reliance sketch after this list)
  • Underconfident AI → under-reliance: Users ignored correct AI advice more often
  • Both directions harm decision efficacy: Accuracy gains from AI collaboration decreased with miscalibrated systems
  • Trust levels unchanged: Miscalibration didn’t affect stated trust; users couldn’t perceive the problem
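
Over- and under-reliance in this literature are typically scored from switching behavior on trials where the participant and the AI initially disagree. A hedged sketch of such switch-based measures (an assumption about the operationalization, not the paper’s exact metrics):

```python
def reliance_rates(initial, ai_advice, final, truth):
    """Switch-based over-/under-reliance rates (illustrative, not the paper's exact metrics).

    Each argument is a per-trial list:
      initial   - participant's answer before seeing the AI
      ai_advice - the AI's suggested answer
      final     - participant's answer after seeing the AI
      truth     - the correct answer
    """
    over = over_opps = under = under_opps = 0
    for init, ai, fin, gt in zip(initial, ai_advice, final, truth):
        if init == ai:
            continue  # only disagreement trials reveal reliance
        if ai != gt:   # AI is wrong: switching to it is over-reliance
            over_opps += 1
            over += int(fin == ai)
        else:          # AI is right: refusing to switch is under-reliance
            under_opps += 1
            under += int(fin != ai)
    over_rate = over / over_opps if over_opps else float("nan")
    under_rate = under / under_opps if under_opps else float("nan")
    return over_rate, under_rate
```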

Experiment 2 (N=126)

  • Transparency helps detection: Telling users about calibration levels helped them recognize miscalibration
  • Transparency reduces trust: Users trusted miscalibrated AI less when informed that it was miscalibrated
  • But creates under-reliance: Informed users under-relied on both overconfident AND underconfident AI
  • No efficacy improvement: Knowing about miscalibration didn’t improve decision outcomes

Transferable Insights

  1. Transparency can trade one problem for another (over-reliance → under-reliance)
  2. User awareness doesn’t automatically enable appropriate action
  3. Displaying a confidence score does not by itself make the AI’s uncertainty visible or interpretable to users
  4. Miscalibration creates asymmetric failure modes (overconfidence drives over-reliance; underconfidence drives under-reliance)

Methodological Notes

  • Simulated AI with controlled accuracy (70%) and stated confidence levels (60%, 70%, 80%); see the stimulus sketch after this list
  • City image recognition task (minimal domain expertise required)
  • Between-subjects design across calibration conditions
  • Measured trust (attitude), reliance (behavior), and decision efficacy (accuracy gain relative to deciding without the AI)
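
The stimulus design above can be made concrete with a small simulation: the AI is right 70% of the time in every condition, and only the stated confidence (60%, 70%, or 80%) changes, yielding underconfident, calibrated, and overconfident conditions. A minimal sketch under those assumptions (function and variable names are illustrative, not from the paper’s materials):

```python
import random

def simulate_condition(n_trials=30, accuracy=0.70, stated_confidence=0.80, seed=0):
    """Generate trials for one calibration condition.

    With accuracy fixed at 70%: stated confidence 0.60 -> underconfident,
    0.70 -> calibrated, 0.80 -> overconfident.
    """
    rng = random.Random(seed)
    return [
        {"ai_correct": rng.random() < accuracy, "confidence": stated_confidence}
        for _ in range(n_trials)
    ]

def decision_efficacy(team_accuracy, human_alone_accuracy):
    """Decision efficacy as used above: the accuracy gain from deciding with the AI."""
    return team_accuracy - human_alone_accuracy
```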

Connections

Extraction Status

  • Source file created
  • Atoms extracted
  • Molecules created
  • Organism drafted