Stress-Testing HCI Research Against Technical Constraints

Context

Human-AI interaction research often assumes idealized technical conditions: well-calibrated confidence, accurate explanations, reliable outputs. Real deployed systems frequently violate these assumptions.

Problem

Research findings about “how users respond to AI” may not generalize to production systems where AI behaves differently than the idealized experimental versions. Recommendations built on idealized assumptions may backfire in deployment.

Solution

Explicitly identify and test against realistic technical limitations:

  1. Catalog the assumptions: What properties must the AI have for your design recommendations to work? (calibrated confidence, accurate explanations, consistent behavior, etc.)

  2. Research the prevalence: How often do real systems violate these assumptions? The Li et al. study notes that many ML algorithms produce miscalibrated confidence; this is the norm, not an edge case.

  3. Test degraded conditions: Run studies where the AI violates the ideal assumptions. What happens to user behavior when confidence is systematically wrong? When explanations are inaccurate?

  4. Design for robustness: Prefer interventions that work even when technical ideals aren’t met, or that fail gracefully rather than creating opposite problems.
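To make step 3 concrete, here is a minimal sketch (all names hypothetical, not from the source) of how a study apparatus might implement a "miscalibrated AI" condition: the simulated assistant is correct at a fixed base rate but reports inflated confidence, so researchers can compare user reliance against a calibrated control condition.

```python
import random

def miscalibrated_confidence(true_prob: float, inflation: float = 0.25) -> float:
    """Simulate systematic overconfidence: report a confidence score
    shifted above the true correctness probability, capped at 1.0."""
    return min(1.0, true_prob + inflation)

def simulate_trial(true_prob: float, condition: str) -> dict:
    """One study trial: the AI is correct with probability `true_prob`,
    but its reported confidence depends on the experimental condition."""
    correct = random.random() < true_prob
    if condition == "calibrated":
        reported = true_prob
    else:
        reported = miscalibrated_confidence(true_prob)
    return {"correct": correct, "reported_confidence": reported}

random.seed(0)
trials = [simulate_trial(0.7, "miscalibrated") for _ in range(1000)]
mean_reported = sum(t["reported_confidence"] for t in trials) / len(trials)
accuracy = sum(t["correct"] for t in trials) / len(trials)
# Reported confidence sits at 0.95 while actual accuracy stays near 0.70 --
# the gap users would experience in the degraded condition.
print(mean_reported, accuracy)
```

The same harness can model underconfidence (negative `inflation`) or inconsistency (resampling outputs for identical inputs), covering the other rows of the Examples table.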

Consequences

Benefits:

  • Research findings more likely to transfer to deployment
  • Earlier identification of failure modes
  • More honest assessment of intervention effectiveness
  • Bridges gap between HCI and ML research communities

Costs:

  • Increased study complexity
  • May produce messier, less conclusive findings
  • Requires understanding of technical systems, not just human factors

Examples

| Idealized Assumption | Realistic Constraint | Research Implication |
| --- | --- | --- |
| Calibrated confidence | Systematic over/underconfidence | Test trust/reliance under miscalibration |
| Accurate explanations | Plausible but incorrect rationales | Test explanation comprehension with wrong explanations |
| Consistent outputs | Same input → different outputs | Test user adaptation to inconsistency |
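The first row's constraint can be quantified before designing a study. A standard measure is expected calibration error (ECE): bin predictions by reported confidence and average the gap between confidence and accuracy in each bin. This sketch (an illustrative implementation, not from the source) shows how a researcher could check how miscalibrated a candidate system actually is:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by reported confidence, then take the
    bin-size-weighted average of |mean confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# Toy overconfident system: reports 0.9 confidence but is right half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
print(round(ece, 2))  # 0.4
```

A well-calibrated system scores near zero; a large ECE signals that the "calibrated confidence" assumption does not hold and the degraded-condition study is the relevant one.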

Related: [None yet]