Model Failure Mode Distribution

When frontier AI models lose to human experts on knowledge work tasks, the primary failure mode is instruction-following, not accuracy.

Analysis of expert justifications for preferring human deliverables shows:

  • Instruction-following failures: 14-40% of losses (varies by model)
  • Formatting errors: 5-10% of losses
  • Accuracy errors: 5-7% of losses
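
For concreteness, a minimal Python sketch of how such a distribution could be tallied from labeled loss records. The `losses` records, model keys, and failure-mode labels are hypothetical stand-ins for illustration, not the underlying dataset:

```python
from collections import Counter

# Hypothetical loss records labeled from expert justifications:
# (model, failure_mode) pairs. Illustrative data only.
losses = [
    ("claude", "instruction_following"),
    ("claude", "instruction_following"),
    ("gpt5", "formatting"),
    ("gemini", "instruction_following"),
    ("grok", "accuracy"),
    ("grok", "formatting"),
]

def failure_mode_shares(records):
    """Return each failure mode's share of total losses, in percent."""
    counts = Counter(mode for _, mode in records)
    total = sum(counts.values())
    return {mode: 100 * n / total for mode, n in counts.items()}

print(failure_mode_shares(losses))
# {'instruction_following': 50.0, 'formatting': 33.33..., 'accuracy': 16.66...}
```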

Claude, Grok, and Gemini most often lost on instruction-following failures. GPT-5 lost mainly to formatting errors and had the fewest instruction-following issues. Gemini and Grok frequently promised deliverables but failed to provide them, ignored supplied reference data, or delivered in the wrong format.
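
The same kind of records can be cross-tabulated per model to surface each model's dominant failure mode, which is how claims like the ones above would be read off. Again, the data and the `dominant_failure_mode` helper are illustrative assumptions:

```python
from collections import Counter, defaultdict

# Same hypothetical (model, failure_mode) record format as above.
losses = [
    ("claude", "instruction_following"),
    ("gpt5", "formatting"),
    ("gemini", "instruction_following"),
    ("grok", "instruction_following"),
    ("grok", "instruction_following"),
    ("grok", "formatting"),
]

def dominant_failure_mode(records):
    """Return each model's most frequent failure mode among its losses."""
    per_model = defaultdict(Counter)
    for model, mode in records:
        per_model[model][mode] += 1
    return {model: modes.most_common(1)[0][0]
            for model, modes in per_model.items()}

print(dominant_failure_mode(losses))
# {'claude': 'instruction_following', 'gpt5': 'formatting',
#  'gemini': 'instruction_following', 'grok': 'instruction_following'}
```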

This pattern suggests that much of the remaining gap between AI and expert performance isn't about knowledge or reasoning: it's about following specifications reliably. The bottleneck is compliance, not competence.

Related: 05-atom—context-specification-gap, 07-molecule—ui-as-ultimate-guardrail, 05-atom—model-strengths-by-modality