Model Strengths Differ by Modality

Different frontier models excel at different aspects of knowledge work deliverables.

In the GDPval benchmark:

  • Claude Opus 4.1 excelled on aesthetics: document formatting, slide layout, visual file types (.pdf, .xlsx, .pptx)
  • GPT-5 excelled on accuracy: following instructions precisely, performing correct calculations, pure text outputs

The pattern extends to file types:

  • Claude achieved best results for all deliverable types except pure text
  • GPT-5 led for pure text outputs only
  • Overall win rates remain low across all file types

This suggests that “which model is best” depends heavily on the nature of the work. Tasks requiring visual polish may benefit from different model choices than tasks requiring analytical precision. The practical implication: model selection should be task-type aware, not one-size-fits-all.

Related: 05-atom—model-failure-mode-distribution