What Benchmarks Exclude

GDPval explicitly excludes several categories of knowledge work that remain difficult to evaluate.

Excluded from current scope:

  • Tasks requiring extensive tacit knowledge
  • Access to personally identifiable information
  • Use of proprietary software tools
  • Communication between individuals
  • Physical or manual labor
  • Interactive, multi-turn work processes

Also notable:

  • Tasks are precisely specified (real work often isn’t)
  • One-shot completion (real work involves iteration)
  • Full context provided upfront (real work requires discovery)

This list reveals the boundaries of current AI capability evaluation, and possibly of current AI capability itself. The excluded categories represent much of what makes professional work difficult: navigating ambiguity, leveraging tacit knowledge, coordinating with others, and iterating toward unclear goals.

Performance drops on under-contextualized versions of the same tasks, where context must be figured out rather than provided. That gap suggests these exclusions aren’t just methodological conveniences; they point at real capability boundaries.

Related: 05-atom—context-specification-gap, 06-molecule—seci-framework