What Benchmarks Exclude
GDPval explicitly excludes several categories of knowledge work that remain difficult to evaluate:
Excluded from current scope:
- Tasks requiring extensive tacit knowledge
- Access to personally identifiable information
- Use of proprietary software tools
- Communication between individuals
- Physical or manual labor
- Interactive, multi-turn work processes
Also notable:
- Tasks are precisely specified (real work often isn’t)
- One-shot completion (real work involves iteration)
- Full context provided upfront (real work requires discovery)
This list marks the boundaries of current AI capability evaluation, and possibly of current AI capability itself. The excluded categories cover much of what makes professional work difficult: navigating ambiguity, drawing on tacit knowledge, coordinating with others, and iterating toward unclear goals.
The performance gap on under-contextualized versions of the same tasks (where context must be discovered rather than provided) suggests these exclusions aren’t just methodological conveniences: they point at real capability boundaries.
Related: 05-atom—context-specification-gap, 06-molecule—seci-framework