What Benchmarks Exclude

GDPval explicitly excludes several categories of knowledge work that remain difficult to evaluate:

Excluded from current scope:

Tasks requiring extensive tacit knowledge
Access to personally identifiable information
Use of proprietary software tools
Communication between individuals
Physical or manual labor
Interactive, multi-turn work processes

Also notable:

Tasks are precisely specified (real work often isn’t)
One-shot completion (real work involves iteration)
Full context provided upfront (real work requires discovery)

This list reveals the boundaries of current AI capability evaluation, and possibly of current AI capability itself. The excluded categories represent much of what makes professional work difficult: navigating ambiguity, leveraging tacit knowledge, coordinating with others, and iterating toward unclear goals.

The performance gap on under-contextualized versions of the same tasks (where context must be figured out rather than provided) suggests these exclusions aren’t just methodological conveniences: they point to real capability boundaries.
