Automated Grader Agreement Ceiling
Automated AI graders achieve 66% agreement with human expert graders on quality assessments of complex work, only 5 percentage points below the 71% inter-rater agreement among the human experts themselves.
This finding has methodological implications:
- The gap between AI and human grading (5 points) is smaller than the inherent disagreement among human experts (29 points); a minimal sketch of how such pairwise agreement rates are computed follows this list
- Self-bias exists: the automated grader (based on GPT-5) shows lower correlation with human experts when assessing outputs from capable OpenAI models
- Both agreement rates (AI-to-human and human-to-human) are highest for less capable models, whose outputs are easier to distinguish from human work
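A minimal sketch of the pairwise agreement computation, assuming each grader assigns one discrete quality label per work item (the labels, data, and function names are hypothetical, not from the source):

```python
from itertools import combinations

def pairwise_agreement(a, b):
    """Fraction of items on which two graders assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical per-item labels from two human experts and the automated grader.
human_1 = ["pass", "fail", "pass", "pass", "fail", "pass"]
human_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
auto    = ["pass", "pass", "fail", "pass", "fail", "pass"]

# Human inter-rater baseline: mean agreement over all human-human pairs.
human_pairs = list(combinations([human_1, human_2], 2))
human_baseline = sum(pairwise_agreement(a, b) for a, b in human_pairs) / len(human_pairs)

# Automated-grader score: mean agreement with each human expert.
auto_score = sum(pairwise_agreement(auto, h) for h in (human_1, human_2)) / 2

print(f"human-human: {human_baseline:.0%}, auto-human: {auto_score:.0%}")
# -> human-human: 83%, auto-human: 75% on this toy data
```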
The narrow gap suggests automated grading may be viable for many evaluation use cases, but the self-bias finding is a significant caveat. Using the same model family for both task completion and quality assessment introduces systematic bias that inflates apparent performance.
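One way to probe the self-bias caveat is to stratify the automated grader's agreement with humans by the family of the model that produced each graded output. A sketch under the assumption that each graded item records its producing model family (all field names and data are hypothetical):

```python
from collections import defaultdict

def agreement_by_family(items):
    """Auto-grader vs. human agreement, split by the family of the model
    that produced each graded output."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        fam = item["model_family"]  # hypothetical field name
        totals[fam] += 1
        hits[fam] += item["auto_label"] == item["human_label"]
    return {fam: hits[fam] / totals[fam] for fam in totals}

# Hypothetical items: markedly lower agreement on the grader's own model
# family than on others would be consistent with the self-bias finding.
items = [
    {"model_family": "same", "auto_label": "pass", "human_label": "fail"},
    {"model_family": "same", "auto_label": "pass", "human_label": "pass"},
    {"model_family": "other", "auto_label": "fail", "human_label": "fail"},
    {"model_family": "other", "auto_label": "pass", "human_label": "pass"},
]
print(agreement_by_family(items))  # {'same': 0.5, 'other': 1.0}
```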
Related: [None yet]