Bias Correction Requires 600–1000 Ground Truth Samples

Methods to correct for measurement error in LLM annotations (DSL, PRISA) can reduce bias in downstream estimates, but only with substantial ground truth data.

At smaller sample sizes (200–400), bias correction methods can be counterproductive, increasing rather than reducing error. The methods begin reliably outperforming naive estimates around 600–1000 ground truth samples.
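The mechanics of these corrections can be sketched with a minimal, illustrative bias-corrected mean in the PPI/DSL style (not either paper's exact implementation): average the LLM labels over the full corpus, then subtract the average LLM-minus-gold error estimated on the ground-truth subset. All names and the simulated data below are assumptions for illustration.

```python
import numpy as np

def corrected_mean(llm_all, llm_gold, gold):
    """Illustrative PPI/DSL-style bias-corrected mean.

    llm_all:  LLM labels for the full corpus (size N)
    llm_gold: LLM labels for the ground-truth subset (size n)
    gold:     human labels for that same subset (size n)
    """
    naive = llm_all.mean()
    rectifier = (llm_gold - gold).mean()  # estimated annotation bias
    return naive - rectifier

# Simulated corpus with a systematically biased LLM annotator
rng = np.random.default_rng(0)
truth = rng.binomial(1, 0.3, size=10_000).astype(float)
llm = np.clip(truth + rng.normal(0.1, 0.2, size=truth.size), 0, 1)

# ~800 manually coded ground-truth samples, drawn at random
idx = rng.choice(truth.size, size=800, replace=False)
est = corrected_mean(llm, llm[idx], truth[idx])
```

With enough ground truth, `est` lands much closer to `truth.mean()` than the naive `llm.mean()`; with a very small `idx`, the noisy rectifier can push it further away, which is the failure mode described above.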

This creates a tension: the promise of LLM annotation is avoiding manual coding, but robust use requires a significant manually-coded validation set anyway.

Bias correction also inflates standard errors (4–10× larger than naive estimates when the ground truth sample is small), creating a bias-variance tradeoff. Confidence interval coverage improves (from ~77% to roughly the nominal 95%), but at the cost of statistical power.
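The SE inflation falls out of the variance decomposition for a simple corrected-mean estimator (illustrative form, not the papers' exact one): Var(corrected) ≈ Var(llm)/N + Var(llm − gold)/n, so the 1/n rectifier term dominates when the ground-truth sample n is small. The variance values below are assumed, chosen only to show the shape of the tradeoff:

```python
import numpy as np

N = 10_000      # corpus size (assumed)
var_llm = 0.20  # variance of LLM labels (assumed)
var_err = 0.15  # variance of LLM-minus-gold error (assumed)

def se_corrected(n):
    # naive variance term (1/N) plus rectifier variance term (1/n)
    return np.sqrt(var_llm / N + var_err / n)

se_naive = np.sqrt(var_llm / N)
print(se_corrected(200) / se_naive)    # ≈6.2× inflation at n=200
print(se_corrected(1000) / se_naive)   # ≈2.9× inflation at n=1000
```

Under these assumed variances, inflation at n=200 sits squarely in the 4–10× range noted above and shrinks as n grows.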

For tasks that require aggregating annotations (e.g., from sentence level to speaker level), the required ground-truth sample is likely much larger, since annotation errors propagate through the aggregation step.

Related: [None yet]