Human vs. Supervised vs. LLM Annotation

Approach A: Human Annotation

Workflow: Raw data → Codebook development → Coder training → Manual annotation (full dataset) → Validation/QA → Annotated data

Strengths:

  • Handles nuance, context, and ambiguity
  • Biases are relatively well-understood
  • Can flag edge cases and codebook problems
  • Established theory for predicting error patterns

Weaknesses:

  • Expensive at scale
  • Slow throughput
  • Subject to fatigue, drift, and individual variation
  • Still requires intercoder reliability checks (e.g., Cohen's kappa; see the sketch below)
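
A minimal sketch of an intercoder reliability check, assuming two coders' labels are stored as parallel Python lists and scikit-learn is available; the labels and the 0.8 threshold are illustrative, not prescriptive.

```python
# Intercoder reliability check for two human coders.
# coder_a[i] and coder_b[i] annotate the same item (placeholder data).
from sklearn.metrics import cohen_kappa_score

coder_a = ["pos", "neg", "pos", "neu", "pos", "neg"]
coder_b = ["pos", "neg", "neu", "neu", "pos", "neg"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Illustrative rule of thumb: revisit the codebook or retrain coders
# when chance-corrected agreement falls below ~0.8.
if kappa < 0.8:
    print("Agreement below threshold: review codebook and retrain.")
```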

Approach B: Supervised Model Annotation

Workflow: Raw data → Codebook development → Manual annotation (training subset) → Model training → Automated annotation → Validation/QA → Annotated data
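
A minimal sketch of this workflow, assuming scikit-learn and a TF-IDF + logistic regression classifier; all texts, labels, and split sizes below are illustrative placeholders, not a recommended setup.

```python
# Supervised annotation workflow: label a subset by hand, train a
# model on it, validate, then annotate the remainder automatically.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Manually annotated training subset (placeholder data).
labeled_texts = [
    "loved it", "terrible service", "works fine", "never again",
    "highly recommend", "waste of money", "does the job", "broke in a day",
]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

# Hold out part of the labeled subset for the validation/QA stage.
X_train, X_val, y_train, y_val = train_test_split(
    labeled_texts, labels, test_size=0.25, random_state=42, stratify=labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Validation/QA before trusting the automated stage.
print(classification_report(y_val, model.predict(X_val)))

# Automated annotation of the unlabeled remainder of the corpus.
unlabeled = ["pretty good overall", "refund took weeks"]
auto_labels = model.predict(unlabeled)
```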

Strengths:

  • Consistent once trained
  • Scalable after initial investment
  • Errors are systematic (easier to characterize)
  • Can iterate on training data to improve

Weaknesses:

  • Requires substantial labeled training data (often around 4,000 examples)
  • Training introduces researcher influence
  • Model performance bounded by training data quality
  • Less flexible to codebook changes

Approach C: LLM Annotation

Workflow: Raw data → Codebook development → Automated annotation (zero-shot or few-shot) → Validation/QA → Annotated data
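
A minimal zero-shot sketch of this workflow, assuming the OpenAI Python SDK (v1+) with an API key in the environment; the model name, prompt wording, and label set are placeholders, and any chat-completions-style client could stand in.

```python
# Zero-shot LLM annotation: the codebook is expressed as a prompt
# and the model labels each item directly, with no training step.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CODEBOOK_PROMPT = (
    "You are an annotator. Label the text as exactly one of: "
    "positive, negative, neutral. Reply with the label only."
)

def annotate(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model choice
        temperature=0,          # reduce run-to-run variation
        messages=[
            {"role": "system", "content": CODEBOOK_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

labels = [annotate(t) for t in ["Great product!", "Arrived broken."]]
```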

Strengths:

  • No manual annotation required
  • Low marginal cost at scale
  • Fast deployment
  • Potentially reduces researcher influence (no training step)

Weaknesses:

  • Opaque error patterns (black box)
  • Biases harder to predict than human biases
  • Creates hidden researcher degrees of freedom (model and prompt choices)
  • Validation against ground truth still essential (see the sketch below)
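
Because the error patterns are opaque, validating against a human-coded gold sample is the main safeguard. A minimal sketch, assuming gold and LLM labels are parallel lists over the same items; the data are placeholders.

```python
# Validate LLM annotations against a human-coded gold sample.
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

gold = ["pos", "neg", "neu", "pos", "neg", "neu"]  # human gold labels
llm  = ["pos", "neg", "pos", "pos", "neg", "neu"]  # LLM labels, same items

print("accuracy:", accuracy_score(gold, llm))
print("kappa:   ", cohen_kappa_score(gold, llm))

# The confusion matrix shows *where* the model errs, which matters
# precisely because LLM biases are hard to predict a priori.
print(confusion_matrix(gold, llm, labels=["pos", "neg", "neu"]))
```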

Key Differences

Dimension               Human          Supervised       LLM
Manual coding required  Full dataset   Training subset  None
Upfront cost            High           Medium           Low
Per-annotation cost     High           Low              Very low
Bias predictability     High           Medium           Low
Reproducibility         Moderate       High             Variable
Handles ambiguity       Well           Poorly           Poorly
Scale                   Limited        Good             Excellent

When Each Applies

Use human annotation when:

  • Task requires judgment on latent concepts
  • Stakes are high (errors have significant consequences)
  • Dataset is small enough to be feasible
  • Understanding error patterns matters

Use supervised models when:

  • Task is well-defined with clear categories
  • Sufficient training data is available or can be created
  • Consistency at scale is paramount
  • Iterative improvement is possible

Use LLMs when:

  • Task has explicit textual signals
  • Speed and cost are primary constraints
  • Ground truth validation sample is available
  • Conclusions will be tested for robustness to annotator choice (see the sketch below)
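
A minimal sketch of such a robustness check, assuming the same items have been annotated by two different sources (two models, or a model and humans); the downstream quantity here, a class proportion, is illustrative.

```python
# Robustness to annotator choice: recompute the downstream estimate
# under each annotation source and compare the conclusions.
from collections import Counter

annotations = {  # placeholder labels for the same items
    "model_a": ["pos", "pos", "neg", "neu", "pos"],
    "model_b": ["pos", "neu", "neg", "neu", "pos"],
}

for source, labels in annotations.items():
    share_pos = Counter(labels)["pos"] / len(labels)
    print(f"{source}: positive share = {share_pos:.2f}")

# If the substantive conclusion (e.g., "most items are positive")
# flips across sources, the annotator choice is doing the work.
```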

Related: 03-molecule—annotation-suitability-framework