Human vs. Supervised vs. LLM Annotation
Approach A: Human Annotation
Workflow: Raw data → Codebook development → Coder training → Manual annotation (full dataset) → Validation/QA → Annotated data
Strengths:
- Handles nuance, context, and ambiguity
- Biases are relatively well-understood
- Can flag edge cases and codebook problems
- Established theory for predicting error patterns
Weaknesses:
- Expensive at scale
- Slow throughput
- Subject to fatigue, drift, and individual variation
- Still requires intercoder reliability checks
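Intercoder reliability is usually checked on a double-coded subset before annotation scales up. A minimal sketch using scikit-learn's `cohen_kappa_score`; the two coders' label lists are made-up placeholders, not real data.

```python
# Minimal intercoder reliability check on a double-coded subset.
# The label lists below are illustrative placeholders.
from sklearn.metrics import cohen_kappa_score

coder_a = ["protest", "protest", "other", "protest", "other", "other"]
coder_b = ["protest", "other",   "other", "protest", "other", "other"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.8 are usually treated as strong agreement
```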
Approach B: Supervised Model Annotation
Workflow: Raw data → Codebook development → Manual annotation (training subset) → Model training → Automated annotation → Validation/QA → Annotated data
Strengths:
- Consistent once trained
- Scalable after initial investment
- Errors are systematic (easier to characterize)
- Can iterate on training data to improve
Weaknesses:
- Requires substantial labeled training data (~4,000 samples on average)
- Training introduces researcher influence
- Model performance bounded by training data quality
- Less flexible to codebook changes
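A minimal sketch of the supervised workflow above, assuming a text classification task: train on the hand-coded subset, validate, then label the remainder automatically. The file names, column names, and the TF-IDF + logistic regression pipeline are illustrative assumptions, not a prescribed setup.

```python
# Sketch: train on a manually annotated subset, then label the remaining documents.
# "labeled.csv" / "unlabeled.csv" and their columns are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

labeled = pd.read_csv("labeled.csv")      # columns: text, label (human-coded subset)
unlabeled = pd.read_csv("unlabeled.csv")  # column: text (remainder of the corpus)

model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))

# Cross-validate on the labeled subset before trusting the model at scale.
scores = cross_val_score(model, labeled["text"], labeled["label"], cv=5, scoring="f1_macro")
print(f"Macro-F1 (5-fold CV): {scores.mean():.2f}")

# Fit on all labeled data, then annotate the rest automatically.
model.fit(labeled["text"], labeled["label"])
unlabeled["label"] = model.predict(unlabeled["text"])
```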
Approach C: LLM Annotation
Workflow: Raw data → Codebook development → Automated annotation (zero-shot or few-shot) → Validation/QA → Annotated data
Strengths:
- No manual annotation required
- Low marginal cost at scale
- Fast deployment
- Potentially reduces researcher influence (no training step)
Weaknesses:
- Opaque error patterns (black box)
- Biases harder to predict than human biases
- Creates hidden researcher degrees of freedom (model choice)
- Validation against ground truth still essential
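A minimal zero-shot sketch of the workflow above. The OpenAI client is used only as a stand-in for whichever model a project adopts; the model name, prompt wording, and label set are assumptions, and the resulting labels still need checking against a human-coded validation sample.

```python
# Sketch: zero-shot LLM annotation from a codebook-style prompt.
# Model name, prompt, and label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["protest", "other"]

def annotate(text: str) -> str:
    prompt = (
        "Classify the document into exactly one of these categories: "
        f"{', '.join(LABELS)}. Answer with the category name only.\n\n"
        f"Document: {text}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder; swap in whichever model the project validates
        temperature=0,         # reduce run-to-run variation
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "other"  # crude fallback for off-schema replies

print([annotate(d) for d in ["Thousands marched downtown today.", "The recipe calls for two eggs."]])
```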
Key Differences
| Dimension | Human | Supervised | LLM |
|---|---|---|---|
| Manual coding required | Full dataset | Training subset | None |
| Upfront cost | High | Medium | Low |
| Per-annotation cost | High | Low | Very low |
| Bias predictability | High | Medium | Low |
| Reproducibility | Moderate | High | Variable |
| Handles ambiguity | Well | Poorly | Poorly |
| Scale | Limited | Good | Excellent |
When Each Applies
Use human annotation when:
- Task requires judgment on latent concepts
- Stakes are high (errors have significant consequences)
- Dataset is small enough to be feasible
- Understanding error patterns matters
Use supervised models when:
- Task is well-defined with clear categories
- Sufficient training data is available or can be created
- Consistency at scale is paramount
- Iterative improvement is possible
Use LLMs when:
- Task has explicit textual signals
- Speed and cost are primary constraints
- Ground truth validation sample is available
- Conclusions will be tested for robustness to annotator choice
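The last two points can be operationalized directly: score the LLM labels against a human-coded gold sample, then check whether the downstream quantity of interest moves with the label source. A minimal sketch, assuming hypothetical label lists and a simple class-proportion estimate as the quantity of interest.

```python
# Sketch: robustness of a downstream estimate to the choice of annotator.
# The label lists are illustrative placeholders for a shared validation sample.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_labels = ["protest", "other", "protest", "other", "other", "protest"]
llm_labels   = ["protest", "other", "protest", "protest", "other", "protest"]

# Agreement on the gold-standard sample.
print(f"Accuracy vs. human labels: {accuracy_score(human_labels, llm_labels):.2f}")
print(f"Cohen's kappa:             {cohen_kappa_score(human_labels, llm_labels):.2f}")

# Does the substantive estimate (here, the share of 'protest' documents) change?
share = lambda ys: sum(y == "protest" for y in ys) / len(ys)
print(f"Protest share (human): {share(human_labels):.2f}")
print(f"Protest share (LLM):   {share(llm_labels):.2f}")
```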