LLM Annotation Suitability Framework

Overview

A decision framework for determining whether LLM annotation is appropriate for a given task, based on empirical patterns from large-scale evaluation studies.

Components

Signal Type Assessment

Use LLMs when:

  • Target categories map to explicit textual signals (keywords, phrases, patterns)
  • Annotation can be determined from the text alone
  • Little external context or world knowledge required
  • The task is closer to extraction than interpretation

Avoid LLMs when:

  • Annotation requires inference about unstated meaning
  • Cultural, historical, or domain context is essential
  • The concept is “latent” (not directly observable in the text)
  • Human judgment typically varies even among experts
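
As a rough illustration only (not part of the framework itself), these criteria could be encoded as a pre-screening checklist. The field names and the pass rule below are hypothetical, chosen to mirror the bullets above:

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Hypothetical pre-screening checklist for an annotation task."""
    explicit_textual_signals: bool   # categories map to keywords/phrases/patterns
    text_alone_sufficient: bool      # no external documents or metadata needed
    needs_world_knowledge: bool      # cultural/historical/domain context essential
    latent_concept: bool             # not directly observable in the text
    experts_often_disagree: bool     # human judgment varies even among experts

def llm_annotation_plausible(task: TaskProfile) -> bool:
    """Return True if the task passes the signal-type screen.

    This only gates the decision; the inter-LLM agreement test
    is still required before annotating at scale.
    """
    favorable = task.explicit_textual_signals and task.text_alone_sufficient
    unfavorable = (task.needs_world_knowledge
                   or task.latent_concept
                   or task.experts_often_disagree)
    return favorable and not unfavorable

# Example: keyword-driven topic tagging of press releases
print(llm_annotation_plausible(TaskProfile(True, True, False, False, False)))  # True
```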

Inter-LLM Agreement Test

Before committing to LLM annotation at scale:

  1. Run 3–5 different LLMs on a sample (n=100–500)
  2. Calculate pairwise intercoder reliability (Krippendorff’s alpha, not simple agreement)
  3. Interpret:
    • α > 0.67: LLM annotation likely appropriate
    • α 0.4–0.67: Proceed with caution, validate against expert sample
    • α < 0.4: Task probably requires human annotation

High LLM-to-LLM agreement predicts high LLM-to-human agreement. Low LLM-to-LLM agreement signals fundamental ambiguity in the task.
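
A minimal sketch of steps 1–2, assuming each model's annotations are stored as lists of nominal labels aligned by item and that the third-party krippendorff package is installed. The model names and labels are placeholders:

```python
# pip install krippendorff numpy
from itertools import combinations
import numpy as np
import krippendorff

# Placeholder: one list of nominal labels per model, aligned by item index.
# None marks items a model failed to annotate.
annotations = {
    "model_a": ["econ", "health", "econ", None, "crime"],
    "model_b": ["econ", "health", "crime", "econ", "crime"],
    "model_c": ["econ", "econ", "crime", "econ", "crime"],
}

# Map string labels to integer codes; np.nan encodes missing values.
labels = sorted({l for v in annotations.values() for l in v if l is not None})
code = {l: i for i, l in enumerate(labels)}
coded = {m: [np.nan if l is None else code[l] for l in v]
         for m, v in annotations.items()}

# Pairwise Krippendorff's alpha (nominal level) for every model pair.
for m1, m2 in combinations(coded, 2):
    alpha = krippendorff.alpha(
        reliability_data=np.array([coded[m1], coded[m2]], dtype=float),
        level_of_measurement="nominal",
    )
    print(f"{m1} vs {m2}: alpha = {alpha:.2f}")
```

In practice the sample size should follow the guideline above (n=100–500); the toy arrays here only demonstrate the calculation.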

Model Selection Criteria

  • Minimum size: 12B parameters
  • Preferred size: 70B+ when available
  • Open vs. proprietary: open-weight for reproducibility
  • Reasoning models: no advantage for standard annotation

Validation Requirements

Even when LLMs appear suitable:

  • Validate against expert-coded sample (not crowd-sourced)
  • Use chance-corrected reliability metrics
  • Examine confusion matrix for systematic category errors
  • If using bias correction, budget for 600–1000 ground truth samples
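
A minimal validation sketch, assuming the LLM labels and an expert-coded gold sample are available as aligned lists (the names and data below are placeholders). Cohen's kappa stands in here for the chance-corrected reliability metric; any equivalent statistic would serve:

```python
# pip install scikit-learn
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Placeholder data: expert-coded gold labels and LLM labels for the same items.
gold = ["econ", "health", "crime", "econ", "health", "crime", "econ"]
llm  = ["econ", "health", "econ",  "econ", "health", "crime", "health"]
categories = ["econ", "health", "crime"]

# Chance-corrected agreement between the LLM and the expert sample.
kappa = cohen_kappa_score(gold, llm, labels=categories)
print(f"Cohen's kappa vs. experts: {kappa:.2f}")

# Confusion matrix (rows = expert, columns = LLM) to spot systematic
# category errors, e.g. one class consistently absorbed by another.
cm = confusion_matrix(gold, llm, labels=categories)
for cat, row in zip(categories, cm):
    print(cat, row)
```

If a bias-correction step is planned, the same expert-coded sample can double as its ground-truth set, but only if it reaches the 600–1000 item budget noted above.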

When to Use

  • Scoping phase: deciding whether LLM annotation is viable
  • Model selection: choosing among available LLMs
  • Quality assurance: interpreting validation results
  • Reporting: justifying annotation methodology

Limitations

This framework assumes annotation tasks with discrete categories. Continuous annotation (e.g., probability scores, ratings) introduces additional concerns about LLM calibration not addressed here.

The inter-LLM agreement test adds cost and complexity. For low-stakes or exploratory analysis, it may be acceptable to skip this step while acknowledging the limitation.

Related: [None yet]