LLM Annotation Suitability Framework
Overview
A decision framework for determining whether LLM annotation is appropriate for a given task, based on empirical patterns from large-scale evaluation studies.
Components
Signal Type Assessment
Use LLMs when:
- Target categories map to explicit textual signals (keywords, phrases, patterns)
- Annotation can be determined from the text alone
- Little external context or world knowledge is required
- The task is closer to extraction than interpretation
Avoid LLMs when:
- Annotation requires inference about unstated meaning
- Cultural, historical, or domain context is essential
- The concept is “latent” (not directly observable in the text)
- Human judgment typically varies even among experts (a checklist sketch of these criteria follows)
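The criteria above are qualitative, but they can be encoded as a rough pre-screening checklist. The sketch below is a hypothetical encoding; the field names and the all-or-nothing decision rule are illustrative assumptions, not part of the framework itself.

```python
# Hypothetical checklist encoding of the signal-type criteria; field names
# and the decision rule are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    explicit_textual_signals: bool   # categories map to keywords/phrases/patterns
    decidable_from_text_alone: bool  # no external context or world knowledge needed
    requires_inference: bool         # unstated meaning must be inferred
    context_dependent: bool          # cultural/historical/domain context essential
    latent_concept: bool             # concept not directly observable in the text
    expert_disagreement: bool        # human judgment varies even among experts

def llm_annotation_advisable(task: TaskProfile) -> bool:
    """True only when both 'use' criteria hold and no 'avoid' criterion applies."""
    use_ok = task.explicit_textual_signals and task.decidable_from_text_alone
    avoid_hit = (task.requires_inference or task.context_dependent
                 or task.latent_concept or task.expert_disagreement)
    return use_ok and not avoid_hit
```

A task that fails this screen should at minimum go through the inter-LLM agreement test below before any large-scale annotation run.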
Inter-LLM Agreement Test
Before committing to LLM annotation at scale:
- Run 3–5 different LLMs on a sample (n=100–500)
- Calculate pairwise intercoder reliability (Krippendorff’s alpha, not simple agreement)
- Interpret:
- α > 0.67: LLM annotation likely appropriate
- α 0.4–0.67: Proceed with caution, validate against expert sample
- α < 0.4: Task probably requires human annotation
High LLM-to-LLM agreement predicts high LLM-to-human agreement. Low LLM-to-LLM agreement signals fundamental ambiguity in the task.
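A minimal sketch of this test, assuming nominal categories coded as integers and the third-party `krippendorff` package; the model names and labels are hypothetical placeholders for real annotation output.

```python
# Inter-LLM agreement sketch: overall and pairwise Krippendorff's alpha.
# Assumes the `krippendorff` package (pip install krippendorff); model names
# and labels below are placeholders.
from itertools import combinations

import numpy as np
import krippendorff

# One row per LLM, one column per sampled item; use np.nan where a model
# failed to return a label. Values are nominal category codes.
annotations = {
    "model_a": [0, 1, 1, 2, 0, 1],
    "model_b": [0, 1, 1, 2, 0, 2],
    "model_c": [0, 1, 0, 2, 0, 1],
}

matrix = np.array(list(annotations.values()), dtype=float)
overall = krippendorff.alpha(reliability_data=matrix,
                             level_of_measurement="nominal")
print(f"overall alpha: {overall:.3f}")  # compare against the 0.67 / 0.4 thresholds

# Pairwise alphas help spot a single outlier model dragging agreement down.
for (name_i, row_i), (name_j, row_j) in combinations(annotations.items(), 2):
    pair_alpha = krippendorff.alpha(
        reliability_data=np.array([row_i, row_j], dtype=float),
        level_of_measurement="nominal",
    )
    print(f"{name_i} vs {name_j}: alpha = {pair_alpha:.3f}")
```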
Model Selection Criteria
| Factor | Recommendation |
|---|---|
| Minimum size | 12B parameters |
| Preferred size | 70B+ when available |
| Open vs. proprietary | Open-weight for reproducibility |
| Reasoning models | No advantage for standard annotation |
Validation Requirements
Even when LLMs appear suitable:
- Validate against an expert-coded sample (not a crowd-sourced one)
- Use chance-corrected reliability metrics
- Examine the confusion matrix for systematic category errors (see the sketch after this list)
- If using bias correction, budget for 600–1000 ground-truth samples
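A minimal validation sketch, assuming scikit-learn; the category names, expert labels, and LLM labels are hypothetical placeholders for an expert-coded sample.

```python
# Validation against an expert-coded sample: chance-corrected agreement plus a
# confusion matrix. Assumes scikit-learn; all labels below are placeholders.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

labels = ["policy", "campaign", "other"]  # hypothetical category set
expert = ["policy", "policy", "other", "campaign", "other", "policy"]
llm    = ["policy", "other",  "other", "campaign", "other", "policy"]

# Chance-corrected agreement (Cohen's kappa), not raw percent agreement.
print(f"Cohen's kappa: {cohen_kappa_score(expert, llm, labels=labels):.3f}")

# Rows = expert categories, columns = LLM categories; off-diagonal mass
# concentrated in one cell points to a systematic category error.
print(confusion_matrix(expert, llm, labels=labels))
```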
When to Use
- Scoping phase: deciding whether LLM annotation is viable
- Model selection: choosing among available LLMs
- Quality assurance: interpreting validation results
- Reporting: justifying annotation methodology
Limitations
This framework assumes annotation tasks with discrete categories. Continuous annotation (e.g., probability scores, ratings) introduces additional concerns about LLM calibration not addressed here.
The inter-LLM agreement test adds cost and complexity. For low-stakes or exploratory analysis, it may be acceptable to skip this step while acknowledging the limitation.
Related: [None yet]