LLM Selection as Hidden Researcher Degrees of Freedom

The Principle

The choice of which LLM to use for annotation is a consequential analytical decision that often goes unreported, creating new opportunities for result-seeking behavior, whether intentional or not.

Why This Matters

Traditional concerns about researcher degrees of freedom focus on choices like variable operationalization, model specification, and sample selection. LLM-based annotation adds a new dimension: the proliferation of models means researchers can (consciously or not) select an LLM that produces their preferred result.

Different LLMs produce systematically different annotations from identical prompts and data. These differences are not random; they reflect the distinct biases each model encodes during training. When these biased annotations feed downstream analysis, the choice of LLM can determine whether results are significant, null, or reversed.

How to Apply

Document and justify LLM selection. Pre-registration should include the specific model (including version), the rationale for selecting it, and a commitment to that choice.
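
One lightweight way to honor this is a machine-readable record committed alongside the pre-registration. A minimal sketch in Python; the file name, field names, and values are illustrative, not a standard:

    # annotation_spec.py -- illustrative pre-registration record (hypothetical format)
    ANNOTATION_SPEC = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # exact model identifier
        "revision": "abc123def456",                   # pinned weights version (placeholder hash)
        "rationale": "open weights; best validation F1 among candidates",
        "temperature": 0.0,                           # deterministic decoding where supported
        "committed_on": "2025-01-15",                 # date the choice was locked in
    }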

Test sensitivity across models. Before finalizing analysis, run the annotation with multiple LLMs to understand how robust conclusions are to annotator choice.
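
One way to structure such a check: hold the prompt and data fixed, loop over candidate models, and store each model's labels side by side. A minimal sketch in Python, assuming a hypothetical annotate(model_id, text) wrapper that the reader fills in for their own inference API:

    import pandas as pd

    MODELS = ["model-a", "model-b", "model-c"]  # placeholder model identifiers

    def annotate(model_id: str, text: str) -> str:
        """Hypothetical wrapper: send one fixed prompt plus `text` to `model_id`
        and return its label. The body depends on the project's inference API."""
        raise NotImplementedError

    texts = ["example document 1", "example document 2"]

    # One column of labels per model; identical prompt and data throughout.
    labels = pd.DataFrame({m: [annotate(m, t) for t in texts] for m in MODELS})

    # Share of documents on which every model agrees: a quick robustness signal.
    full_agreement = (labels.nunique(axis=1) == 1).mean()
    print(f"All-model agreement: {full_agreement:.1%}")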

Prefer open-weight models. Proprietary models can be updated without notice, making exact replication impossible. Open-weight models with version pinning support reproducibility.
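
With open weights, the commitment can be enforced in code. Hugging Face transformers, for example, lets from_pretrained pin a specific revision; a minimal sketch (the model ID is an example, the commit hash a placeholder):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example open-weight model
    REVISION = "abc123def456"                      # placeholder: pin an exact commit hash

    # Pinning `revision` loads the same weights on every run, unlike a
    # proprietary endpoint that can be updated without notice.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=REVISION)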

Report inter-LLM variability. If running multiple models, report the range of estimates obtained, not just the one that made it into the paper.
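
A minimal sketch of what that report might look like, with placeholder numbers; in practice each estimate comes from rerunning the full downstream analysis on one model's annotations:

    # Placeholder estimates; each would come from rerunning the downstream
    # analysis on that model's annotations.
    estimates = {
        "model-a": 0.142,
        "model-b": 0.089,
        "model-c": -0.013,
    }

    lo, hi = min(estimates.values()), max(estimates.values())

    # Report the full range across annotators, not only the headline model.
    print(f"Effect estimate range across LLMs: [{lo:+.3f}, {hi:+.3f}]")
    for model, est in sorted(estimates.items()):
        print(f"  {model}: {est:+.3f}")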

When This Especially Matters

  • When annotation determines the key variable (independent or dependent)
  • When effect sizes are small and sensitive to measurement
  • When the annotation task involves subjective judgment
  • In high-stakes research contexts where replication matters

Exceptions and Limits

For highly structured extraction tasks with clear right answers (named entity recognition, explicit fact extraction), model choice matters less. The concern is greatest for tasks requiring interpretation.

Related: [None yet]