Prompt Sensitivity Problem

Small, seemingly insignificant changes in prompt wording can cause dramatically different outputs from the same model.

This sensitivity manifests across multiple dimensions:

Phrasing Changes: Synonymous instructions (“list” vs “enumerate”) may yield substantially different results. The model isn’t reasoning about meaning; it’s pattern-matching against training distributions.

Format Changes: Switching between question formats (open-ended vs multiple choice) changes performance, sometimes unpredictably.

Prompt Drift: The same prompt may perform differently over time as the underlying model is updated. What worked yesterday may not work tomorrow. Production systems can silently degrade.
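
One way to catch drift before users do is a pinned regression suite: a fixed set of prompts with simple pass/fail checks, rerun whenever the model or its version changes and compared against stored baseline results. The sketch below is only illustrative: the `call_model` client is a placeholder, and the suite, checks, and baseline file name are made-up assumptions.

```python
# Minimal drift-check sketch: rerun a pinned prompt suite and compare
# pass/fail results against a stored baseline. The suite, checks, baseline
# file name, and `call_model` client are illustrative assumptions.

import json
from pathlib import Path

BASELINE = Path("prompt_baseline.json")  # hypothetical results file

SUITE = {
    "extract_date": "Extract the date from: 'Invoice issued 2024-03-05.'",
    "classify": "Is this review positive or negative? 'Great product, fast shipping.'",
}

CHECKS = {
    "extract_date": lambda out: "2024-03-05" in out,
    "classify": lambda out: "positive" in out.lower(),
}


def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual model client here."""
    raise NotImplementedError


def run_suite() -> dict[str, bool]:
    """Run every pinned prompt once and record whether its check passes."""
    return {name: CHECKS[name](call_model(prompt)) for name, prompt in SUITE.items()}


def check_for_drift() -> None:
    """Flag cases that passed in the stored baseline but fail now."""
    current = run_suite()
    if BASELINE.exists():
        baseline = json.loads(BASELINE.read_text())
        regressions = [name for name, ok in baseline.items() if ok and not current.get(name)]
        if regressions:
            print(f"Possible drift; previously passing cases now fail: {regressions}")
    BASELINE.write_text(json.dumps(current))
```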

The implications are uncomfortable: prompt performance is often more fragile than it appears. Techniques that work in one context may fail in another. This argues for systematic testing across variations rather than relying on a single successful example.
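
A minimal sketch of what such variation testing might look like: several paraphrasings of the same task are each run repeatedly and scored with a crude string check. As above, `call_model` is a placeholder for whatever client you use, and the variants and scoring rule are illustrative assumptions rather than a recommended benchmark.

```python
# Minimal variation-testing sketch: run paraphrased variants of one task
# several times each and compare pass rates. The variants, expected answer,
# and `call_model` placeholder are illustrative assumptions.

from collections import defaultdict

PROMPT_VARIANTS = {
    "list": "List the first three prime numbers.",
    "enumerate": "Enumerate the first three prime numbers.",
    "question": "What are the first three prime numbers?",
}

EXPECTED = {"2", "3", "5"}


def call_model(prompt: str) -> str:
    """Placeholder: swap in your actual model client here."""
    raise NotImplementedError


def score(output: str) -> bool:
    """Crude check: did every expected item appear somewhere in the output?"""
    return all(item in output for item in EXPECTED)


def evaluate(n_runs: int = 10) -> dict[str, float]:
    """Run each variant n_runs times and report its pass rate."""
    passes: dict[str, int] = defaultdict(int)
    for name, prompt in PROMPT_VARIANTS.items():
        for _ in range(n_runs):
            passes[name] += score(call_model(prompt))
    return {name: passes[name] / n_runs for name in PROMPT_VARIANTS}


if __name__ == "__main__":
    for name, rate in evaluate().items():
        print(f"{name:>10}: {rate:.0%} pass rate")
```

Even a crude harness like this answers the question a single successful run cannot: does the result survive a paraphrase, or did it depend on one particular wording?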

Prompt sensitivity also suggests that prompting success sometimes reflects luck in hitting favorable patterns rather than principled design.

Related: 05-atom—prompting-vs-prompt-engineering, 05-atom—ai-vs-human-prompt-optimization