Elicitation Design Determines Output Quality

The Principle

How you ask an LLM to respond shapes what it can produce. Constraints intended to simplify outputs often degrade them. When structured output fails, the cause is frequently the elicitation method rather than the model’s capability.

Why This Matters

When LLMs produce poor structured outputs, the instinct is to assume a model limitation. But the Maier et al. research demonstrates a different pattern: LLMs prompted for free-text responses, with the structure recovered in post-processing, dramatically outperform LLMs constrained to produce structured outputs directly.

  • Direct Likert rating: models cluster around “safe” middle values
  • Textual response + mapping: human-like distributions with discriminative signal

The constraint itself created the failure. Remove it, then map afterward, and the problem disappears.

How to Apply

Generate first, structure second. Let the model respond in natural language, then convert to your target format via embedding similarity, classification, or extraction.
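
As a minimal sketch of the embedding-similarity route, the snippet below maps a free-text answer onto a five-point Likert scale by comparing it with one anchor statement per scale point. Everything here is an assumption for illustration: the anchor wording, the function names, the temperature value, and above all the toy bag-of-words embed function, which stands in for a real sentence-embedding model and handles negation poorly.

```python
import math
import re
from collections import Counter

# Hypothetical anchor statements, one per Likert point (1 = lowest, 5 = highest).
ANCHORS = {
    1: "I would definitely not buy this",
    2: "I would probably not buy this",
    3: "I am not sure whether I would buy this",
    4: "I would probably buy this",
    5: "I would definitely buy this",
}


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def likert_distribution(response, temperature=0.1):
    """Turn a free-text response into a probability distribution over Likert points."""
    vec = embed(response)
    sims = {point: cosine(vec, embed(anchor)) for point, anchor in ANCHORS.items()}
    top = max(sims.values())
    # Softmax over similarities; subtracting the maximum keeps exp() well-behaved.
    weights = {p: math.exp((s - top) / temperature) for p, s in sims.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def expected_rating(dist):
    """Collapse the distribution to a single expected score when one is needed."""
    return sum(point * prob for point, prob in dist.items())
```

Keeping the full distribution rather than only the best-matching point preserves the spread of the responses, which is the property direct rating tends to lose.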

Provide rich context for discrimination. Without personas or situational context, LLMs converge toward generic responses. Specificity in the prompt creates specificity in the output.
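
One way to make that context explicit is to assemble the prompt from a persona and a situation rather than sending the bare question. The sketch below is illustrative only: the build_prompt helper, the persona fields, and the prompt wording are all hypothetical, not a prescribed template.

```python
def build_prompt(persona, situation, question):
    """Assemble a free-text elicitation prompt from persona and situational context.

    All field names and wording are illustrative assumptions.
    """
    traits = "; ".join(f"{key}: {value}" for key, value in persona.items())
    return (
        f"You are answering as the following person. {traits}.\n"
        f"Situation: {situation}\n"
        f"Question: {question}\n"
        "Answer in a few sentences, in your own words."
    )


# Hypothetical persona; vary these fields across simulated respondents.
prompt = build_prompt(
    persona={"age": 34, "occupation": "nurse", "budget": "tight"},
    situation="You are comparing toothpaste brands in a supermarket aisle.",
    question="How likely are you to buy the product described above?",
)
```

Note that the prompt asks for a free-text answer and says nothing about a rating scale; the mapping to a scale happens afterward.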

Validate against ceiling, not perfection. Human data is noisy, so even two human samples will not correlate perfectly with each other. Measure synthetic performance as a percentage of the achievable correlation, not as raw accuracy.
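
A rough sketch of that measurement follows, assuming the ceiling is estimated as the correlation between two independent human samples rating the same items (for example split halves or a test-retest pair). The function names and that choice of ceiling are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_attainment(synthetic, human_a, human_b):
    """Share of the human-vs-human correlation that the synthetic data reaches.

    human_a and human_b are per-item scores from two independent human samples;
    their correlation is treated as the achievable ceiling (an assumption).
    """
    ceiling = pearson(human_a, human_b)
    achieved = pearson(synthetic, human_a)
    return achieved / ceiling
```

Read this way, a synthetic-to-human correlation that looks modest in isolation can still be close to the best any data source could achieve on noisy items.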

Test distribution shape, not just point estimates. Means can match while distributions diverge catastrophically. Good rankings with bad distributions (or vice versa) signal method failure.
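
The snippet below illustrates the point with made-up ratings: both samples have the same mean, yet their shapes differ sharply. The distance used is the largest gap between the two cumulative distributions (a Kolmogorov-Smirnov style statistic), chosen here as one reasonable option rather than the required metric; the function names are hypothetical.

```python
from collections import Counter

LIKERT_POINTS = (1, 2, 3, 4, 5)


def likert_histogram(ratings, points=LIKERT_POINTS):
    """Relative frequency of each scale point."""
    counts = Counter(ratings)
    return [counts[p] / len(ratings) for p in points]


def ks_distance(ratings_a, ratings_b, points=LIKERT_POINTS):
    """Largest gap between the two cumulative distributions (0 means identical)."""
    hist_a = likert_histogram(ratings_a, points)
    hist_b = likert_histogram(ratings_b, points)
    cum_a = cum_b = gap = 0.0
    for share_a, share_b in zip(hist_a, hist_b):
        cum_a += share_a
        cum_b += share_b
        gap = max(gap, abs(cum_a - cum_b))
    return gap


# Made-up ratings for illustration only.
human = [1, 1, 2, 3, 4, 5, 5, 5, 1, 3]   # polarized responses
synthetic = [3] * 10                      # everything collapsed to the midpoint

same_mean = sum(human) / len(human) == sum(synthetic) / len(synthetic)
shape_gap = ks_distance(human, synthetic)
```

Here same_mean is true while shape_gap is far from zero, which is exactly the failure a mean-only check would miss.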

When This Especially Matters

  • Any structured output task where direct generation produces degenerate results
  • Survey simulation and synthetic data generation
  • Classification tasks with ambiguous boundaries
  • Anywhere you’re tempted to blame the model for output quality issues

Exceptions

Some tasks genuinely require direct structured output (code generation, JSON extraction). The pattern applies most strongly where the target structure imposes artificial constraints on expression.

Broader Implication

This connects to the “UI as the Ultimate Guardrail” principle: interface design determines what capabilities can be expressed. The prompt is the interface. Poorly designed prompts mask model capability.

Related: 07-molecule—ui-as-ultimate-guardrail, 06-molecule—ssr-framework, 05-atom—uniform-confidence-problem