Exemplar Design Principles

The Principle

The quality of examples in few-shot prompts determines the ceiling of model performance. Well-designed exemplars can improve accuracy by up to 90% compared to poorly chosen ones. Exemplar design is a high-leverage intervention.

Why This Matters

In-Context Learning works by pattern matching against provided examples. The model extracts regularities from your exemplars and applies them to new inputs. If your examples are noisy, biased, or poorly formatted, the model learns and reproduces those flaws.

This isn’t about “teaching” the model in any deep sense; it’s about surfacing the right patterns from the model’s existing capabilities. Good exemplars create favorable conditions for the model to succeed.
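A minimal few-shot prompt makes this concrete: the model sees input-output pairs and infers the mapping from their pattern alone. The task, texts, and labels below are made up for illustration.

```python
# The model infers the task (sentiment classification) purely from the
# repeated Text/Label pattern; the final empty "Label:" invites it to
# continue the pattern. All content here is illustrative.
prompt = """\
Text: The service was outstanding.
Label: positive

Text: My order arrived broken.
Label: negative

Text: I'd happily come back again.
Label:"""
```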

How to Apply

Optimize Quantity: More examples generally help, but with diminishing returns. Start with 3-5 high-quality examples before adding more. Context window limits create hard ceilings.

Consider Order: Place diverse examples throughout, not clustered by type. The model may overweight examples near the end (recency bias). Alternating patterns often work well.
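One way to avoid clustering by type is to round-robin exemplars across labels so no single label dominates the tail of the prompt. This helper is a sketch, not a library API; the `(text, label)` pair shape is an assumption.

```python
from itertools import zip_longest

def interleave_by_label(examples):
    """Reorder few-shot (text, label) pairs so labels alternate
    rather than cluster. Illustrative helper only."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[1], []).append(ex)
    ordered = []
    # Round-robin across label groups: take one example per label,
    # repeat until every group is exhausted.
    for row in zip_longest(*by_label.values()):
        ordered.extend(ex for ex in row if ex is not None)
    return ordered
```

Given two "pos" examples followed by two "neg" examples, this yields pos, neg, pos, neg, so neither label monopolizes the recency-weighted end of the prompt.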

Balance Labels: If you’re classifying into categories, represent labels proportionally to their expected distribution. Overrepresented labels tend to be over-predicted (majority-label bias).
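Proportional selection can be sketched as follows: floor each label's share of the k slots, then hand leftover slots to the labels with the largest fractional remainders. Function name, pool shape, and the naive first-n picking are all assumptions for illustration.

```python
import math

def sample_balanced(pool, target_dist, k):
    """Pick k exemplars from pool (list of (text, label)) whose label
    mix approximates target_dist ({label: fraction}). Sketch only:
    picks the first n matches per label, ignoring other criteria."""
    shares = {lab: frac * k for lab, frac in target_dist.items()}
    counts = {lab: math.floor(s) for lab, s in shares.items()}
    leftover = k - sum(counts.values())
    # Largest-remainder rounding so the counts sum exactly to k.
    for lab in sorted(shares, key=lambda l: shares[l] - counts[l],
                      reverse=True)[:leftover]:
        counts[lab] += 1
    picked = []
    for lab, n in counts.items():
        picked.extend([ex for ex in pool if ex[1] == lab][:n])
    return picked
```

With a 70/30 target and k=3, this yields two examples of the majority label and one of the minority, rather than all three from the majority.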

Prioritize Quality: One clear, accurate example beats three ambiguous ones. Remove or fix noisy examples rather than hoping quantity compensates.

Maintain Consistent Format: Use identical structure across all examples. The model learns format as part of the pattern. Inconsistency introduces noise.
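Rendering every exemplar (and the query) through one template enforces that consistency mechanically. The template text and field names are illustrative assumptions.

```python
# One template for exemplars and query alike; the query's empty label
# slot cues the model to complete the pattern. Field names are made up.
TEMPLATE = "Review: {text}\nSentiment: {label}"

def build_prompt(examples, query):
    """Render (text, label) exemplars plus a trailing query with an
    identical structure, joined by blank lines."""
    shots = [TEMPLATE.format(text=t, label=lab) for t, lab in examples]
    shots.append(TEMPLATE.format(text=query, label="").rstrip())
    return "\n\n".join(shots)
```

Because the query goes through the same template, any later change to the format automatically applies everywhere, so drift between exemplars and query cannot creep in.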

Match Similarity: Select examples semantically close to your target inputs. Retrieval-augmented approaches (selecting examples dynamically based on input similarity) often outperform static example sets.
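A dynamic selector can be sketched in a few lines. Real systems rank by dense-embedding similarity; a bag-of-words cosine stands in here purely to keep the sketch dependency-free, and the function names are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(pool, query, k=3):
    """Return the k exemplars from pool (list of (text, label))
    most similar to the query. Embedding models would replace
    cosine() in practice."""
    return sorted(pool, key=lambda ex: cosine(ex[0], query),
                  reverse=True)[:k]
```

Called with a query about a cat, this surfaces the cat-related exemplar from a mixed pool instead of whatever happened to come first in a static list.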

Show the Work: For reasoning tasks, include intermediate steps, not just input-output pairs. Chain-of-Thought exemplars outperform simple exemplars on complex tasks.
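The difference is visible side by side in the exemplar strings themselves. Both exemplars below are invented for illustration; only the first shows the intermediate steps the model is meant to imitate.

```python
# A Chain-of-Thought exemplar spells out the intermediate reasoning
# before the answer; a simple exemplar gives only the answer.
COT_EXEMPLAR = (
    "Q: A cafe sold 23 coffees in the morning and 17 in the afternoon. "
    "Each coffee costs $4. How much revenue did the cafe make?\n"
    "A: Morning plus afternoon is 23 + 17 = 40 coffees. "
    "At $4 each, revenue is 40 * 4 = $160. The answer is 160."
)

SIMPLE_EXEMPLAR = (
    "Q: A cafe sold 23 coffees in the morning and 17 in the afternoon. "
    "Each coffee costs $4. How much revenue did the cafe make?\n"
    "A: 160"
)
```

Prompting with exemplars shaped like the first string encourages the model to emit its own intermediate steps on new problems, which is where the accuracy gains on complex tasks come from.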

When This Especially Matters

  • Classification tasks where edge cases matter
  • Generation tasks requiring specific styles or formats
  • Reasoning tasks where the path to the answer matters as much as the answer itself
  • Production systems where consistent performance is critical

Common Mistakes

  • Using too-similar examples that don’t capture task variety
  • Copying examples from documentation without validating relevance
  • Ignoring example order as if it were arbitrary
  • Assuming more examples always improve performance
  • Failing to iterate on example selection as you learn what the model struggles with

Related: 05-molecule—prompting-technique-taxonomy, 05-atom—in-context-learning-definition, 05-atom—few-shot-cot-superiority, 05-atom—prompt-sensitivity-problem