Semantic Similarity Rating (SSR) Framework
Overview
A two-stage approach for extracting structured measurements from LLM responses without losing the richness of natural language output.
Stage 1: Textual Elicitation
Prompt the LLM to respond naturally to a question, with no constraints on format. Let it express nuance, hedging, and reasoning.
Stage 2: Embedding-Based Mapping
Convert the text response to a structured format by measuring semantic similarity to pre-defined reference anchors.
Components
Reference Anchor Sets
Pre-written statements that exemplify each point on your target scale. For purchase intent:
- “It’s very unlikely I’d buy it” → 1
- “I’m unsure either way” → 3
- “I’d definitely buy it” → 5
Multiple anchor sets can be averaged to reduce sensitivity to phrasing.
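As a concrete sketch, anchor sets can be stored as plain data, one exemplar statement per scale point. The intermediate statements here (points 2 and 4, and the second set's wording) are hypothetical illustrations, not taken from the paper:

```python
# Hypothetical anchor sets for a 5-point purchase-intent scale.
# Similarity scores from each set can be averaged to reduce
# sensitivity to any one phrasing.
ANCHOR_SETS = [
    {
        1: "It's very unlikely I'd buy it",
        2: "I probably wouldn't buy it",
        3: "I'm unsure either way",
        4: "I'd probably buy it",
        5: "I'd definitely buy it",
    },
    {
        1: "There's almost no chance I would purchase this",
        2: "I lean against purchasing this",
        3: "I could go either way on purchasing this",
        4: "I lean toward purchasing this",
        5: "I would certainly purchase this",
    },
]
```

Each set covers the full scale so that every point has at least one exemplar to compare against.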
Embedding Model
Any text embedding model that captures semantic similarity. The paper used OpenAI’s text-embedding-3-small; larger models showed no improvement.
Similarity Metric
Cosine similarity between the response embedding and each anchor embedding, converted to a probability distribution.
Temperature Parameter (optional)
Controls how “peaked” the resulting distribution is. T = 1 is a reasonable default.
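A minimal sketch of the mapping step, assuming a temperature-scaled softmax is used to turn cosine similarities into a distribution (the softmax choice is an assumption; the note only specifies cosine similarity, a probability distribution, and a temperature):

```python
import numpy as np

def ssr_distribution(response_emb, anchor_embs, temperature=1.0):
    """Map a response embedding to a probability distribution over
    scale points via cosine similarity to each anchor embedding.

    response_emb: (d,) array for the free-text response.
    anchor_embs:  (k, d) array, one row per scale point.
    """
    r = response_emb / np.linalg.norm(response_emb)
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    sims = a @ r                 # cosine similarities, shape (k,)
    logits = sims / temperature  # T < 1 sharpens, T > 1 flattens
    logits -= logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

The expected scale value then falls out as a weighted sum, e.g. `(np.arange(1, 6) * p).sum()` for a 5-point scale, while the full distribution `p` preserves the response's ambiguity.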
When to Use
- Converting qualitative responses to quantitative metrics
- Simulating survey data where direct numeric elicitation fails
- Any task requiring structured output where constraining the generation degrades quality
- Preserving uncertainty/ambiguity in responses rather than forcing point estimates
Limitations
- Requires manual design of reference anchor statements
- Results depend on anchor quality and coverage
- Embedding space may not perfectly capture task-relevant semantic relationships
- Works best in domains well-represented in embedding model training data
Validation Approach
Compare to human data using:
- Distributional similarity (Kolmogorov-Smirnov test)
- Ranking correlation (Pearson on means)
- Correlation attainment (synthetic-human correlation / human test-retest reliability)
All three metrics matter: a good distributional match with wrong rankings (or vice versa) indicates method failure.
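The three checks can be sketched as follows. The function names and report structure are hypothetical; the KS statistic is implemented directly (the maximum gap between empirical CDFs) rather than via a stats library:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples x and y."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return np.abs(cdf_x - cdf_y).max()

def validation_report(synthetic_means, human_means,
                      synthetic_ratings, human_ratings,
                      human_test_retest_r):
    """Compute all three validation metrics: distributional similarity,
    ranking correlation on per-item means, and correlation attainment
    relative to human test-retest reliability."""
    r = np.corrcoef(synthetic_means, human_means)[0, 1]
    return {
        "ks": ks_statistic(synthetic_ratings, human_ratings),
        "pearson_means": r,
        "attainment": r / human_test_retest_r,
    }
```

An attainment near 1.0 means the synthetic data correlates with human data about as well as humans correlate with themselves on retest; values well below that signal lost signal even if the marginal distributions look right.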
Extensions
The pattern generalizes beyond Likert scales:
- Sentiment classification with confidence
- Multi-label tagging with probability weights
- Any ordinal or categorical measurement where responses are inherently ambiguous
Related: 07-molecule—elicitation-design-principle, 07-molecule—vectors-vs-graphs