Semantic Similarity Rating (SSR) Framework

Overview

A two-stage approach for extracting structured measurements from LLM responses without losing the richness of natural language output.

Stage 1: Textual Elicitation
Prompt the LLM to respond naturally to a question. No constraints on format. Let it express nuance, hedging, and reasoning.

Stage 2: Embedding-Based Mapping
Convert the text response to a structured format by measuring semantic similarity to pre-defined reference anchors.
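A minimal sketch of the two-stage flow in Python, assuming the OpenAI SDK; the model name and prompt are illustrative, not from the source, and the Stage 2 helper is sketched under Components below:

```python
# Stage 1: free-form elicitation. No format constraints, so the model can
# hedge and reason naturally. Assumes the OpenAI Python SDK is installed.
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": "Would you buy this product at this price? "
                   "Answer in your own words.",
    }],
)
free_text = completion.choices[0].message.content

# Stage 2: map free_text onto the target scale via embedding similarity
# (see the ssr_distribution sketch under Components).
```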

Components

Reference Anchor Sets
Pre-written statements that exemplify each point on your target scale. For purchase intent:

  • “It’s very unlikely I’d buy it” → 1
  • “I’m unsure either way” → 3
  • “I’d definitely buy it” → 5

Multiple anchor sets can be averaged to reduce sensitivity to phrasing, as in the sketch below.
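One way to encode the anchors; only the three statements above come from the source, so the remaining wording (and the entire second set) is hypothetical:

```python
# Two hypothetical anchor sets for a 5-point purchase-intent scale.
# Keeping more than one set lets you average the resulting distributions.
ANCHOR_SETS = [
    {
        1: "It's very unlikely I'd buy it",
        2: "I probably wouldn't buy it",
        3: "I'm unsure either way",
        4: "I'd probably buy it",
        5: "I'd definitely buy it",
    },
    {
        1: "There's almost no chance I would purchase this",
        2: "I lean against purchasing this",
        3: "I could go either way on this purchase",
        4: "I lean toward purchasing this",
        5: "I would certainly purchase this",
    },
]
```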

Embedding Model
Any text embedding model that captures semantic similarity. The paper used OpenAI’s text-embedding-3-small; larger models showed no improvement.

Similarity Metric
Cosine similarity between the response embedding and each anchor embedding, converted to a probability distribution.

Temperature Parameter (optional)
Controls how “peaked” the resulting distribution is. T = 1 is a reasonable default; lower values yield more peaked distributions.
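A minimal sketch of the mapping step, assuming OpenAI’s embeddings API and a temperature-scaled softmax over cosine similarities (the paper’s exact normalization may differ):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def ssr_distribution(response: str, anchors: dict, T: float = 1.0) -> dict:
    """Map one free-text response to a probability distribution over scale points."""
    points = sorted(anchors)
    vecs = embed([response] + [anchors[p] for p in points])
    r, a = vecs[0], vecs[1:]
    # Cosine similarity between the response and each anchor embedding.
    sims = a @ r / (np.linalg.norm(a, axis=1) * np.linalg.norm(r))
    # Softmax with temperature T: lower T gives a more peaked distribution.
    logits = sims / T
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return dict(zip(points, probs))

# Averaging over several anchor sets (see ANCHOR_SETS above) reduces
# sensitivity to phrasing:
#   dists = [ssr_distribution(free_text, s) for s in ANCHOR_SETS]
#   avg = {p: np.mean([d[p] for d in dists]) for p in dists[0]}
```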

When to Use

  • Converting qualitative responses to quantitative metrics
  • Simulating survey data where direct numeric elicitation fails
  • Any task requiring structured output where constraining the generation degrades quality
  • Preserving uncertainty/ambiguity in responses rather than forcing point estimates

Limitations

  • Requires manual design of reference anchor statements
  • Results depend on anchor quality and coverage
  • Embedding space may not perfectly capture task-relevant semantic relationships
  • Works best in domains well-represented in embedding model training data

Validation Approach

Compare to human data using:

  • Distributional similarity (Kolmogorov-Smirnov test)
  • Ranking agreement (Pearson correlation on per-item means)
  • Correlation attainment (synthetic-human correlation / human test-retest reliability)

All three metrics matter. Good distributions with wrong rankings (or vice versa) indicate method failure; a sketch of all three checks follows.
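A sketch with NumPy/SciPy; the array names and the test_retest_r input are assumptions about how the data is organized:

```python
import numpy as np
from scipy.stats import ks_2samp, pearsonr

def validate(synthetic_ratings: np.ndarray, human_ratings: np.ndarray,
             synthetic_means: np.ndarray, human_means: np.ndarray,
             test_retest_r: float) -> dict:
    """Compare synthetic and human ratings on all three criteria.

    *_ratings: pooled individual ratings (for the distributional test);
    *_means:   per-item mean ratings (for the correlation checks);
    test_retest_r: human test-retest reliability, the correlation ceiling.
    """
    ks_stat, ks_p = ks_2samp(synthetic_ratings, human_ratings)
    r, _ = pearsonr(synthetic_means, human_means)
    return {
        "ks_stat": ks_stat,                     # distributional similarity
        "ks_p": ks_p,
        "pearson_r": r,                         # agreement of item means
        "correlation_attainment": r / test_retest_r,
    }
```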

Extensions

The pattern generalizes beyond Likert scales (a categorical sketch follows the list):

  • Sentiment classification with confidence
  • Multi-label tagging with probability weights
  • Any ordinal or categorical measurement where responses are inherently ambiguous
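For instance, the ssr_distribution sketch above works unchanged with categorical anchors; the labels and wording here are hypothetical:

```python
# Hypothetical categorical anchors: keys become labels, not scale points.
SENTIMENT_ANCHORS = {
    "negative": "This is overwhelmingly negative",
    "neutral": "This is neutral in tone",
    "positive": "This is overwhelmingly positive",
}

# dist = ssr_distribution("Honestly, I was pleasantly surprised.",
#                         SENTIMENT_ANCHORS)
# dist is then a probability distribution over the three labels.
```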

Related: 07-molecule—elicitation-design-principle, 07-molecule—vectors-vs-graphs