Semantic Similarity Rating (SSR) Framework

Overview

A two-stage approach for extracting structured measurements from LLM responses without losing the richness of natural language output.

Stage 1: Textual Elicitation
Prompt the LLM to respond naturally to a question. No constraints on format. Let it express nuance, hedging, and reasoning.

Stage 2: Embedding-Based Mapping
Convert the text response to a structured format by measuring semantic similarity to pre-defined reference anchors.
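A minimal sketch of the two-stage flow in Python, assuming the OpenAI SDK; the model name and prompt are illustrative, not from the source, and the Stage 2 helper is sketched under Components below:

```python
# Stage 1: free-form elicitation. No format constraints, so the model can
# hedge and reason naturally. Assumes the OpenAI Python SDK is installed.
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": "Would you buy this product at this price? "
                   "Answer in your own words.",
    }],
)
free_text = completion.choices[0].message.content

# Stage 2: map free_text onto the target scale via embedding similarity
# (see the ssr_distribution sketch under Components).
```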

Components

Reference Anchor Sets
Pre-written statements that exemplify each point on your target scale. For purchase intent:

  • “It’s very unlikely I’d buy it” → 1
  • “I’m unsure either way” → 3
  • “I’d definitely buy it” → 5

Multiple anchor sets can be averaged to reduce sensitivity to phrasing, as in the sketch below.
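One way to encode the anchors; only the three statements above come from the source, so the remaining wording (and the entire second set) is hypothetical:

```python
# Two hypothetical anchor sets for a 5-point purchase-intent scale.
# Keeping more than one set lets you average the resulting distributions.
ANCHOR_SETS = [
    {
        1: "It's very unlikely I'd buy it",
        2: "I probably wouldn't buy it",
        3: "I'm unsure either way",
        4: "I'd probably buy it",
        5: "I'd definitely buy it",
    },
    {
        1: "There's almost no chance I would purchase this",
        2: "I lean against purchasing this",
        3: "I could go either way on this purchase",
        4: "I lean toward purchasing this",
        5: "I would certainly purchase this",
    },
]
```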

Embedding Model
Any text embedding model that captures semantic similarity. The paper used OpenAI’s text-embedding-3-small; larger models showed no improvement.

Similarity Metric
Cosine similarity between the response embedding and each anchor embedding, converted to a probability distribution.

Temperature Parameter (optional)
Controls how “peaked” the resulting distribution is. T = 1 is a reasonable default; lower values yield more peaked distributions.
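A minimal sketch of the mapping step, assuming OpenAI’s embeddings API and a temperature-scaled softmax over cosine similarities (the paper’s exact normalization may differ):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def ssr_distribution(response: str, anchors: dict, T: float = 1.0) -> dict:
    """Map one free-text response to a probability distribution over scale points."""
    points = sorted(anchors)
    vecs = embed([response] + [anchors[p] for p in points])
    r, a = vecs[0], vecs[1:]
    # Cosine similarity between the response and each anchor embedding.
    sims = a @ r / (np.linalg.norm(a, axis=1) * np.linalg.norm(r))
    # Softmax with temperature T: lower T gives a more peaked distribution.
    logits = sims / T
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return dict(zip(points, probs))

# Averaging over several anchor sets (see ANCHOR_SETS above) reduces
# sensitivity to phrasing:
#   dists = [ssr_distribution(free_text, s) for s in ANCHOR_SETS]
#   avg = {p: np.mean([d[p] for d in dists]) for p in dists[0]}
```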

When to Use

  • Converting qualitative responses to quantitative metrics
  • Simulating survey data where direct numeric elicitation fails
  • Any task requiring structured output where constraining the generation degrades quality
  • Preserving uncertainty/ambiguity in responses rather than forcing point estimates

Limitations

  • Requires manual design of reference anchor statements
  • Results depend on anchor quality and coverage
  • Embedding space may not perfectly capture task-relevant semantic relationships
  • Works best in domains well-represented in embedding model training data

Validation Approach

Compare to human data using:

  • Distributional similarity (Kolmogorov-Smirnov test)
  • Ranking agreement (Pearson correlation on per-item means)
  • Correlation attainment (synthetic-human correlation / human test-retest reliability)

All three metrics matter. Good distributions with wrong rankings (or vice versa) indicate method failure; a sketch of all three checks follows.
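A sketch with NumPy/SciPy; the array names and the test_retest_r input are assumptions about how the data is organized:

```python
import numpy as np
from scipy.stats import ks_2samp, pearsonr

def validate(synthetic_ratings: np.ndarray, human_ratings: np.ndarray,
             synthetic_means: np.ndarray, human_means: np.ndarray,
             test_retest_r: float) -> dict:
    """Compare synthetic and human ratings on all three criteria.

    *_ratings: pooled individual ratings (for the distributional test);
    *_means:   per-item mean ratings (for the correlation checks);
    test_retest_r: human test-retest reliability, the correlation ceiling.
    """
    ks_stat, ks_p = ks_2samp(synthetic_ratings, human_ratings)
    r, _ = pearsonr(synthetic_means, human_means)
    return {
        "ks_stat": ks_stat,                     # distributional similarity
        "ks_p": ks_p,
        "pearson_r": r,                         # agreement of item means
        "correlation_attainment": r / test_retest_r,
    }
```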

Extensions

The pattern generalizes beyond Likert scales (a categorical sketch follows the list):

  • Sentiment classification with confidence
  • Multi-label tagging with probability weights
  • Any ordinal or categorical measurement where responses are inherently ambiguous
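For instance, the ssr_distribution sketch above works unchanged with categorical anchors; the labels and wording here are hypothetical:

```python
# Hypothetical categorical anchors: keys become labels, not scale points.
SENTIMENT_ANCHORS = {
    "negative": "This is overwhelmingly negative",
    "neutral": "This is neutral in tone",
    "positive": "This is overwhelmingly positive",
}

# dist = ssr_distribution("Honestly, I was pleasantly surprised.",
#                         SENTIMENT_ANCHORS)
# dist is then a probability distribution over the three labels.
```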

Related: 07-molecule—elicitation-design-principle, 07-molecule—vectors-vs-graphs