Correlation Attainment

A validation metric for synthetic data quality that measures synthetic-human correlation as a percentage of the theoretical maximum achievable correlation.

The problem it solves:

A raw correlation of r = 0.72 between synthetic and human survey results sounds mediocre. But what’s the ceiling? If human test-retest reliability is itself only r = 0.80, then 0.72 represents 0.72 / 0.80 = 90% attainment, nearly as good as humans agreeing with themselves.

Formula:

ρ = E[R_xy] / E[R_xx]

Where R_xy is correlation between synthetic and human data, and R_xx is correlation between split halves of human data (simulated test-retest).
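
As a minimal sketch of how the ratio could be estimated, the Python below averages both correlations over repeated random splits of the human respondents. The function names, the `n_splits` and `seed` parameters, and the input layout (one list of individual ratings per survey, one synthetic mean per survey) are assumptions made for illustration. So is the choice to correlate the synthetic data against one human half rather than the full human sample, which keeps the sample size comparable between numerator and denominator.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def correlation_attainment(human_responses, synthetic_means, n_splits=200, seed=0):
    """Estimate E[R_xy] / E[R_xx] over random split halves.

    human_responses: list of lists, individual human ratings per survey
                     (hypothetical layout).
    synthetic_means: list with one synthetic mean rating per survey.
    """
    rng = random.Random(seed)
    r_xy, r_xx = [], []
    for _ in range(n_splits):
        half_a, half_b = [], []
        for ratings in human_responses:
            shuffled = list(ratings)
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            half_a.append(mean(shuffled[:mid]))
            half_b.append(mean(shuffled[mid:]))
        # R_xx: human half vs. human half (simulated test-retest).
        r_xx.append(pearson(half_a, half_b))
        # R_xy: synthetic vs. one human half (assumed here for comparability).
        r_xy.append(pearson(synthetic_means, half_a))
    return mean(r_xy) / mean(r_xx)
```

With noise-free human data the ceiling is 1, so the attainment reduces to the raw correlation; with noisy data the denominator drops below 1 and the attainment rises above the raw correlation accordingly.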

Why it matters:

Human survey data is noisy. Narrow distributions (mean PI around 4.0 with a standard deviation of only 0.1 across 57 surveys) leave little between-survey variance to predict, so they create low ceilings for any prediction method. Correlation attainment normalizes against this inherent noise, revealing that synthetic methods perform closer to human reliability than raw correlations suggest.
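
The effect of a narrow distribution on the ceiling can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of any real dataset: the 57 surveys, the mean of 4.0, and the spread of 0.1 come from the note above, while the respondent count (100 per survey), the respondent noise (standard deviation 1.0), and the use of continuous Gaussian ratings instead of discrete scale points are hypothetical choices.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def split_half_ceiling(spread, n_surveys=57, n_respondents=100,
                       noise_sd=1.0, seed=0):
    """Simulate one split-half correlation (R_xx) for a given spread
    of true survey means around 4.0. All noise settings are assumed."""
    rng = random.Random(seed)
    half_a, half_b = [], []
    for _ in range(n_surveys):
        true_mean = rng.gauss(4.0, spread)
        ratings = [rng.gauss(true_mean, noise_sd) for _ in range(n_respondents)]
        mid = n_respondents // 2
        half_a.append(mean(ratings[:mid]))
        half_b.append(mean(ratings[mid:]))
    return pearson(half_a, half_b)


narrow = split_half_ceiling(spread=0.1)  # survey means barely differ
wide = split_half_ceiling(spread=1.0)    # survey means differ a lot
print(f"narrow ceiling: {narrow:.2f}, wide ceiling: {wide:.2f}")
```

Under these assumptions the narrow case yields a much lower split-half correlation than the wide case, even though the respondents are equally noisy in both. That gap is what correlation attainment normalizes away.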

This framing applies to any synthetic data validation problem: benchmark against the ceiling, not against perfection.

Related: 06-molecule—ssr-framework