LLM Oracles Match Human Experts with a 20% Error Rate

In ontology alignment tasks, LLM-based oracles (Gemini 2.5 Flash, GPT-4o Mini) achieved diagnostic performance comparable to that of simulated human domain experts with a 20% error rate.

The best LLM configuration achieved an average Youden’s Index of ~0.55 across nine biomedical ontology matching tasks. A perfect oracle scores 1.0. The LLM results showed no statistically significant difference from a simulated human expert with a 20% error rate (two-sided p > 0.1).
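For reference, Youden's Index is sensitivity plus specificity minus one. The sketch below computes it from confusion-matrix counts and shows what a simulated expert would score under one simplifying assumption that is not stated in the source: the 20% error rate applies equally to true and false candidate mappings. The function names and the example counts are hypothetical, chosen only for illustration.

```python
def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden's J = sensitivity + specificity - 1.

    tp/fn: correct mappings the oracle accepted / rejected.
    tn/fp: incorrect mappings the oracle rejected / accepted.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def expected_youden_symmetric(error_rate: float) -> float:
    """Expected J for an oracle that errs at the same rate on both classes.

    Assumption (not from the source): errors are symmetric, so
    sensitivity = specificity = 1 - error_rate.
    """
    return 2.0 * (1.0 - error_rate) - 1.0


if __name__ == "__main__":
    # Hypothetical counts: 100 correct and 100 incorrect candidate mappings,
    # with the oracle wrong on 20 of each.
    print(youden_index(tp=80, fn=20, tn=80, fp=20))
    # Error-free oracle and the symmetric 20%-error expert.
    print(expected_youden_symmetric(0.0))
    print(expected_youden_symmetric(0.2))
```

Under the symmetric-error assumption, a 20% error rate corresponds to an index of about 0.6, which is in the same range as the ~0.55 reported for the best LLM configuration.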

This suggests LLMs can serve as a scalable substitute for domain experts in specific validation tasks. They are not perfect, but they are good enough for practical use at dramatically lower cost.

Related: 05-atom—llm-validation-cost-per-thousand, 05-molecule—targeted-llm-intervention-pattern, 01-atom—human-in-the-loop