ReAct Benchmark Results

Empirical results from ReAct (Yao et al., 2022), which interleaves free-form reasoning traces with environment actions in LLMs (PaLM-540B).
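
A minimal sketch of that interleaved loop, assuming any prompt→completion function (llm) and any tool environment with a step method (env); both names are hypothetical stand-ins, and only the Thought/Action/Observation trace format comes from the paper:

    from typing import Callable

    def react_loop(llm: Callable[[str], str], env, question: str,
                   max_steps: int = 8) -> str:
        prompt = f"Question: {question}\n"
        for i in range(1, max_steps + 1):
            # Reason about the current state in free-form text.
            thought = llm(prompt + f"Thought {i}:").strip()
            prompt += f"Thought {i}: {thought}\n"
            # Choose an action, e.g. search[query] or finish[answer].
            action = llm(prompt + f"Action {i}:").strip()
            prompt += f"Action {i}: {action}\n"
            if action.startswith("finish["):
                return action[len("finish["):-1]  # terminal action carries the answer
            # Execute the action; the observation grounds the next thought,
            # which is the mechanism behind the hallucination numbers below.
            observation = env.step(action)
            prompt += f"Observation {i}: {observation}\n"
        return ""  # no answer within the step budget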

Failure mode comparison (HotpotQA, n=200):

  • Chain-of-thought hallucination rate: 56% of failures
  • Grounded reasoning (ReAct) hallucination rate: 0% of failures
  • Grounded reasoning error rate: 47% of failures (the interleaved format constrains flexibility in composing reasoning steps)
  • False positive rate (hallucination present despite a correct final answer): CoT 14% vs. grounded 6%

Task performance:

  • HotpotQA: ReAct combined with CoT-SC reaches 35.1% EM vs. 33.4% for CoT-SC alone (exact match; normalization sketched after this list)
  • ALFWorld (household tasks): 71% success vs. 37% for the BUTLER imitation-learning baseline
  • WebShop (web navigation): 40% success vs. 28.7% for IL+RL methods
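
EM counts a prediction as correct only if it equals a gold answer after light normalization. A sketch of the standard SQuAD/HotpotQA-style normalization (the paper uses the benchmark's official scorer; this is an approximation of it):

    import re
    import string

    def normalize(ans: str) -> str:
        # Lowercase, drop punctuation, drop articles, collapse whitespace.
        ans = "".join(ch for ch in ans.lower() if ch not in string.punctuation)
        ans = re.sub(r"\b(a|an|the)\b", " ", ans)
        return " ".join(ans.split())

    def exact_match(prediction: str, gold: str) -> bool:
        return normalize(prediction) == normalize(gold)

    exact_match("The Nile River", "nile river")  # True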

Training efficiency:

  • Prompting with only 1-6 in-context examples outperformed methods trained on 10³-10⁵ task instances (prompt assembly sketched below)
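
The entire "training set" is a handful of solved trajectories pasted into the prompt of a frozen model. A sketch, where build_prompt and the exemplar format are hypothetical:

    def build_prompt(exemplars: list[str], question: str) -> str:
        # Each exemplar is a full solved Question/Thought/Action/Observation trace.
        assert 1 <= len(exemplars) <= 6  # the few-shot regime reported above
        return "\n\n".join(exemplars) + f"\n\nQuestion: {question}\n"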

Related: 05-atom—hallucination-from-ungrounded-reasoning, 05-atom—reasoning-grounding-tradeoff