ReAct Benchmark Results

Empirical results from ReAct (Yao et al., 2022), which interleaves free-form reasoning traces with environment actions in LLMs (PaLM-540B).
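
A minimal sketch of that interleaved loop, assuming any prompt→completion function (llm) and any tool environment with a step method (env); both names are hypothetical stand-ins, and only the Thought/Action/Observation trace format comes from the paper:

    from typing import Callable

    def react_loop(llm: Callable[[str], str], env, question: str,
                   max_steps: int = 8) -> str:
        prompt = f"Question: {question}\n"
        for i in range(1, max_steps + 1):
            # Reason about the current state in free-form text.
            thought = llm(prompt + f"Thought {i}:").strip()
            prompt += f"Thought {i}: {thought}\n"
            # Choose an action, e.g. search[query] or finish[answer].
            action = llm(prompt + f"Action {i}:").strip()
            prompt += f"Action {i}: {action}\n"
            if action.startswith("finish["):
                return action[len("finish["):-1]  # terminal action carries the answer
            # Execute the action; the observation grounds the next thought,
            # which is the mechanism behind the hallucination numbers below.
            observation = env.step(action)
            prompt += f"Observation {i}: {observation}\n"
        return ""  # no answer within the step budget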

Failure mode comparison (HotpotQA, n=200):

  • Chain-of-thought hallucination rate: 56% of failures
  • Grounded reasoning (ReAct) hallucination rate: 0% of failures
  • Grounded reasoning error rate: 47% of failures (the interleaved format constrains flexibility in composing reasoning steps)
  • False positive rate (hallucination present despite a correct final answer): CoT 14% vs. grounded 6%

Task performance:

  • HotpotQA: ReAct combined with CoT-SC reaches 35.1% EM vs. 33.4% for CoT-SC alone (exact match; normalization sketched after this list)
  • ALFWorld (household tasks): 71% success vs. 37% for the BUTLER imitation-learning baseline
  • WebShop (web navigation): 40% success vs. 28.7% for IL+RL methods
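
EM counts a prediction as correct only if it equals a gold answer after light normalization. A sketch of the standard SQuAD/HotpotQA-style normalization (the paper uses the benchmark's official scorer; this is an approximation of it):

    import re
    import string

    def normalize(ans: str) -> str:
        # Lowercase, drop punctuation, drop articles, collapse whitespace.
        ans = "".join(ch for ch in ans.lower() if ch not in string.punctuation)
        ans = re.sub(r"\b(a|an|the)\b", " ", ans)
        return " ".join(ans.split())

    def exact_match(prediction: str, gold: str) -> bool:
        return normalize(prediction) == normalize(gold)

    exact_match("The Nile River", "nile river")  # True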

Training efficiency:

  • Prompting with only 1-6 in-context examples outperformed methods trained on 10³-10⁵ task instances (prompt assembly sketched below)
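
The entire "training set" is a handful of solved trajectories pasted into the prompt of a frozen model. A sketch, where build_prompt and the exemplar format are hypothetical:

    def build_prompt(exemplars: list[str], question: str) -> str:
        # Each exemplar is a full solved Question/Thought/Action/Observation trace.
        assert 1 <= len(exemplars) <= 6  # the few-shot regime reported above
        return "\n\n".join(exemplars) + f"\n\nQuestion: {question}\n"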

Related: 05-atom—hallucination-from-ungrounded-reasoning, 05-atom—reasoning-grounding-tradeoff