Tool Delegation Pattern

Context

LLMs can decompose complex problems into logical steps but frequently make arithmetic errors when executing calculations. Even when the reasoning chain is correct, the final answer may be wrong due to computational mistakes.

Problem

How do you get accurate results on reasoning tasks that involve numerical computation, when the language model can generate a correct reasoning chain but executes the calculations unreliably?

Solution

Separate reasoning from computation. Have the LLM generate executable code (typically Python) as its reasoning output, then delegate execution to an external interpreter. The model handles what it does well (understanding the problem, decomposing it into steps); the interpreter handles what it does better (arithmetic, symbolic manipulation).

Program-of-Thoughts (PoT): The model generates a Python program that encodes the reasoning steps. The final answer comes from running the program, not from the model’s text generation.

Program-Aided Language Models (PAL): A similar approach, but the generated code interleaves natural-language comments with the executable statements, combining interpretability with computational accuracy.
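
Both variants rely on the same execution loop: prompt the model for a program, run that program, and read the answer out of the interpreter rather than out of the model's text. Below is a minimal sketch of that loop, assuming the convention that the generated program assigns its result to a variable named answer; generate_program is a hypothetical stand-in for the actual model call.

def generate_program(question: str) -> str:
    # Stand-in for the LLM call: in practice this sends a PoT/PAL-style
    # prompt to the model and returns the Python source it generates.
    # Canned output here so the sketch runs end to end.
    return "answer = sum(range(1, 101))"

def solve(question: str):
    code = generate_program(question)  # the model does the reasoning -> code
    namespace = {}
    exec(code, namespace)              # the interpreter does the computation
    return namespace["answer"]         # by convention the program sets answer

print(solve("What is the sum of the integers from 1 to 100?"))  # -> 5050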

Implementation

Q: In the Fibonacci sequence, what is the 50th number?

# Standard CoT (error-prone):
"The first number is 0, the second is 1, so the third is 1, 
 fourth is 2, fifth is 3..." [continues with potential errors]

# Tool Delegation (PoT):
def fibonacci(n):
    # 1-indexed: fibonacci(1) == 0, fibonacci(2) == 1, ...
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a
print(fibonacci(50))
# → Execute in Python interpreter → 7778742049
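
A PAL-style rendering of the same program interleaves the natural-language reasoning as comments between the executable statements; a sketch of what such generated code might look like:

# Tool Delegation (PAL-style sketch):
# The first number is 0 and the second is 1.
a, b = 0, 1
# Each later number is the sum of the previous two; stepping forward
# 49 times leaves the 50th number in a.
for _ in range(49):
    a, b = b, a + b
# The answer is the 50th Fibonacci number.
answer = a
print(answer)
# → Execute in Python interpreter → 7778742049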

Consequences

Benefits:

  • Eliminates arithmetic errors (provided the generated code is correct)
  • Scales to arbitrary computational complexity
  • Maintains full interpretability through code

Tradeoffs:

  • Requires models trained on code (though most modern LLMs qualify)
  • Introduces security considerations from executing model-generated code (see the sketch after this list)
  • Limited to tasks expressible as computation
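
A common way to contain the security tradeoff is to run the generated program in a separate interpreter process with a timeout rather than exec-ing it in the host process. A minimal sketch, covering process isolation and a timeout only; real deployments add OS-level sandboxing, resource limits, and network restrictions:

import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    # Write the model-generated program to a temporary file and run it in a
    # child interpreter so it cannot touch the host process's state.
    # Raises subprocess.TimeoutExpired if the program runs too long.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

print(run_untrusted("print(sum(range(1, 101)))"))  # -> 5050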

Performance: Program-of-Thoughts reports an average improvement of roughly 12% over Chain-of-Thought on numerical tasks; PAL reaches 90%+ accuracy on some reasoning benchmarks.

Related: 05-molecule—chain-of-thought-prompting, 05-atom—llm-decomposition-vs-computation