PN: Tree of Thoughts — Deliberate Multi-Path Reasoning for LLMs

Core Idea

Tree of Thoughts (ToT) (Yao et al., NeurIPS 2023) frames LLM problem-solving as a search over a tree of intermediate reasoning steps (“thoughts”). Unlike Chain-of-Thought which generates one linear path, ToT explores multiple candidate continuations at each step, evaluates them with an LLM judge, and applies classical search algorithms (BFS, DFS, beam search) to find the best path. This enables backtracking and deliberate planning — capabilities absent in autoregressive text generation.


Motivation: The Limitations of Chain-of-Thought

Standard CoT generates a single left-to-right reasoning trace. This fails when:

  1. Problems require exploration: multiple plausible solution paths exist and the optimal one is non-obvious
  2. Local decisions affect global outcomes: an early error propagates irreversibly
  3. Backtracking is needed: the current path is provably dead-end but CoT cannot reverse

ToT addresses all three by treating reasoning as search, not generation.


The Tree Structure

                    [Problem]
                   /    |    \
           [Thought A] [Thought B] [Thought C]
           /     \        |         /    \
         [A1]   [A2]   [B1]     [C1]  [C2]
          ✓      ✗      ✓        ✗     ✓
                        |              |
                       [B1a]         [C2a]  ← best path
  • Node: a partial solution state (intermediate thought)
  • Edge: one reasoning step (LLM generation from parent → child)
  • Evaluation: LLM assigns each node a score (sure / likely / impossible)
  • Search: BFS, DFS, or beam search traverses the tree to find the optimal leaf

Four Components

1. Thought Decomposition

Decide what constitutes one “thought” — the granularity of search steps:

  • Word-level: each thought = one word (useful for constrained generation)
  • Sentence-level: each thought = one reasoning sentence
  • Paragraph-level: each thought = a complete sub-problem solution

Finer granularity → more exploration, more tokens. PUMA uses paragraph-level for backlog analysis.

2. Thought Generation

Two strategies:

StrategyMethodWhen to Use
SampleGenerate k thoughts independently with temperatureCreative tasks; diverse exploration
ProposeGenerate k thoughts in one call with “Give k different approaches” promptAnalytical tasks; efficiency

3. State Evaluation

LLM evaluates each thought node:

Prompt: "Given problem [P] and partial solution [S], rate the likelihood 
         this path leads to a correct solution: sure / likely / impossible"

Alternatively: numerical score (0–10) for soft search priorities.

4. Search Algorithm

AlgorithmDescriptionBest For
BFSExpand all nodes at depth d before depth d+1Short horizons; guaranteed optimum
DFSExplore one path fully before backtrackingLong horizons; memory-constrained
Beam searchKeep top-b nodes at each depthBalance quality vs. compute
MCTSMonte Carlo rollouts for expected valueStochastic environments

Empirical Results (Yao et al., 2023)

BenchmarkCoT (GPT-4)CoT-SC (k=100)ToT (BFS b=5)
Game of 244%9%74%
Creative Writing+41% win rate vs. CoT
Mini Crosswords16%33%60%

Key insight: for the Game of 24 (mathematical puzzle requiring multi-step backtracking), ToT improves over CoT by 18.5×. This highlights ToT’s advantage precisely where sequential reasoning breaks down.


Cost Analysis

ToT is significantly more expensive than CoT:

MethodAPI calls per taskTokensRelative cost
CoT1~500
Self-Consistency (k=40)40~20k40×
ToT (BFS b=5, d=3)~50–200~25–100k50–200×

Compute Cost

ToT is 50–200× more expensive than standard CoT. It is impractical for high-volume inference (e.g., PUMA Stage 1 bulk triage of 10k+ issues). Reserve for low-volume, high-stakes decisions.


PUMA Application

Where ToT Applies: Stage 3 — Backlog Prioritisation

Scenario: Given a sprint backlog of 15 issues, determine the optimal 5-issue sprint plan considering dependencies, priority, and team capacity.

This is a search problem: multiple orderings exist, and the optimal one requires evaluating combinations — exactly what ToT is designed for.

def tot_sprint_plan(issues: list[dict], capacity: int, k: int = 3) -> list[dict]:
    """ToT-based sprint planning: BFS over candidate sprint subsets."""
    candidates = []
    
    # Thought generation: propose k alternative 5-issue subsets
    for _ in range(k):
        subset = llm.propose(
            f"Select 5 issues from {issues} for a sprint with capacity {capacity}. "
            f"Consider priority, dependencies, and team skills."
        )
        candidates.append(subset)
    
    # State evaluation: LLM judges each subset
    scored = [(llm.evaluate(c, issues), c) for c in candidates]
    
    # Select best scoring candidate
    return max(scored, key=lambda x: x[0])[1]

Where ToT Does NOT Apply: Stage 1 — Issue Type Classification

  • Volume: 10,000+ issues per dataset
  • At 50× CoT cost, ToT would be prohibitively expensive
  • CoT + Self-Consistency (k=5) is the practical alternative

Summary

PUMA StageMethodJustification
Stage 1: Type classificationCoT few-shotHigh volume; binary decision
Stage 2: Priority estimationCoT + Self-ConsistencyMultiple priority levels; ensemble
Stage 3: Sprint planningToT (BFS b=3)Combinatorial search; low volume
Stage 4: Iterative refinementReflexionSelf-improvement without re-search

Relation to Other Reasoning Methods

MethodReasoning StructureBacktrackingCost
Zero-shotSingle generationNo
CoTLinear chainNo1.5×
Self-Consistencyk parallel chains → voteNo
ReflexionSequential retry with critiqueLimited~3×
ToTTree search with evaluationYes50–200×
RAP (MCTS)Monte Carlo tree searchYes100–500×

MOCs