PN: Tree of Thoughts — Deliberate Multi-Path Reasoning for LLMs

Core Idea

Tree of Thoughts (ToT) (Yao et al., NeurIPS 2023) frames LLM problem-solving as a search over a tree of intermediate reasoning steps (“thoughts”). Unlike Chain-of-Thought which generates one linear path, ToT explores multiple candidate continuations at each step, evaluates them with an LLM judge, and applies classical search algorithms (BFS, DFS, beam search) to find the best path. This enables backtracking and deliberate planning — capabilities absent in autoregressive text generation.

Motivation: The Limitations of Chain-of-Thought

Standard CoT generates a single left-to-right reasoning trace. This fails when:

Problems require exploration: multiple plausible solution paths exist and the optimal one is non-obvious
Local decisions affect global outcomes: an early error propagates irreversibly
Backtracking is needed: the current path is provably dead-end but CoT cannot reverse

ToT addresses all three by treating reasoning as search, not generation.

The Tree Structure

                    [Problem]
                   /    |    \
           [Thought A] [Thought B] [Thought C]
           /     \        |         /    \
         [A1]   [A2]   [B1]     [C1]  [C2]
          ✓      ✗      ✓        ✗     ✓
                        |              |
                       [B1a]         [C2a]  ← best path

Node: a partial solution state (intermediate thought)
Edge: one reasoning step (LLM generation from parent → child)
Evaluation: LLM assigns each node a score (sure / likely / impossible)
Search: BFS, DFS, or beam search traverses the tree to find the optimal leaf

Four Components

1. Thought Decomposition

Decide what constitutes one “thought” — the granularity of search steps:

Word-level: each thought = one word (useful for constrained generation)
Sentence-level: each thought = one reasoning sentence
Paragraph-level: each thought = a complete sub-problem solution

Finer granularity → more exploration, more tokens. PUMA uses paragraph-level for backlog analysis.

2. Thought Generation

Two strategies:

Strategy	Method	When to Use
Sample	Generate k thoughts independently with temperature	Creative tasks; diverse exploration
Propose	Generate k thoughts in one call with “Give k different approaches” prompt	Analytical tasks; efficiency

3. State Evaluation

LLM evaluates each thought node:

Prompt: "Given problem [P] and partial solution [S], rate the likelihood 
         this path leads to a correct solution: sure / likely / impossible"

Alternatively: numerical score (0–10) for soft search priorities.

4. Search Algorithm

Algorithm	Description	Best For
BFS	Expand all nodes at depth d before depth d+1	Short horizons; guaranteed optimum
DFS	Explore one path fully before backtracking	Long horizons; memory-constrained
Beam search	Keep top-b nodes at each depth	Balance quality vs. compute
MCTS	Monte Carlo rollouts for expected value	Stochastic environments

Empirical Results (Yao et al., 2023)

Benchmark	CoT (GPT-4)	CoT-SC (k=100)	ToT (BFS b=5)
Game of 24	4%	9%	74%
Creative Writing	—	—	+41% win rate vs. CoT
Mini Crosswords	16%	33%	60%

Key insight: for the Game of 24 (mathematical puzzle requiring multi-step backtracking), ToT improves over CoT by 18.5×. This highlights ToT’s advantage precisely where sequential reasoning breaks down.

Cost Analysis

ToT is significantly more expensive than CoT:

Method	API calls per task	Tokens	Relative cost
CoT	1	~500	1×
Self-Consistency (k=40)	40	~20k	40×
ToT (BFS b=5, d=3)	~50–200	~25–100k	50–200×

Compute Cost

ToT is 50–200× more expensive than standard CoT. It is impractical for high-volume inference (e.g., PUMA Stage 1 bulk triage of 10k+ issues). Reserve for low-volume, high-stakes decisions.

PUMA Application

Where ToT Applies: Stage 3 — Backlog Prioritisation

Scenario: Given a sprint backlog of 15 issues, determine the optimal 5-issue sprint plan considering dependencies, priority, and team capacity.

This is a search problem: multiple orderings exist, and the optimal one requires evaluating combinations — exactly what ToT is designed for.

def tot_sprint_plan(issues: list[dict], capacity: int, k: int = 3) -> list[dict]:
    """ToT-based sprint planning: BFS over candidate sprint subsets."""
    candidates = []
    
    # Thought generation: propose k alternative 5-issue subsets
    for _ in range(k):
        subset = llm.propose(
            f"Select 5 issues from {issues} for a sprint with capacity {capacity}. "
            f"Consider priority, dependencies, and team skills."
       )
        candidates.append(subset)
    
    # State evaluation: LLM judges each subset
    scored = [(llm.evaluate(c, issues), c) for c in candidates]
    
    # Select best scoring candidate
    return max(scored, key=lambda x: x[0])[1]

Where ToT Does NOT Apply: Stage 1 — Issue Type Classification

Volume: 10,000+ issues per dataset
At 50× CoT cost, ToT would be prohibitively expensive
CoT + Self-Consistency (k=5) is the practical alternative

Summary

PUMA Stage	Method	Justification
Stage 1: Type classification	CoT few-shot	High volume; binary decision
Stage 2: Priority estimation	CoT + Self-Consistency	Multiple priority levels; ensemble
Stage 3: Sprint planning	ToT (BFS b=3)	Combinatorial search; low volume
Stage 4: Iterative refinement	Reflexion	Self-improvement without re-search

Relation to Other Reasoning Methods

Method	Reasoning Structure	Backtracking	Cost
Zero-shot	Single generation	No	1×
CoT	Linear chain	No	1.5×
Self-Consistency	k parallel chains → vote	No	k×
Reflexion	Sequential retry with critique	Limited	~3×
ToT	Tree search with evaluation	Yes	50–200×
RAP (MCTS)	Monte Carlo tree search	Yes	100–500×

LN-Yao-2023-TreeOfThoughts — source literature note
PN-CoT-FewShot-Prompting — CoT as baseline; ToT extends it
PN-COSTAR-SelfConsistency — Self-Consistency as cheaper alternative
PN-Reflexion-SelfCritique — sequential improvement without tree search
PN-ReAct-AgentPattern — action-observation loop (ToT can drive ReAct actions)

PUMA Vault

Explorador

Tree of Thoughts — Deliberate Multi-Path Reasoning for LLMs

PN: Tree of Thoughts — Deliberate Multi-Path Reasoning for LLMs

Motivation: The Limitations of Chain-of-Thought

The Tree Structure

Four Components

1. Thought Decomposition

2. Thought Generation

3. State Evaluation

4. Search Algorithm

Empirical Results (Yao et al., 2023)

Cost Analysis

PUMA Application

Where ToT Applies: Stage 3 — Backlog Prioritisation

Where ToT Does NOT Apply: Stage 1 — Issue Type Classification

Summary

Relation to Other Reasoning Methods

MOCs

Vista Gráfica

Tabla de Contenidos

Retroenlaces

PUMA Vault

Explorador

Tree of Thoughts — Deliberate Multi-Path Reasoning for LLMs

PN: Tree of Thoughts — Deliberate Multi-Path Reasoning for LLMs

Motivation: The Limitations of Chain-of-Thought

The Tree Structure

Four Components

1. Thought Decomposition

2. Thought Generation

3. State Evaluation

4. Search Algorithm

Empirical Results (Yao et al., 2023)

Cost Analysis

PUMA Application

Where ToT Applies: Stage 3 — Backlog Prioritisation

Where ToT Does NOT Apply: Stage 1 — Issue Type Classification

Summary

Relation to Other Reasoning Methods

Related Notes

MOCs

Vista Gráfica

Tabla de Contenidos

Retroenlaces