Tree-of-Thought: Deliberate Problem Solving
By the end of this section, you should be able to:
- Explain the structural difference between chain-of-thought and tree-of-thought and why it matters for backtracking problems
- Design thought decomposition, generation, and evaluation components for a given problem type
- Choose between BFS, DFS, and beam search based on the problem characteristics
- Identify when ToT is the appropriate tool vs. when simpler CoT is sufficient
The Limitation Tree-of-Thought Solves
Both standard chain-of-thought and self-consistency are linear in one important sense: once a reasoning step is written, the model cannot go back and revise it. Self-consistency helps by running multiple complete paths and voting on the result, but no individual run can backtrack mid-reasoning to try a different approach. When a problem requires exploring multiple branches — trying one approach, recognising it is a dead end, backtracking, and trying another — neither technique is sufficient.
Tree-of-Thought (Yao et al., 2023) addresses this by making reasoning tree-structured rather than linear. Instead of one chain of thoughts, the model maintains a tree where each node is a "thought" — a coherent reasoning step — and branches represent different possible continuations. Classical search algorithms (BFS, DFS, beam search) navigate the tree, and an evaluator function scores each node's promise before deciding where to explore next.
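In code, the control loop is small. The sketch below shows a depth-first variant with backtracking; `generate`, `evaluate`, and `is_solution` are placeholders for the problem-specific components described in the next section, and their names and signatures are illustrative assumptions, not a fixed API.

```python
# Minimal sketch of the ToT control loop (depth-first, with backtracking).
# `generate`, `evaluate`, and `is_solution` are problem-specific plug-ins;
# the names and signatures here are assumptions for illustration.

def tot_dfs(state, generate, evaluate, is_solution, max_depth, threshold=0):
    """Return a path of thoughts from `state` to a solution, or None."""
    if is_solution(state):
        return [state]
    if max_depth == 0:
        return None  # depth budget exhausted
    # Expand the current node and visit the most promising branches first.
    for thought in sorted(generate(state), key=evaluate, reverse=True):
        if evaluate(thought) < threshold:
            continue  # prune branches the evaluator considers hopeless
        path = tot_dfs(thought, generate, evaluate, is_solution,
                       max_depth - 1, threshold)
        if path is not None:
            return [state] + path
    return None  # dead end: the caller backtracks to the next sibling
```

The `return None` at the bottom is the backtracking step that linear CoT cannot express: a dead end simply hands control back to the parent node, which tries its next-best branch.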
The Performance Gap
The original Tree-of-Thought paper used the "Game of 24" as a benchmark: given four numbers, combine them with basic arithmetic operations to reach exactly 24. This is a search problem — there are many combinations to try, and most are dead ends. With GPT-4 and standard chain-of-thought prompting, the success rate on this task was 4%. With Tree-of-Thought, it rose to 74%. The same model, the same knowledge, a different problem-solving structure.
This result illustrates when ToT is the right tool: problems with a combinatorial search space, where many paths exist and most are dead ends, and where a correct solution requires finding one of the few valid paths through the space.
The Three Components
A ToT implementation requires three design decisions:
1. Thought decomposition: Break the problem into steps where each step produces a "thought" — a partial solution or reasoning fragment. For the Game of 24, a thought might be "4 + 6 = 10, leaving [10, 6, 8]." For a writing task, a thought might be a paragraph outline or a key argument. The granularity of each thought matters: too fine and the tree becomes unwieldy; too coarse and there is not enough branching to help.
2. Thought generation: At each node, generate multiple candidate next thoughts (branches). You can do this with a single prompt that asks for N options, or with N separate sampled calls; separate calls tend to produce more diverse candidates at a higher token cost. A typical branching factor is 3–5 candidates per node.
3. State evaluation: This is the most critical and least automatic component. You need a function that scores each thought's promise before committing to exploring it further. The evaluation can be:
- Model-generated: Ask the model to rate each thought: "On a scale of 1–10, how promising is this partial solution? Explain your reasoning."
- Rule-based: For constraint-satisfaction problems, check how many constraints each partial solution satisfies.
- Heuristic: For game-like problems, use a domain heuristic (e.g., closeness to target value in numerical puzzles).
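A model-generated evaluator is mostly a prompt plus careful parsing. The sketch below assumes a `call_model` callable (a stand-in for whatever LLM client you use, taking a prompt string and returning the model's text reply); that interface is an assumption for illustration, not a real library API.

```python
import re

def rate_thought(thought, call_model):
    """Ask the model to score a partial solution; return an int in 1-10.

    `call_model` is a stand-in for your LLM client: it takes a prompt
    string and returns the model's text reply (an assumed interface).
    """
    prompt = (
        "On a scale of 1-10, how promising is this partial solution? "
        "Reply with the number first, then a one-line justification.\n\n"
        f"Partial solution: {thought}"
    )
    reply = call_model(prompt)
    match = re.search(r"\d+", reply)
    if not match:
        return 1  # unparseable reply: treat as least promising
    return max(1, min(10, int(match.group())))  # clamp to the 1-10 scale
```

The fallback score and the clamp keep the surrounding search loop robust when the model replies with prose instead of a clean number — a common failure mode that otherwise crashes or skews the search.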
Search Strategies
Once you have thoughts and evaluations, you need a search strategy:
- Breadth-first search (BFS): Explore all thoughts at depth N before going to depth N+1. Good when the solution is likely at a shallow depth, or when you want to find the best solution at a given level before going deeper. Computationally expensive for deep trees.
- Depth-first search (DFS): Follow the most promising path to completion before backtracking. More token-efficient for deep trees. Risk of getting stuck in a locally promising but globally suboptimal branch.
- Beam search: Keep only the top-K most promising thoughts at each level, discarding the rest. A good middle ground — beam width of 3–5 usually captures the best paths without the full cost of BFS.
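Beam search is concrete enough to run end-to-end on the Game of 24 itself, using the rule-based "closeness to 24" heuristic in place of a model evaluator. This is a sketch, not a solver: the greedy heuristic misses instances whose solutions pass through intermediate values far from 24, which is exactly where a better evaluator earns its keep.

```python
from itertools import combinations

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b if b else None}

def successors(nums):
    """All states reachable by combining two numbers with one operation."""
    out = []
    for i, j in combinations(range(len(nums)), 2):
        rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
        for sym, fn in OPS.items():
            for x, y in ((nums[i], nums[j]), (nums[j], nums[i])):
                v = fn(x, y)
                if v is not None:  # skip division by zero
                    out.append((rest + [v], f"{x:g} {sym} {y:g} = {v:g}"))
    return out

def score(nums):
    """Heuristic evaluator: distance of the best number from 24 (higher = better)."""
    return -min(abs(n - 24) for n in nums)

def beam_search(start, width=5):
    """Keep only the `width` most promising states at each level."""
    beam = [(list(start), [])]          # (remaining numbers, moves so far)
    while len(beam[0][0]) > 1:          # each level removes one number
        candidates = [(nxt, moves + [m])
                      for nums, moves in beam
                      for nxt, m in successors(nums)]
        candidates.sort(key=lambda c: score(c[0]), reverse=True)
        beam = candidates[:width]       # discard everything below the cut
    for nums, moves in beam:
        if abs(nums[0] - 24) < 1e-6:
            return moves                # sequence of moves reaching 24
    return None                         # no surviving path reached 24
```

For example, `beam_search([1, 2, 3, 4])` returns a three-move solution. Swapping `score` for a model-generated evaluator like the one above turns this into the full ToT-with-beam-search pattern.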
A Practical ToT Prompt Template
For problems where you want the model to self-evaluate (without external tooling), this template implements a simplified ToT in a single complex prompt:
You are solving [problem] using a tree of thoughts. At each step, generate 3 possible next moves, evaluate each one on a scale of 1–5 for how likely it leads to a correct solution, and continue with the highest-rated option. If a path reaches a dead end, backtrack to the last branch point and try the second-best option.
Problem: [problem statement]
Step 1 — Generate 3 candidate approaches:
Approach A: ...
Approach B: ...
Approach C: ...
Evaluate each (1–5): A=[score], B=[score], C=[score]
Continuing with Approach [best]: ...
When to Use ToT vs. CoT
ToT is overkill for most prompting tasks. The overhead — multiple generations, evaluation steps, search management — is significant. Reserve it for problems that genuinely have branching solution spaces:
- Use ToT: Planning tasks with multiple valid sequences, constraint-satisfaction problems (scheduling, configuration), creative brainstorming where you want to explore divergent paths before converging, multi-step puzzles
- Use CoT instead: Sequential reasoning where each step follows directly from the last, mathematical calculations, explanation generation, tasks where backtracking is rarely needed
One practical note: GPT-4 and Claude Opus show the strongest performance on ToT tasks due to the quality of their thought generation and self-evaluation. Smaller models often produce poorly calibrated evaluations — they rate their own dead-end thoughts highly — which defeats the purpose of the search. If you are implementing ToT, use the strongest available model for evaluation even if you use a smaller model for generation.
Key Takeaways
- Tree-of-Thought makes reasoning tree-structured with branching and backtracking — standard CoT on Game of 24 achieved 4%; ToT achieved 74%
- Three required components: thought decomposition (granularity), thought generation (branching factor 3–5), and state evaluation (the hardest part)
- Beam search with width 3–5 is the pragmatic default — captures most gains without full BFS cost
- Reserve ToT for problems with combinatorial search spaces where many paths exist and most are dead ends
- Evaluator quality is the bottleneck — use the strongest available model for evaluation, even if generation uses a smaller model