Tree of Thoughts: A framework for deliberate problem solving (v2 draft)
Reflexion gave agents memory — the ability to learn from failure across attempts. But it could only learn backward: try, fail, reflect, try again. It never asked “what if I explored multiple paths before committing to one?” Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.
Tree of Thoughts does exactly that. Instead of generating one reasoning chain and hoping it’s right, ToT treats problem-solving as search — generating multiple candidate thoughts at each step, evaluating which ones are worth pursuing, and pruning the rest. The model doesn’t just think. It deliberates. Same research group again — Shunyu Yao and Karthik Narasimhan co-author ToT, ReAct, and Reflexion. The papers form a deliberate progression: CoT (reason) → ReAct (act) → Reflexion (learn from failure) → ToT (search before committing).
What Tree of Thoughts actually does
ToT frames every problem as search over a tree of states, where each state is the input plus all reasoning accumulated so far. At each step, a generator proposes candidate next thoughts, and an evaluator — the model itself — scores how promising each candidate is. Then a search algorithm decides which branches to explore and which to prune. The framing draws on cognitive science — Daniel Kahneman’s “System 1” (fast, associative) vs “System 2” (slow, deliberate) from Thinking, Fast and Slow (2011). Standard LLM decoding is System 1. ToT augments it with System 2 deliberation: generate options, evaluate them, choose strategically.
What makes the framework general is that none of these pieces are fixed. What counts as a “thought,” how the evaluator scores states, and which search algorithm navigates the tree — all of it adapts to the problem. The paper demonstrates this flexibility across three very different tasks: Game of 24 (BFS + value scoring), Mini Crosswords (DFS + backtracking), and Creative Writing (BFS + voting). Each task teaches a different piece of the framework.
Game of 24: BFS and value scoring
The Game of 24 is a mathematical puzzle: use four numbers and basic arithmetic to reach 24. Given the input 4 9 10 13, one solution is (10 − 4) × (13 − 9) = 24. It seems simple, but it’s a useful testbed: there’s a clear success criterion, the search space is large enough to be challenging, and partial progress can be evaluated independently. The paper tests on 100 hard games (indices 901–1000 from a set of 1,362 sorted by human solve time). A solution counts only if it’s a valid equation that exactly equals 24 using all four input numbers exactly once.
ToT solves Game of 24 with BFS. Each thought is one equation step — reducing four numbers to three, then to two, then to one. At each level of the tree, the generator proposes all promising next equations, and the evaluator classifies each resulting state as “sure,” “likely,” or “impossible.” BFS keeps the five most promising states per level and discards the rest.
The generator uses a propose strategy: it generates all candidate next steps in a single prompt, rather than sampling independently. This works well when thoughts are constrained and enumerable — the set of valid one-step equations from four numbers is finite and small enough to propose exhaustively.
The propose prompt is one-shot: a single worked example followed by the current input. All candidates are generated in one pass, which avoids duplicates.
The evaluator uses a value strategy: it scores each state independently by asking “can these remaining numbers reach 24?” It samples three times per state and aggregates (sure = 20, likely = 1, impossible = 0.001). A state scored “sure” by all three samples is overwhelmingly preferred over one that gets mixed signals. The scoring weights are hand-tuned. The paper doesn’t explore sensitivity — whether a different weighting (say, sure = 10, likely = 2) would meaningfully change results is left as an implicit assumption.
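The aggregation step fits in a few lines. This is a minimal sketch: the weights are the hand-tuned values above, and `classify` is a stand-in for the actual LLM evaluator call, which returns one of the three labels.

```python
# Hand-tuned aggregation weights from the paper's Game of 24 setup
WEIGHTS = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}

def value(state, classify, n_samples=3):
    """Score a state by sampling the evaluator n times and summing weights.

    `classify` stands in for an LLM call that answers "can these remaining
    numbers reach 24?" with "sure", "likely", or "impossible".
    """
    return sum(WEIGHTS[classify(state)] for _ in range(n_samples))
```

Three “sure” votes score 60.0; three “impossible” votes score 0.003, effectively pruning the branch.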
Example: with two numbers left, the evaluator spots 4 × 6 = 24 immediately. Three “sure” votes give that state the highest possible score, so its branch is prioritized.
The results are not marginal. CoT manages 4%. CoT with self-consistency (100 independent samples, majority vote) reaches 9%. ToT with b=5 achieves 74%. The visualization below puts CoT-SC and ToT side by side on the same input — the point isn’t that ToT samples more, it’s that it samples selectively. CoT-SC explores multiple paths but blindly — no evaluation, no pruning, no backtracking. ToT uses fewer completion tokens than CoT-SC (k=100), though it costs more overall because each search step re-sends the full context. Prompt tokens accumulate with every generator and evaluator call.
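The whole Game of 24 loop can be sketched compactly. Two assumptions replace the prompt calls here: the generator enumerates one-step reductions exactly (which is what the propose prompt asks the model to produce), and an exhaustive reachability check stands in for the LLM evaluator’s “sure / impossible” judgment. This is not the paper’s code, just the search skeleton.

```python
from fractions import Fraction

def propose(state):
    """Generator: every one-step reduction (pick two numbers, apply an op)."""
    out = []
    for i in range(len(state)):
        for j in range(len(state)):
            if i == j:
                continue
            a, b = state[i], state[j]
            rest = [state[k] for k in range(len(state)) if k not in (i, j)]
            results = [a + b, a * b, a - b] + ([a / b] if b != 0 else [])
            out.extend(rest + [v] for v in results)
    return out

def sure(state):
    """Evaluator stand-in: exact check that 24 is still reachable.
    (In ToT this is an LLM judgment, not an exhaustive search.)"""
    if len(state) == 1:
        return state[0] == 24
    return any(sure(s) for s in propose(state))

def solve_24(numbers, b=5):
    """BFS: keep at most b promising ("sure") states per level."""
    frontier = [[Fraction(n) for n in numbers]]
    for _ in range(len(numbers) - 1):
        frontier = [s for st in frontier for s in propose(st) if sure(s)][:b]
    return bool(frontier)  # any surviving single-number state equals 24
```

Fractions keep division exact, so 24 is matched precisely rather than within floating-point error.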
Mini Crosswords: DFS and backtracking
Mini Crosswords requires a completely different search shape. Each game is a 5×5 grid with ten clues (5 across, 5 down), and each thought is one word to fill in. With 5–10 thought steps per game, the tree is too deep for BFS — keeping five candidates at each of ten levels would require exploring millions of branches. DFS is the right algorithm here: commit to the most promising word, go deeper, and backtrack when the evaluator catches a conflict.
The key insight is why backtracking is necessary at all. Filling in one row constrains every column it crosses. A word that looks plausible in isolation may make a later clue impossible to satisfy. The evaluator catches these conflicts by asking “given the current letter constraints, can this remaining clue still be solved?” — and DFS uses that signal to backtrack rather than continue down a dead end. DFS with backtracking is what makes Crosswords work at all. Without it, word accuracy drops from 60% to 20%. The model can’t see 5 steps ahead, so it inevitably picks words that conflict with later clues. Backtracking lets it recover from those mistakes — the same principle as Reflexion, but within a single episode rather than across trials.
The ablation structure is revealing. ToT with backtracking achieves 60% word accuracy and solves 4 of 20 full games. Remove backtracking and accuracy collapses to 20%. Remove pruning (keep all candidates without eliminating “impossible” states) and it drops to 41.5%. Recovery from mistakes — not merely generating more candidates — is the mechanism doing most of the work. CoT achieves 15.6% word accuracy on this task, and solves none of the 20 full games. The 44-point gap (15.6% to 60%) is driven almost entirely by the ability to backtrack when a word creates unsolvable constraints downstream.
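The DFS skeleton itself is small; what matters is where the evaluator sits. This sketch shrinks the problem to a toy two-word grid (one across word, one down word sharing a first letter). The word lists and the `feasible` check are illustrative stand-ins for the LLM’s “can this clue still be solved?” judgment, not the paper’s 5×5 setup.

```python
ACROSS = ["zeal", "cart", "dome"]   # toy candidate across words
DOWN = ["cone", "drip"]             # toy candidate down words

def feasible(across):
    """Evaluator stand-in: with this across word fixed, can the
    crossing down clue still be solved at all?"""
    return any(d[0] == across[0] for d in DOWN)

def dfs(filled=()):
    """Commit to the next word, go deeper, backtrack on dead ends."""
    if len(filled) == 2:              # both slots filled: solved
        return filled
    if not filled:                    # fill the across slot first
        for a in ACROSS:
            if not feasible(a):       # "impossible" state: prune it
                continue
            solution = dfs((a,))
            if solution is not None:
                return solution       # a deeper branch succeeded
            # otherwise backtrack and try the next across word
    else:                             # fill the crossing down slot
        for d in DOWN:
            if d[0] == filled[0][0]:  # respect the shared letter
                return filled + (d,)
    return None                       # nothing fits: caller backtracks
```

Here “zeal” looks fine in isolation but no down word can cross it, so the evaluator prunes it before the search wastes depth on it — the same mechanism that lifts word accuracy from 20% to 60% in the paper.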
Creative Writing: BFS with voting
Creative Writing is the surprising case in the paper. The input is 4 random sentences; the output must be a coherent 4-paragraph passage ending in each of those sentences. Unlike Game of 24, there’s no equation to check and no constraint to violate. There’s no obvious way to decompose “write a coherent essay” into a multi-step chain.
ToT reframes the problem entirely. A “thought” here isn’t a sentence or a paragraph — it’s an entire writing plan. The tree has depth 2: generate 5 plans, vote on the best one, then generate 5 passages from the winning plan, vote again. The “tree” is more of a tournament bracket than a deep search. This is the only task in the paper where the evaluator uses a voting strategy rather than value scoring. The voting prompt shows all 5 plans together and asks the model to pick the best one. Coherence is easier to judge comparatively (“which plan is better?”) than absolutely (“rate this plan 1–10”). The paper calls this “step-wise self-consistency.”
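A voting evaluator is just a majority vote over comparative judgments. In this sketch, `cast_vote` stands in for an LLM call that sees all candidates side by side and returns the index of the one it prefers.

```python
from collections import Counter

def vote_best(candidates, cast_vote, n_votes=5):
    """Show all candidates together, sample n comparative votes,
    and return the majority winner.

    `cast_vote` stands in for an LLM call that returns the index
    of the preferred candidate.
    """
    votes = [cast_vote(candidates) for _ in range(n_votes)]
    winner, _ = Counter(votes).most_common(1)[0]
    return candidates[winner]
```

With five plans and five votes, one noisy vote can’t flip the outcome — the comparative framing does the work the absolute “1–10” rating can’t.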
The results are real but modest. ToT scores 7.56/10 on GPT-4 coherence ratings versus 6.93 for CoT and 6.19 for IO. In blind human comparisons, readers preferred ToT over CoT in 41 of 100 pairs versus 21 that preferred CoT. The gap is smaller here than in the other tasks — and that’s not coincidental. When evaluation is subjective, the evaluator loses leverage. The “sure / maybe / impossible” clarity that makes Game of 24 work dissolves into “somewhat better / about the same.” This pattern — smaller gains when evaluation is more subjective — seeds the open question at the end of the post. The evaluator is the most novel and most fragile part of ToT. Where it can’t score confidently, ToT’s advantage over brute-force sampling shrinks.
The formal progression: IO to ToT
Having seen ToT in three concrete settings, the formal notation becomes a satisfying zoom-out rather than a wall of symbols. The paper formalizes each prompting method as a sampling distribution over the same model — and shows that IO, CoT, and CoT-SC are all special cases of ToT: trees of limited depth and breadth, with no evaluation.
IO prompting — sample one output directly: y ∼ p_θ(y | x).
Chain-of-thought — sample a reasoning chain, then the answer: z_i ∼ p_θ(z_i | x, z_1, …, z_{i−1}) for i = 1 … n, then y ∼ p_θ(y | x, z_1, …, z_n).
CoT-SC — sample k chains independently, then take the majority-vote answer: y = argmax_y #{ i : y^(i) = y }.
Tree of Thoughts — search over a tree of states with generation + evaluation + pruning. A state s = [x, z_1, …, z_i] is the input plus thoughts so far. At each step, a generator G(p_θ, s, k) proposes k candidate thoughts from the current state, and an evaluator V(p_θ, S) scores each state in the frontier S.
BFS keeps the b best states per level (b = 5 in Game of 24). DFS commits to the most promising state and goes deep, backtracking when the evaluator scores a state below a threshold.
The score aggregation scale — how “sure / likely / impossible” votes get converted into a numerical ranking — is visualized below. The asymmetry is intentional: a single “impossible” rating drags the aggregate score toward zero, making the evaluator conservative by design.
The cost of deliberation
Search is not free. ToT uses roughly 5–100× more tokens than CoT depending on the task. On Game of 24, that’s ~$0.74 per task versus ~$0.003 for a single CoT chain. The paper acknowledges this but frames it carefully: ToT uses comparable tokens to CoT-SC with 100 samples (~$0.47), yet dramatically outperforms it. The compute isn’t wasted — it’s spent more intelligently. The paper reports average token counts per task (Table 1) but not dollar costs. We derived costs using 2023 GPT-4 8K pricing ($0.03/1K prompt tokens, $0.06/1K completion tokens). ToT spends more on prompt tokens than CoT-SC despite using fewer total completion tokens, because each search step re-sends the full context — prompt tokens accumulate with every generator and evaluator call.
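The cost derivation is plain arithmetic. The rates below are the 2023 GPT-4 8K prices stated above; the token counts in the example are an illustrative split, not the paper’s exact Table 1 numbers.

```python
# 2023 GPT-4 8K pricing: $0.03 / 1K prompt tokens, $0.06 / 1K completion tokens
PROMPT_RATE = 0.03 / 1000
COMPLETION_RATE = 0.06 / 1000

def run_cost(prompt_tokens, completion_tokens):
    """Dollar cost of one task given total tokens sent and generated."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE
```

For instance, a hypothetical run that accumulates 20K prompt tokens while generating 2.3K completion tokens comes to about $0.74 — prompt tokens dominate even though they cost half as much per token, because every generator and evaluator call re-sends the context.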
Even with an oracle that picks the single correct chain from 100 CoT samples, you only reach 49% — and at similar token cost to ToT. The oracle result (49%) is an upper bound, not an achievable method. It asks: even if you could magically identify the correct CoT chain among 100 samples after the fact, how often does the right answer appear at all? The answer is 49% — which means brute-force sampling hits a ceiling that guided search does not. ToT b=5 reaches 74% with no oracle needed. The question isn’t whether search costs more. It’s whether the problems you’re solving are worth spending tokens on.
The bigger picture
Looking back from 2026, ToT’s most important contribution isn’t the search algorithm — BFS and DFS are textbook techniques. It’s the demonstration that a single language model can play multiple functional roles simultaneously: generator, evaluator, and final answer producer. Once you externalize those roles into separate agents that communicate, you’re essentially building a multi-agent system. ToT is the single-agent precursor to that decomposition. The connection to inference-time compute is direct. OpenAI’s o1 and o3 models use a similar principle — spending more compute at inference time to improve reasoning — but shift the search from explicit prompt-level orchestration (ToT’s approach) to implicit search trained via reinforcement learning. The insight that “more thinking time = better answers” is the same; the implementation moved inside the model.
ToT showed that a single model can generate, evaluate, and search over its own reasoning. That pattern — one system playing multiple deliberative roles — became a key precursor to the multi-agent architectures that followed.
The arc from CoT to ReAct to Reflexion to ToT traces a progression of capabilities: scratch space → action → memory → search. Each paper adds one dimension of problem-solving. And each one surfaces the same underlying question: how much can you trust the model to evaluate its own reasoning?
Open question
The results are impressive — but they share a pattern. Every task has a clear success criterion: does the equation equal 24? Does the word fit the crossword constraints? Even Creative Writing, the most subjective task, gets evaluated through side-by-side comparison where “better” is relatively judgeable. The paper never ventures into territory where the evaluator has no ground to stand on — tasks where “good” is genuinely ambiguous, where thoughts resist clean decomposition, or where the search space is too open-ended for scoring to mean anything. Those are exactly the conditions that broke Reflexion on WebShop. The evaluator is the most novel and most fragile part of ToT. On Creative Writing — the closest thing to an open-ended task in the paper — the gap between ToT and baselines is smallest. That’s not a coincidence: when “good” becomes subjective, the evaluator loses its leverage.
That blind spot matters more in hindsight. The GPT-3.5 results buried in the appendix show that ToT’s evaluator needs frontier-model capability to work at all — ToT with GPT-3.5 on Game of 24 scores 4%, no better than single-chain CoT with GPT-4. If the evaluator requires that level of capability, the question becomes whether the model can simply internalize the search process itself. OpenAI’s o1 and o3 suggest it can — spending more inference-time compute on internal deliberation rather than external orchestration. If that’s the direction, ToT becomes a historical proof of concept: it showed search works, and then the search moved inside the model. If explicit, auditable search retains value — for interpretability, for debugging, for domains where you need to see the reasoning — it becomes the foundation.
With ToT, the foundational arc is complete: scratch space (CoT), action (ReAct), memory (Reflexion), search (ToT). What comes next is decomposition — splitting these roles across multiple agents that coordinate, debate, and build on each other’s reasoning.