Chain-of-Thought: The prompting trick that unlocked reasoning in language models
Before models could act in the world, they had to learn to reason about it. This paper introduced chain-of-thought prompting — a technique that made that possible not by changing the model, but by changing what you put in the prompt. It’s a surprisingly simple idea, and it became the foundation for nearly everything that followed.

Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
The problem standard prompting couldn’t solve
In 2022, scaling language models had hit a ceiling on reasoning tasks. More parameters helped with language. They didn’t help with arithmetic, commonsense reasoning, or symbolic manipulation. The scaling curves were flat.

Before CoT, models were failing at questions a ten-year-old could answer: “Mike plays ping pong for 40 minutes. In the first 20 minutes he scores 4 points. In the second 20 minutes he scores 25% more. How many total?” Or: “A coin is heads up. Maybelle flips it. Shalonda doesn’t. Is it still heads up?” Not because they lacked knowledge — but because answering requires holding a chain of reasoning across multiple steps.
The standard approach — give the model a few input-output examples and ask it to continue the pattern — works well when the answer can be retrieved or pattern-matched. It fails when reaching the answer requires multiple steps of reasoning. The model sees the question and jumps directly to an answer. When the question is hard enough, that jump produces confident nonsense.
The cafeteria question (“The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?”) has a correct answer of 9. Standard prompting returns 27 — a plausible-looking number produced by pattern-matching on the figures without tracking what actually happened to the apples.
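To make the failure mode concrete, here is a minimal sketch of standard few-shot prompting, using the tennis-ball exemplar from the paper. The prompt-building helper and its name are illustrative; no particular model API is assumed.

```python
# Standard few-shot prompting: exemplars are bare (question, answer) pairs,
# so the model is conditioned to jump straight from question to answer.

EXEMPLARS = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "11"),
]

QUESTION = ("The cafeteria had 23 apples. If they used 20 to make lunch "
            "and bought 6 more, how many apples do they have?")

def build_standard_prompt(exemplars, question):
    """Concatenate (x, y) pairs, then leave the answer slot open."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_standard_prompt(EXEMPLARS, QUESTION)
```

The model completes the final `A:` with a single number, which is exactly the problem: there is nowhere for intermediate work to go.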
The model isn’t wrong because it lacks knowledge. It’s wrong because it has no scratch space — no way to hold intermediate results while working toward a final answer. Every token it generates is a commitment. Without room to reason, it compresses multi-step problems into a single intuitive leap. Sometimes that works. For anything requiring bookkeeping, it doesn’t.
Chain of thought: show the reasoning, not just the answer
Instead of showing the model question-answer pairs, you show it question-reasoning-answer triples. The model sees how to work through a problem step by step, and applies the same pattern to new questions.
Standard prompting shows the model (xi, yi) pairs as exemplars. Given a new x, it samples y ~ P(y | x) — jumping directly to an answer. CoT adds reasoning traces to the exemplars: (xi, ri, yi). Now given x, the model samples r, y ~ P(r, y | x) — reasoning before answering. Ordering is load-bearing: placing r after y collapses performance to baseline.
The key shift is what gets generated. Standard prompting produces a single token sequence that jumps to an answer. CoT prompting produces a longer sequence where intermediate conclusions become context for subsequent steps. The model isn’t smarter — it’s using its own output as working memory.
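The shift from pairs to triples can be sketched as a small change to the same prompt builder. The exemplar reasoning below is the paper’s tennis-ball chain; the helper name and exact formatting are illustrative assumptions.

```python
# Chain-of-thought prompting: each exemplar is a (question, reasoning, answer)
# triple, with the reasoning placed BEFORE the final answer, so the model
# samples r, y ~ P(r, y | x) instead of jumping straight to y.

COT_EXEMPLARS = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
     "6 tennis balls. 5 + 6 = 11.",
     "11"),
]

def build_cot_prompt(exemplars, question):
    """Format (x, r, y) triples; the open answer slot invites reasoning first."""
    parts = [f"Q: {q}\nA: {r} The answer is {a}." for q, r, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

cot_prompt = build_cot_prompt(
    COT_EXEMPLARS,
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?",
)
```

Nothing about the model changes between the two builders; only the exemplar format does, and with it what the model generates before committing to an answer.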
The examples were manually written by the authors and reused across all benchmarks within a task category. Robustness experiments confirmed that different annotators, different writing styles, and different exemplar sets all produce similar gains. What matters is that the reasoning is shown — not how it is written.

This is a key finding: CoT is robust to prompt variation. The mechanism doesn’t depend on specific wording or formatting — it depends on the presence of sequential reasoning steps.
Three hypotheses, three eliminations
The performance gain from CoT prompting requires an explanation. Three alternatives are tested and eliminated.

An ablation study removes or modifies one component of a system at a time to test whether it’s responsible for an observed effect. If removing it collapses performance, it was load-bearing. If nothing changes, it wasn’t.
Every ablation collapses back to roughly the standard baseline (~6%). Only the full chain of thought breaks out — the only condition that gives the model actual scratch space. Why each alternative fails:
Extra compute tokens are not the mechanism. Replacing reasoning steps with dots matched to the same token length produces no gain over baseline. The scratch space is there, but it’s blank — there’s nothing to reason with.
Externalizing the equation is not sufficient. Showing the mathematical equation without natural language reasoning helps on simple datasets where problems translate directly to equations. On GSM8K, it fails — the problems are semantically complex enough that identifying which equation to write requires the reasoning steps themselves.

GSM8K (Cobbe et al., 2021) is a benchmark of grade school math word problems (1,319 in the test set) requiring 2–8 steps of arithmetic. Example: “James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?” The paper’s headline benchmark — hardest of the five, and the one with the flattest standard-prompting scaling curve.
Knowledge activation is not sufficient. Placing the reasoning chain after the answer — so the model has already committed — collapses performance to baseline. If CoT helped merely by priming relevant knowledge, the order would not matter. It does.

This is the most important ablation. It proves that the model needs to reason before committing to an answer — not just have reasoning-shaped text nearby.
What survives: sequential natural language reasoning must precede the answer. The scratch space has to contain real work, and the model has to use it before committing. That is the mechanism.
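The conditions above can be sketched as prompt templates built from a single (x, r, y) exemplar. The function and variant names here are mine, the exact formatting the paper used may differ, and the filler is character-matched rather than token-matched — a simplification.

```python
# The ablation conditions as prompt templates over one (x, r, y) exemplar.
# Only the last preserves the mechanism: natural-language reasoning that
# precedes the answer.

def ablation_variants(question, reasoning, equation, answer):
    filler = "." * len(reasoning)  # length-matched filler (the paper matched token count)
    return {
        "standard":         f"Q: {question}\nA: {answer}",
        "dots_only":        f"Q: {question}\nA: {filler} {answer}",        # scratch space, but blank
        "equation_only":    f"Q: {question}\nA: {equation} = {answer}",    # math without reasoning
        "answer_first":     f"Q: {question}\nA: {answer}. {reasoning}",    # reasoning after commitment
        "chain_of_thought": f"Q: {question}\nA: {reasoning} The answer is {answer}.",
    }

variants = ablation_variants(
    "If you have 3 apples and eat 1, how many remain?",
    "Start with 3, eat 1, so 3 - 1 = 2.",
    "3 - 1",
    "2",
)
```

Comparing `answer_first` and `chain_of_thought` makes the ordering ablation visible: the two contain the same text, and only the position of the reasoning relative to the answer differs.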
Below 100B parameters, it backfired
There’s a catch. CoT only works if the model is large enough to use it. Below ~10B parameters, models generated fluent but incoherent reasoning chains and committed to wrong answers more confidently than if they had answered directly.

This was counterintuitive: showing a small model how to reason made it worse. The model generated text that looked like reasoning but wasn’t — and then trusted it.
The gains only appear reliably above ~100B parameters. On GSM8K, PaLM 540B jumps from 17.9% to 56.9% — surpassing a model explicitly fine-tuned for this benchmark, using 8 prompt examples and no weight updates.

Fine-tuned models are trained on thousands of labeled examples per task — their weights change to specialize. CoT prompting changes nothing in the model. The same weights, prompted differently, beat the specialist.
The paper interpreted this as an emergent ability — a capability that doesn’t improve gradually with scale but appears suddenly once a threshold is crossed. Think less like a volume dial being turned up and more like water being heated: nothing dramatic happens between 0°C and 99°C, then at 100°C it changes state entirely.

The paper uses “reasoning” throughout, but the authors are explicit: showing that a model produces better answers via intermediate steps doesn’t answer whether it’s actually reasoning in any meaningful sense. The chain of thought is an output — not a window into what the model is computing internally.
What we’ve learned since
The original CoT paper left a sharp question open: are the weights of a 10B model fundamentally incapable of reasoning, or just incapable of being prompted into it?
Subsequent work answered this decisively: it’s the latter. Small models can reason — you just can’t get there through prompting alone. Fine-tuning on reasoning traces, distillation from larger models, and curated training data all unlock multi-step reasoning well below the ~100B threshold.

The evidence accumulated quickly. Zelikman et al. (2022) showed models can bootstrap reasoning via iterative self-training (STaR). Mukherjee et al. (2023) fine-tuned a 13B model on GPT-4 explanation traces and matched much larger prompted models. DeepSeek-AI (2025) distilled reasoning into 1.5B–7B parameter models that perform multi-step reasoning the original paper said required 100B+.
This reframes the scale threshold. The capacity for reasoning doesn’t emerge at 100B parameters — it’s latent well below that. What scales is the model’s ability to activate that capacity from a few prompt examples alone, without any weight updates. CoT discovered something about the limits of prompting, not about reasoning itself.
The phase transition framing has also been challenged directly. Schaeffer et al. (2023) argued that many apparent emergent abilities are artifacts of discontinuous evaluation metrics like exact-match accuracy. Measured with continuous metrics, the scaling curves are smooth — no sudden jump, just gradual improvement that crosses a visibility threshold.

If the “phase transition” is partly a measurement artifact, the dramatic story — reasoning appears suddenly at scale — needs qualifying. The capability may build gradually, invisible to coarse metrics until it crosses a threshold of usefulness.
What remains genuinely open is the mechanistic question. The ablations ruled out extra compute, but later work found that even logically invalid chains help, as long as they contain the right bridge entities connecting question to answer. Wang et al. (2023) found that what matters in a chain is the presence of relevant intermediate entities — not logical validity. Chains with wrong reasoning but correct bridge objects still improve performance, suggesting CoT works partly by keeping relevant information in context, not only by performing step-by-step deduction.

So the deeper question is no longer “can small models reason?” — we know they can. It’s: what exactly changes during training that makes reasoning activatable through prompting? And why does that activation mechanism require scale when the underlying capacity does not? Those questions remain unanswered.
The bigger picture
Looking back from 2026, CoT’s contribution is easy to understate because the idea is so simple. But it established something fundamental: language isn’t just the medium for answers — it’s a substrate the model can reason through. Every technique that followed builds on that insight.
At sufficient scale, you don’t need to retrain a model to unlock reasoning. You just need to show it how to think. But CoT couldn’t solve the closed-world problem. The reasoning is powerful but sealed — no new information enters. When the model doesn’t know a fact, it invents one.
ReAct (Yao et al., 2022) is the direct response. It keeps CoT’s reasoning loop but opens it to the outside world: the model can pause mid-chain to search, look something up, or verify a fact, then fold what it finds back into its reasoning. Where CoT gave models scratch space, ReAct gave them hands.