ReAct: How giving LLMs the ability to think and act changed everything
Every agent system you interact with today runs some version of the same loop: think, act, observe, think again. ReAct is the paper that established why that loop works, and what breaks when you remove either half of it. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
The paper’s own framing is precise: reasoning and acting in language models had “primarily been studied as separate topics.” ReAct’s contribution was testing what happens when you combine them — not just in one domain, but across fundamentally different task types — and analyzing the failure modes of each approach carefully enough to explain why the combination wins.
The two approaches it combined
Chain-of-thought prompting taught models to reason step by step before answering. It works — but it reasons in a closed loop. No new information enters. The model uses what’s already in its weights. When it doesn’t know something it doesn’t stop — it generates whatever comes next. The result is confident reasoning from potentially wrong premises. Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Published the same year as ReAct. The two papers are in direct conversation.
Action-only approaches gave models tools to interact with the world — search engines, APIs, browsers. The information problem is addressed: the model can look things up. But without reasoning to guide those actions the model can’t synthesize what it finds, can’t track what it’s already tried, can’t maintain a plan across many steps. Nakano, R. et al. (2021). WebGPT: Browser-Assisted Question-Answering with Human Feedback. Fine-tuned GPT-3 to use a text-based browser — search, click, scroll — but without explicit reasoning traces between actions.
ReAct — the name is the idea: Reasoning + Acting, interleaved.
The ReAct loop
ReAct augments the model’s action space with language — a “thought” step that doesn’t affect the environment but updates the model’s context before the next action. In CoT, all reasoning happens up front from memory. In ReAct, each thought is informed by what the agent just observed through its actions.
At each timestep $t$, an agent takes an action $a_t \in \mathcal{A}$ conditioned on its context $c_t$ — the full history of observations and actions so far:

$$c_t = (o_1, a_1, \ldots, o_{t-1}, a_{t-1}, o_t)$$

ReAct augments the action space from $\mathcal{A}$ to $\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}$, where $\mathcal{L}$ is the space of natural language. A thought $\hat{a}_t \in \mathcal{L}$ differs from a regular action in one key way: it produces no observation from the environment. Instead it only updates the context:

$$c_{t+1} = (c_t, \hat{a}_t)$$

This means thoughts are free — they cost nothing externally — but they change what the model conditions on for every subsequent action. CoT has no $\mathcal{A}$. Act has no $\mathcal{L}$. ReAct is $\mathcal{A} \cup \mathcal{L}$.
The model generates a thought, takes an action, receives an observation, generates another thought informed by that observation. Repeat.
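That loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: `llm` and `env` are hypothetical stand-ins for a language-model call and a tool environment (e.g. a search API), and the `finish[...]` action format is borrowed from the paper’s Wikipedia setup.

```python
def react_loop(question, llm, env, max_steps=8):
    """Interleave thoughts, actions, and observations until the model
    emits a finish[...] action or the step budget runs out."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        # Thought: a language-only "action" -- no environment call.
        # It only extends the context the next step conditions on.
        thought = llm(context + "Thought:")
        context += f"Thought: {thought}\n"

        # Action: generated by the model, executed externally.
        action = llm(context + "Action:")
        context += f"Action: {action}\n"
        if action.startswith("finish["):
            return action[len("finish["):-1]

        # Observation: the environment's response enters the context,
        # grounding every subsequent thought.
        observation = env(action)
        context += f"Observation: {observation}\n"
    return None  # step budget exhausted without an answer
```

The key structural point is visible in the code: the thought branch never calls `env`, so it is externally free, yet both branches append to `context`, so thoughts shape everything that follows.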
What this changes is what the model knows at each step. CoT’s context is static — training knowledge only. ReAct’s context grows with every observation. Each verified fact from the environment replaces a potential assumption from memory. This is the key insight: it’s not just that ReAct can look things up. It’s that every observation displaces an assumption, making subsequent reasoning more reliable.
Step through the comparison below — this is a real HotpotQA example from the paper, the same knowledge benchmark that appears in the results. Watch the knowledge panel on each side.
CoT assumes the Apple Remote was designed for Apple TV — plausible, wrong — and answers in one step. ReAct’s first search returns the actual answer: it was designed for Front Row, a discontinued media center program. That observation changes every subsequent step. The model isn’t reasoning from memory anymore. It’s reasoning from what it found. This example is drawn directly from the paper’s HotpotQA evaluation. The trajectories are real model outputs, not hand-crafted demonstrations.
What they tested it on
The paper tests ReAct on two task types that make structurally different demands on an agent.
The first is a navigation problem — multi-hop question answering (HotpotQA) where the answer requires chaining information you can’t know in advance. Each search opens new branches. The challenge is knowing which thread to follow. You can’t plan the full path before you start.
The second is a state problem — long-horizon decision making in a simulated household environment (ALFWorld) where every action changes the world. The challenge isn’t finding information. It’s not losing track of what you’ve done and what’s left across 20, 30, 50 steps.
The widget below uses actual trajectories from the paper’s appendix for the ALFWorld state-tracking task. Both agents have the same goal: find a knife, clean it, place it on the countertop.
Act finds the knife, then tries to clean it without going to the sinkbasin first. Nothing happens. It has no record of why it failed or what comes next. It loops — back to the countertops, attempting to pick up knives that are no longer there.
ReAct’s thought 4 says verbatim: “Now I take a knife (1). Next, I need to go to sinkbasin (1) and clean it.” That sentence is in the context window. The model goes to the sinkbasin, cleans the knife, finishes in 7 actions. The loop isn’t a quirk. It’s what happens when a model has to reconstruct its current state by reading back through a long sequence of raw observations — and can’t.
Results
Decision making
The trajectory above is one example. The full picture is more striking. Across 134 unseen ALFWorld games, the chart below shows overall success rates and per-task-type breakdowns.
On WebShop — a simulated online store with 1.18 million products — ReAct hits 40% success versus 28.7% for the previous best approach (imitation learning + reinforcement learning).
The scale of the gap is worth sitting with. The previous best approaches required thousands of expert demonstrations and dedicated training pipelines. ReAct matches them with a couple of handwritten examples in a prompt. Shridhar, M. et al. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR 2021. BUTLER is the imitation learning agent introduced in this paper alongside the ALFWorld benchmark. It learns by watching thousands of expert demonstrations, then tries to replicate the behavior — the standard approach before LLM-based agents.
Knowledge tasks
On knowledge tasks the picture is more nuanced. ReAct does not cleanly beat CoT on its own — it wins on FEVER (60.9 vs 56.3) and loses on HotpotQA (27.4 vs 29.4). When either method does succeed, its answer is usually genuinely correct (CoT 86%, ReAct 94% of successes), but when they fail, they fail in opposite ways.
56% of CoT’s failures on HotpotQA are hallucinations — the model asserted facts it didn’t have. ReAct’s hallucination rate in failure cases? Zero. Every factual claim came from an observation it actually received. But ReAct’s constrained thought structure — having to fit reasoning into interleaved steps — produces more reasoning errors than CoT’s unconstrained chains. They fail in opposite directions. This is the core tradeoff: CoT has better reasoning flexibility but hallucinates. ReAct is grounded but constrained. Combining them recovers both strengths.
Which is why combining them recovers both. When ReAct fails to return an answer within its step budget, fall back to CoT-SC. When CoT-SC can’t reach a confident majority answer, fall back to ReAct. The combination reaches 35.1 on HotpotQA versus 29.4 for CoT alone. Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. CoT-SC samples multiple reasoning chains and takes the majority answer. More robust than a single chain, but still can’t verify facts against the world.
The bigger picture
Looking back from 2026, the lasting contribution isn’t any single benchmark number. It’s the pattern: give a language model the ability to think between actions, and it solves problems that neither reasoning nor acting can handle alone. Every agent framework built since — AutoGPT, LangChain agents, Claude’s tool use — runs some version of this loop.
ReAct didn’t just combine two techniques. It established the interface between language models and the world.
What it didn’t solve is what happens when the loop goes wrong.
Open question
ReAct works well in bounded environments — a Wikipedia API with three actions, a simulated store with a defined product space. The reasoning loop holds up. But scale the action space and a new problem emerges: when the agent goes wrong, there’s no recovery mechanism within the run. It keeps going, accumulating errors, with no way to step back and say “that approach isn’t working, let me try something different.” The loop is powerful but it’s memoryless across attempts.
Reflexion (Shinn et al., 2023) is the direct response to that. What if the agent could reflect on what went wrong in natural language, store that reflection, and use it to do better on the next attempt — without any retraining?