Reflexion: What happens when an agent can learn from its own mistakes
ReAct gave agents a loop: think, act, observe, repeat. It works well. But it has a hard limit — when the agent goes wrong, there’s no recovery mechanism. The loop just continues, accumulating errors, with no way to say “that approach isn’t working, let me try something different.” Each run starts fresh.
Reflexion asks what happens if you add one thing to that picture: after a failed attempt, let the agent write down what went wrong in natural language, store it, and read it back before the next try. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
No weight updates. No fine-tuning. No gradient descent. Just a paragraph of text prepended to the next prompt.
What Reflexion actually does
Reflexion is essentially a wrapper that sits on top of an existing Actor — ReAct, CoT, or anything else. It adds two components: an Evaluator that scores how well the actor performed, and a Self-Reflection model that, on failure, writes a natural language summary of what went wrong. That summary gets stored in memory and passed back to the actor as additional context in the next trial. Both CoT and ReAct appear as base actors in the Reflexion experiments. Karthik Narasimhan co-authors both ReAct and Reflexion — and Shunyu Yao, ReAct’s first author, is on the Reflexion paper too. The two papers are in direct conversation.
Step through the widget above. The key moment is the middle tab — watch what the evaluator produces, how the self-reflection model turns that into language, and then how that language shows up in trial 2’s prompt as a memory block. The mechanism is almost embarrassingly simple. The actor reads its own post-mortem before trying again.
Reflexion parameterizes the agent’s policy as π_θ(a_t | s_t), where θ = {M_a, mem} — the actor model M_a plus its memory. Unlike standard RL, the model weights never change. Learning happens entirely through mem.
After each trial t, the evaluator scores the trajectory τ_t, producing a reward r_t. On failure, the self-reflection model M_sr generates a verbal summary:
sr_t = M_sr(τ_t, r_t, mem)
This gets appended to memory, subject to a maximum capacity Ω (typically 1–3 experiences):
mem ← mem + [sr_t], truncated so that |mem| ≤ Ω
How memory accumulates across trials
Each failed trial adds one reflection to memory. Each successful trial stops the loop. The agent walks into every new attempt carrying its own history of mistakes.
The Ω bound matters. It’s not just an engineering detail — it’s the reason Reflexion can’t keep accumulating learning indefinitely. At some point old reflections get dropped to make room for new ones. The agent has a sliding window of experience, not an unlimited archive. The sliding window is typically Ω = 1–3. ALFWorld and HotpotQA use 3 experiences; programming uses just 1. This is driven by context window limits, not by design choice — the authors note that more reflections would be better if the model could handle them. They suggest vector databases as a future alternative.
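The whole trial loop is small enough to write out. Here is a minimal sketch, assuming hypothetical callables `actor`, `evaluator`, and `self_reflect` standing in for M_a, the evaluator, and M_sr (none of these names come from the paper); a `deque` with `maxlen` plays the role of the Ω bound:

```python
from collections import deque

OMEGA = 3  # the paper's Ω: 3 stored reflections for ALFWorld/HotpotQA, 1 for programming

def reflexion_loop(actor, evaluator, self_reflect, task, max_trials=12):
    """One Reflexion run: retry a task, carrying verbal reflections across trials.

    `actor`, `evaluator`, and `self_reflect` are stand-in callables for the
    actor, evaluator, and self-reflection models. No weights change anywhere;
    the only thing that updates between trials is `memory`.
    """
    # Bounded memory: once Ω reflections are stored, the oldest falls off.
    memory = deque(maxlen=OMEGA)
    for trial in range(max_trials):
        # The actor is conditioned on the task AND its past post-mortems.
        trajectory = actor(task, list(memory))
        reward, success = evaluator(task, trajectory)
        if success:
            return trajectory, trial + 1
        # On failure, write a natural-language summary and store it.
        reflection = self_reflect(task, trajectory, reward, list(memory))
        memory.append(reflection)
    return None, max_trials
```

With stub models, a failure on trial 1 produces a reflection, the actor reads it, and trial 2 succeeds — which is exactly the shape of the widget's three tabs.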
Three benchmarks, three kinds of feedback
The paper tests Reflexion on three structurally different problems. The first two — ALFWorld and HotpotQA — are the same benchmarks from ReAct, which lets you see directly how much reflection adds on top of the base loop. The third — programming — is new, and it’s the most interesting one because it gives you something the other two don’t: grounded feedback. ALFWorld (Shridhar et al., 2021) is a text-based household task simulator — 134 environments across 6 task types like finding hidden objects, moving objects between rooms, and manipulating objects with other objects. The agent interacts through natural language commands.
On decision making, the improvement is striking. ReAct alone starts at ~62% on ALFWorld and plateaus around 77% by trial 6 — with a hallucination rate of 22% that shows no signs of recovery. ReAct + Reflexion starts at the same point but keeps climbing, completing 130 out of 134 tasks (~97%) by trial 12. The ALFWorld and HotpotQA experiments use GPT-3 as the base model. Programming uses GPT-4. This matters for interpreting the numbers — the 91% HumanEval result reflects a stronger base model, not just a better reflection mechanism.
On knowledge tasks, both ReAct-only and CoT-only sit flat at roughly 33–40% across all HotpotQA trials with no meaningful improvement. Neither can reliably recover from its own failures. ReAct + Reflexion reaches ~55–60% by trial 5 — a 20-point gain from the same starting position. HotpotQA (Yang et al., 2018) is a multi-hop Wikipedia question-answering dataset with 113k questions. It requires chaining facts across multiple documents. The evaluator uses exact-match grading — the agent’s answer string must match the ground truth exactly.
Programming is where the numbers are most dramatic. Reflexion achieves 91% pass@1 on HumanEval versus GPT-4’s 80%. HumanEval (Chen et al., 2021) is a benchmark of 164 hand-written Python programming problems with unit tests. Pass@1 measures whether the model’s first attempt passes all tests — no retries, no sampling.
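For reference, pass@1 is the k = 1 case of the unbiased pass@k estimator defined in the HumanEval paper (Chen et al., 2021): with n samples per problem and c of them correct, pass@k = 1 − C(n−c, k)/C(n, k). A direct translation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples drawn per problem, c = samples that pass all tests."""
    if n - c < k:
        # Fewer failures than k: every size-k draw contains a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c/n, which is why single-attempt grading and the estimator agree.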
The evaluator in programming is qualitatively different from the others. Instead of an external binary signal, the agent generates its own unit tests using chain-of-thought prompting, filters them for syntactic validity, then executes them. It reflects on the specific test failures — compiler errors, assertion mismatches, stack traces. That’s a level of diagnostic precision that ALFWorld’s binary “done or not done” signal can’t match. Self-generated tests have a catch. On MBPP — another programming benchmark with 374 problems — Reflexion actually underperforms GPT-4 (77.1% vs 80.1%). The culprit: a 16.3% false positive rate in the self-generated test suite, versus just 1.4% on HumanEval. When the agent’s own tests are unreliable, the reflection signal is too.
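The test-generation step is a model call, but the filter-and-execute machinery around it is plain Python. A rough sketch under my own naming (nothing here is verbatim from the paper), with `exec` shown unsandboxed purely for illustration — a real harness would isolate model-generated code:

```python
def filter_valid_tests(tests):
    """Keep only test snippets that parse as Python: the syntactic-validity filter."""
    valid = []
    for snippet in tests:
        try:
            compile(snippet, "<generated-test>", "exec")
            valid.append(snippet)
        except SyntaxError:
            pass
    return valid

def run_tests(candidate_src, tests):
    """Execute the candidate, then each test against it; collect failure
    messages -- the concrete signal the self-reflection model gets to read."""
    namespace = {}
    exec(candidate_src, namespace)  # illustration only; sandbox in practice
    failures = []
    for snippet in tests:
        try:
            exec(snippet, namespace)
        except AssertionError:
            failures.append(f"assertion failed: {snippet!r}")
        except Exception as e:
            failures.append(f"{type(e).__name__}: {e} (in {snippet!r})")
    return failures
```

Feed in a buggy candidate plus one malformed test, and you get back a filtered suite and a specific assertion-failure message — exactly the kind of grounded diagnostic the reflection step can act on, and exactly what WebShop’s binary signal never provides.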
That groundedness is load-bearing. The self-reflection model can only write useful guidance if the failure signal gives it something concrete to work with. Which brings us to where Reflexion breaks.
Where it works and where it doesn’t
The results above share something: all three tasks have precise, diagnosable failure signals. ALFWorld tells you exactly when you’re stuck. HotpotQA grades by exact match. Programming gives you compiler errors and unit test output. The reflection mechanism is general — the same loop runs in all three cases. What isn’t general is the evaluation step, and that distinction matters more than it might seem.
The widget above makes it concrete. ALFWorld produces a reflection like: “I tried to clean the knife without first going to the sinkbasin. In the next trial I will go to sinkbasin 1 before cleaning.” That’s a specific instruction the agent can follow literally — because the failure was specific enough to diagnose.
WebShop produces: “I should search more carefully and check all attributes before buying.” That’s a platitude. The agent already knew that. There’s nothing actionable in it because the failure — buying the wrong product out of 1.18 million — doesn’t point to a specific correction. The authors ran four trials, saw no improvement, and stopped. WebShop (Yao et al., 2022) is an online shopping environment with 1.18 million real products. The agent must find and buy a product matching a natural language description. The same Shunyu Yao who created ReAct co-authored WebShop — and the same Karthik Narasimhan advises both.
The paper’s own diagnosis is precise: Reflexion “struggles to overcome local minima choices that require extremely creative behavior to escape.” In ALFWorld, the permissible actions are visible in the observations — the search space is constrained and navigable through systematic error correction. WebShop requires generating novel search queries to find exactly the right product, and when a search fails, the agent can’t generate meaningfully different strategies. It gets stuck in a local minimum that verbal reflection can’t escape.
The hand-written heuristics for ALFWorld are a related tell. The LLM couldn’t reliably know when it was stuck, so a human wrote the rule: if the same action repeats 3 times, trigger reflection. If actions exceed 30 steps, trigger reflection. These heuristics mean Reflexion requires task-specific evaluation design on top of task-specific prompt design. Defining failure precisely enough to generate useful reflection is a harder problem than it looks — and it’s a problem the agent can’t solve for itself. That’s not a generalizable solution. It means the evaluation component — the part that decides when to reflect and what went wrong — is partially human-designed, not learned.
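Written out, the stuck-detector is a few lines. A sketch using the paper's thresholds (the function name and signature are mine):

```python
def should_reflect(actions, repeat_limit=3, step_limit=30):
    """Hand-written ALFWorld-style heuristics: trigger reflection when the
    agent repeats the same action `repeat_limit` times in a row, or the
    trajectory exceeds `step_limit` steps."""
    if len(actions) > step_limit:
        return True
    # All of the last `repeat_limit` actions identical -> the agent is looping.
    if len(actions) >= repeat_limit and len(set(actions[-repeat_limit:])) == 1:
        return True
    return False
```

The point isn't the code but what it implies: a human chose 3 and 30, and the agent has no way to revise them.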
The bigger picture
Looking back from 2026, Reflexion’s lasting contribution is the proof of concept: natural language can serve as a learning signal. Not gradients, not reward shaping, not fine-tuning — just a paragraph of text that says “here’s what went wrong last time.” The model weights never change, and yet the agent genuinely improves across trials. That’s a remarkable result, even if the conditions under which it works are narrower than you might hope. The idea of iterative verbal self-improvement was in the air. Self-Refine (Madaan et al., 2023) appeared the same month as Reflexion, using a similar loop: generate, get feedback, refine. Where Reflexion stores reflections across trials for multi-attempt tasks, Self-Refine iterates within a single generation. Both proved that natural language feedback can substitute for gradient updates.
Put plainly: Reflexion’s contribution is the learning signal, not the retry loop; its limitation is that the signal only exists when failure speaks clearly enough to learn from.
The WebShop failure is the most honest result in the paper — and the most important one for understanding where this paradigm ends. No spin, no explanation for why the metric might be misleading. It’s a clean admission that the mechanism breaks down in open-ended environments, and it tells you more about Reflexion’s real limits than any of the benchmarks it succeeds on.
The arc from CoT to ReAct to Reflexion traces a clear line. CoT gave models scratch space — the ability to reason before committing. ReAct gave them hands — the ability to act on the world mid-reasoning. Reflexion gave them memory — the ability to learn from failure across attempts. But each addition comes with a condition: memory only helps when failure is diagnosable, just as hands only help when the environment provides useful observations.
Open question
Reflexion works as well as your ability to define what failure looks like — and that turns out to be a hard problem. The authors expected the paradigm to improve as LLMs get better at self-evaluation. In 2026 that’s partially true — models are better at diagnosing their own mistakes in structured domains. But the fundamental question remains: how much can you trust a language model to know when it’s wrong, and to say something useful about why? The limitation cuts both ways. When the evaluator is too precise (exact match), it catches everything but the agent can only fix what it understands. When the evaluator is too vague (WebShop’s binary signal), the reflection is useless. The sweet spot — structured but informative feedback like compiler errors — is also the most domain-specific.
The next question in the series is whether you can go further — not just learning from past failures within a task, but planning ahead across branching possibilities before committing to an action at all.