Replicating Reflexion: What happens when you actually run the code
The Reflexion post ended with an open question: how much can you trust a language model to know when it’s wrong, and to say something useful about why? The paper predicted that stronger models would produce better reflections and improve faster. I reimplemented the framework and ran it with models from 2024 to find out.

Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. The original implementation used text-davinci-003, a completion model OpenAI deprecated in January 2024.
I want to be upfront about what this post is. It’s a reimplementation exercise: I rebuilt the Reflexion code, watched it fail, and tried to understand why. It is not a full replication study. I ran 20 tasks per condition rather than the paper’s 134 — partly cost, partly because that wasn’t the point. Twenty tasks is enough to see patterns, observe failure modes, and develop intuitions about what’s actually happening inside the framework.
The headline observation: within this experiment, swapping in a stronger model didn’t help — and adding reflection actively hurt it. gpt-4o-mini, without reflection: 85%. With reflection: 65%. That’s not a statistical claim at n=20. But it’s a consistent pattern across trials, and it has a mechanism I can trace through the transcripts.
What I built
The setup mirrors the paper: a ReAct agent tries to complete household tasks in ALFWorld, and on failure, a reflector writes a natural-language post-mortem that gets prepended to the next trial’s prompt. ALFWorld (Shridhar et al., 2021) is a text-based household task simulator — 134 environments across 6 task types — where the agent interacts through natural-language commands like go to fridge 1, take knife 2, clean knife 2 with sinkbasin 1. Up to 5 trials per task, 20 tasks total, 4 conditions:
- GPT-3.5-turbo baseline — ReAct only, no reflection
- GPT-3.5-turbo + Reflexion — ReAct with reflection after failures
- GPT-4o-mini baseline — ReAct only, no reflection
- GPT-4o-mini + Reflexion — ReAct with reflection after failures
Same provider, same API, same prompts. The only variable is model capability. I also used the same few-shot examples as the paper, which trace back to the ReAct repo: real ALFWorld walkthroughs hand-annotated with > think: steps.

My gpt-3.5-turbo-0125 is a chat model; the paper’s text-davinci-003 was a completion model. That formatting difference alone caused a 20-point performance swing in my runs when I had a bug in my chat-API port. Anyone porting ReAct or Reflexion prompts from the completion API to a chat API will hit some version of this.
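The loop all four conditions share can be sketched in a few lines. This is my reading of the setup, not the paper’s reference code: `run_react_trial` and `reflect` stand in for the actor and reflector LLM calls, and all names are mine.

```python
# Sketch of the Reflexion outer loop: try, fail, reflect, retry with memory.
# `run_react_trial` and `reflect` are placeholders for the LLM-backed actor
# and reflector; the signatures are my own, not the reference repo's.

def run_reflexion(task, run_react_trial, reflect, max_trials=5, window=3):
    """Run up to `max_trials` attempts, feeding past reflections forward."""
    reflections = []
    for trial in range(max_trials):
        # The actor only ever sees the last `window` reflections.
        success, trajectory = run_react_trial(task, reflections[-window:])
        if success:
            return True, trial + 1
        # On failure, the reflector writes a post-mortem that gets
        # prepended to the next trial's prompt.
        reflections.append(reflect(task, trajectory))
    return False, max_trials
```

The reflection-off baselines are just this loop with `max_trials=1` per attempt and no memory carried forward.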
The headline result
These numbers are not comparable to the paper’s. The base models are different, the API format is different, and n=20 per condition isn’t enough for statistical claims. The comparisons that matter are within-experiment: baseline versus Reflexion, weaker model versus stronger.
GPT-3.5 with reflection ended at 50%, without at 45%. A marginal improvement that stabilizes early. GPT-4o-mini with reflection ended at 65%, without at 85%. Stronger model, reflection on, performance goes down — and it stays down across all five trials.

At n=20 these differences are directional, not statistically significant. What I can do is look at what actually happened across all five trials of each environment, trace the failure modes, and ask whether there’s a coherent mechanism. There is — and it splits cleanly by model.
Inside the prompt
ALFWorld has a fixed, narrow command vocabulary. You can go to, take, put, open, clean, use. That’s roughly it. “Examine the alarmclock with the desklamp” maps to a specific three-step sequence: find and take the alarmclock, go to the desklamp, type use desklamp 1. The environment doesn’t care about your reasoning — it checks whether you executed the right commands in the right order.
The few-shot examples teach exactly this — not how to think, but what to type. Here’s what one looks like:
Your task is to: put a clean lettuce in diningtable.
> think: I need to find a lettuce, clean it, then put it in diningtable.
> go to fridge 1
The fridge 1 is closed.
> open fridge 1
In it, you see a cup 3, a egg 2, a potato 3.
> go to diningtable 1
On the diningtable 1, you see a lettuce 1…
> take lettuce 1 from diningtable 1
> go to sinkbasin 1
> clean lettuce 1 with sinkbasin 1
You clean the lettuce 1.
> put lettuce 1 in/on diningtable 1
Task completed.
The think step names the goal once, and then the model just executes. No mid-task reasoning, no replanning. The examples are motor patterns — they show the model which verbs to use and in what order.
Every trial is built from the same template: few-shot examples, then the memory block, then the current task. More formally, the context on trial t is:

C_t = E ∥ M_t ∥ Q

where E is the fixed few-shot examples, Q is the current task, and M_t is the memory block:

M_t = (r_1, …, r_{t−1})

On trial 1, M_1 is empty — the model reads the examples and the task with nothing in between. By trial 4, the memory block is full: three reflections, several hundred tokens of natural-language plans, sitting between the few-shot action patterns and the actual task. That distance is doing a lot of work in the failure mode we’re about to see.
The baseline already knows the pattern
Here’s GPT-4o-mini solving the alarmclock task without reflection. The first think step mirrors the few-shot examples almost verbatim:
> think: I need to find and take an alarm clock, then find and use a desklamp.
… [searches for the alarm clock and desklamp] …
> use desklamp 1
Task Completed
It’s not reasoning — it’s pattern-matching against the examples. In ALFWorld, that’s exactly the right move.
Reflection breaks the pattern
Same model, same task, but now reflections sit between E and Q. The first think step has already diverged:
> think: I need to find an alarm clock and then use the desklamp to examine it.
> go to desk 2
On the desk 2, you see a alarmclock 1.
> examine alarmclock 1 with desklamp 1
Nothing happens.
> examine alarmclock 1 with desklamp 1
Nothing happens.
> examine alarmclock 1 with desklamp 1
Nothing happens.
… [keeps looping] …
Instead of use desklamp 1, it produces examine alarmclock 1 with desklamp 1 — a command that doesn’t exist in ALFWorld’s vocabulary. The reflections between E and Q have pushed the model away from the action patterns it was copying correctly without them.
Six failure patterns
Looking at individual reflections tells you what the model said. To understand whether reflection actually helped, you have to look at the full arc — all five trials of each environment, side by side. I had three independent LLM agents classify each of the 20 failed environments by their overall trajectory pattern. They agreed on 60% of labels unanimously; the remaining disagreements were edge cases between adjacent categories. Six patterns emerged.

I spot-checked all labeling decisions manually. The LLM annotations were used to speed classification of the trajectories, not to substitute for reading them.

Two of the six patterns account for 60% of all failures — and they split cleanly by model.
Syntax wall — 30%, mostly GPT-3.5
The agent has the right strategy. It finds the object, performs the intermediate step, goes to the destination. Then it hits the final action and can’t produce the four characters ALFWorld needs.
The task is “put a cool mug in cabinet.”
> put mug 3 in cabinet 1
Nothing happens.
> put cool mug 3 in cabinet 1
Nothing happens.
> put the cooled mug 3 in cabinet 1
Nothing happens.
The few-shot examples show put X in/on Y. The model writes put X in Y. Five trials, dozens of put attempts, never produces in/on. The reflection after this trial says: “I should go to cabinet 1, open it, and then put the cooled mug 3 inside the cabinet.” Not in/on cabinet 1. The reflector operates in natural language — it says what to do, but natural language doesn’t preserve the exact command token ALFWorld requires.

GPT-3.5 hits the in/on wall 523 times across its runs. GPT-4o-mini hits it only 67 times — it’s better at copying format. But neither model’s reflector ever catches it. Zero out of 89 reflections mention the in/on syntax.
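Counting these wall-hits mechanically only requires checking for the literal in/on token. A sketch of the check; the regex is my simplification of the put rule, not ALFWorld’s actual parser:

```python
import re

# ALFWorld's put action only parses with the literal token "in/on".
# This regex is my simplified stand-in for that one grammar rule; the real
# parser handles more object-name shapes than \w+ does.
PUT_CMD = re.compile(r"^put \w+ \d+ in/on \w+ \d+$")

def is_valid_put(action):
    """True if an action string matches the put-command shape ALFWorld accepts."""
    return bool(PUT_CMD.match(action.strip()))
```

Running a check like this over transcripts is what turns “the model never produces in/on” from an impression into a count.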
Degradation over trials — 30%, mostly GPT-4o-mini
This is the pattern I didn’t expect. GPT-3.5 fails repeatedly but stably — it hits syntax walls and stays stuck. GPT-4o-mini doesn’t just fail to improve. It degrades.
The saltshaker-on-drawer task. Trial 1: the agent searches the room, finds the saltshaker, picks it up, goes to the drawer, tries several put commands. Clean ALFWorld actions throughout, 34 steps, all valid. Trial 2, first action:
> To solve the task of putting a saltshaker on a drawer, I will follow my plan step by step.
Nothing happens.
> Let’s reassess the situation and ensure we follow the plan correctly.
Nothing happens.
The model has stopped using the action syntax. Instead of > think: followed by a command, it outputs free-form text — narrating a plan, not executing one. All 49 actions in this trial are plain prose. Zero valid ALFWorld commands.
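A crude detector for this mode switch is to count what fraction of the agent’s output lines even start like an ALFWorld command. A sketch; the verb list is abridged from the vocabulary above, and the helper is mine:

```python
import re

# Prefixes of ALFWorld's command vocabulary (abridged; my own list, assembled
# from the commands that appear in the transcripts above).
ACTION = re.compile(
    r"^(think:|go to |take |put |open |close |clean |heat |cool |use |examine )"
)

def command_fraction(actions):
    """Fraction of action strings that start like a valid ALFWorld command.

    Near 1.0 means the agent is emitting commands; near 0.0 means it has
    switched into free-form prose.
    """
    if not actions:
        return 0.0
    valid = sum(1 for a in actions if ACTION.match(a.strip()))
    return valid / len(actions)
```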
The most plausible mechanism: GPT-4o-mini’s reflections are verbose and structured — numbered steps, bold headers, multi-paragraph plans. By trial 3, two of these sit in the prompt. The few-shot examples at the top get drowned out. GPT-3.5 doesn’t do this nearly as much — its reflections are shorter, messier, run-on paragraphs rather than numbered plans. They don’t trigger the mode switch.

Ironically, the weaker model’s lower-quality reflections cause less damage. GPT-3.5’s short, messy reflections stay out of the way. GPT-4o-mini’s polished, structured reflections compete with the few-shot examples for the model’s attention.
Frozen reflection — 15%
The agent fails, reflects, and then writes the same reflection four more times. Sometimes word for word. Sometimes with one sentence appended. The sliding window means the actor sees three copies of the same plan in its prompt. Three times it reads “put the cooled mug inside the cabinet.” Three times it produces put mug 3 in cabinet 1. Three times ALFWorld returns “Nothing happens.”
This is partly a framework design issue. The reflector prompt asks for a “New plan:” but nothing tells it that its output will be appended alongside previous plans, not replacing them. So it writes a complete standalone plan each time — naturally 90% identical to the previous one.

This is what the literature calls “degeneration of thought.” Retroformer (Yao et al., 2023) explicitly documents the same phenomenon — “uninformative self-reflections” where the reflector rephrases the prior plan, which prompts the agent to repeat those exact steps.
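One way to flag frozen reflections automatically is a near-duplicate check over consecutive reflections. A sketch using Python’s difflib; the 0.9 threshold is my choice, not a value from the framework:

```python
from difflib import SequenceMatcher

# Flag consecutive reflections that are near-copies of each other.
# The 0.9 similarity threshold is an assumption of this sketch.
def frozen_pairs(reflections, threshold=0.9):
    """Indices i where reflection i+1 is a near-duplicate of reflection i."""
    return [
        i for i in range(len(reflections) - 1)
        if SequenceMatcher(
            None, reflections[i], reflections[i + 1]
        ).ratio() >= threshold
    ]
```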
Reflection hurt — 10%, GPT-3.5 only
Sometimes the reflection doesn’t just fail to help — it introduces a wrong hypothesis that makes subsequent trials worse. The task is “put a cool tomato in microwave.” Trial 1 does the right thing: finds the tomato, cools it, goes to the microwave. The put command fails (syntax). The reflection concludes:
“The task simply required putting a cool tomato in the microwave. The correct action should have been to take the tomato directly from the countertop and put it in the microwave without cooling it first.”
Cooling is required — it’s a pick_cool_then_place task. But the reflector, unable to diagnose the real problem, invents a plausible-sounding alternative. The model follows it faithfully for the next four trials, skipping the cooling step each time. Every trial fails.
What reflections blame instead
I categorized all 89 reflections across both models with three independent blind annotators (89% agreement on content categories). The picture that emerges is stark: the dominant behavior for both models is repeating the previous plan.

60–78% of reflections are near-verbatim copies of the previous trial’s plan. The reflection framework’s prompt design doesn’t tell the reflector that its output will be shown alongside previous plans, so it rewrites the whole plan each time. The repetition is structural, not a capability gap.
Not one of the 89 reflections names the in/on syntax as the failure cause. Here’s GPT-3.5’s reflection on the cool-mug-in-cabinet task:
“After cooling the mug with the fridge, I will go to cabinet 1, open it, and then put the cooled mug 3 inside the cabinet. This will directly address the task at hand and avoid unnecessary attempts at placing the mug in other locations.”
And here’s GPT-4o-mini on the same task:
Plan:
1. First, I will find and take the mug from countertop 1.
2. Next, I will cool the mug in the fridge.
3. Before attempting to put the mug in a cabinet, I will check if the cabinet is suitable for placing a mug by examining its contents to ensure it has space.
4. I will then put the cooled mug in the first cabinet that is confirmed to be suitable…
The GPT-4o-mini reflection is longer, more structured, and more confident. It has numbered steps and bold headers. It looks better. But it’s wrong in exactly the same way — it’s a strategy-level plan for a syntax-level problem. Both reflections say “put the mug in the cabinet.” Neither says “try put mug 3 in/on cabinet 1.”
What the literature already knew
After writing up these results, I went looking for prior work. I wasn’t the first to notice any of this.
Huang et al. showed that without an external oracle signal, LLM performance typically degrades after self-correction. The mechanism is the same one my data points at: the model can’t reliably distinguish a correct trajectory from a flawed one, so it “fixes” things that aren’t broken.

Huang, J. et al. (2023). Large Language Models Cannot Self-Correct Reasoning Yet. ICLR 2024. The paper surveyed self-correction claims across the literature and found that most successes used oracle feedback (the ground truth) to guide correction — not the model’s own judgment.
Kamoi et al. surveyed the broader literature and reached a stark conclusion: no prior work demonstrates successful self-correction with feedback from prompted LLMs, except in tasks that are exceptionally suited to it. Self-correction works when reliable external feedback is available. Otherwise, it doesn’t.

The consistent empirical condition across this literature: self-correction is most effective when initial accuracy is low, question difficulty is high, and external verification is available. For high-performing base models on easier tasks, reflection reliably hurts. GPT-4o-mini on ALFWorld fits that profile exactly.
Retroformer is particularly relevant. It explicitly documents what the authors call “uninformative self-reflections” from frozen LLMs: the reflector rephrases prior failed action sequences as the proposed plan, which prompts the agent to repeat those steps in the next trial. My “frozen reflection” failure pattern is the same phenomenon. Retroformer’s response is to fine-tune the reflector with RL rather than relying on a frozen model — a very different architectural direction than just swapping in a stronger base model.

Yao, W. et al. (2023). Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization. ICLR 2024.
So the general finding — that stronger models are hurt more by reflection — was already consensus by mid-2024. What I didn’t find in the literature: the specific mode-switch failure where accumulated reflections cause the model to abandon the action syntax entirely, outputting free-form prose instead of environment commands. The general principle is known; this specific mechanism isn’t.
What I learned from running this
Separate from the results, the process of actually running this code taught me things I wouldn’t have gotten from reading the paper.
Start from the reference implementation, not the paper’s description. Most bugs came from reimplementing based on how the paper describes the system rather than how the code actually works. The consecutive user message bug, the evaluator step counting, the observation preprocessing — all of these are in the repo but not in the paper.

The chat-API migration alone silently cost 20 percentage points. No error, no warning. The prompt looked right. The results just quietly degraded. I only caught it because 45% was the number the paper reported, and 25% was suspiciously far off.
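The consecutive-user-message bug is representative of the port. A completion-style ReAct transcript, split naively into chat messages, yields back-to-back user turns (observation after observation), which some chat endpoints reject or silently mishandle. Merging adjacent same-role messages is one workaround; the helper below is a sketch of mine, and only the message shape follows the standard chat format:

```python
# Merge adjacent messages that share a role, so a naively ported transcript
# never contains two consecutive "user" turns. The dict shape
# {"role": ..., "content": ...} follows the common chat-API format.
def merge_consecutive_roles(messages):
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Fold this message into the previous one of the same role.
            merged[-1] = {
                "role": msg["role"],
                "content": merged[-1]["content"] + "\n" + msg["content"],
            }
        else:
            merged.append(dict(msg))
    return merged
```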
Benchmarks age faster than you expect. ALFWorld was a meaningful test of agent reasoning in 2022. In 2026, its text parser is the bottleneck — and that’s not what anyone set out to evaluate. The single biggest failure mode in my experiment is a four-character string that the benchmark’s parser requires and that verbal reflection fundamentally cannot diagnose. Before running the next reimplementation, I want to spend more time asking whether the benchmark is still measuring what the paper claims.
The bigger picture
The Reflexion paper’s bet was that stronger models would produce better reflections. In this experiment, the opposite happened — and the mechanism is instructive.

The idea of iterative verbal self-improvement was in the air in early 2023. Self-Refine (Madaan et al., 2023) appeared the same month as Reflexion, using a similar generate → feedback → refine loop. Both proved that natural language feedback can substitute for gradient updates — but both hit the same wall when the feedback isn’t grounded.
Verbal reflection can’t fix problems at a level of abstraction it doesn’t operate at. The reflector reasons about what to do. The failures are about how to phrase the command.
That’s not a bug in GPT-4o-mini — it’s a fundamental mismatch between the reflection mechanism and the failure mode. The reflector operates in natural language. The failure is syntactic. Natural language is the wrong medium for diagnosing syntax errors in a text parser’s command vocabulary. And the stronger model’s more confident, more structured reflections don’t bridge this gap — they widen it, by pushing the prompt further from the action patterns that were already working.
The Reflexion post concluded that the mechanism works as well as your ability to define what failure looks like. This experiment adds a sharper version of the same claim: even when failure is definable, reflection only helps if the failure lives at the level of abstraction that natural language can reach. When it doesn’t — when the problem is a four-character string, or a mode switch caused by prompt structure — more capable reflection makes things worse, not better.
Caveats
This is a preliminary study on 20 environments per condition, not a conclusive result. A few specific confounds worth naming: GPT-3.5-turbo is not text-davinci-003 — the original model was deprecated, so the baseline is approximate. The in/on syntax issue may be specific to ALFWorld — other environments with more flexible parsers might show different results. The degradation pattern, while consistent across the 5 environments where it appeared, needs validation at larger scale. Maybe these findings don’t generalize beyond this benchmark and these models. But the mechanism I can trace — verbose reflections drowning out few-shot patterns, reflection reasoning at the wrong level of abstraction — seems general enough to be worth investigating further.