Self-Refine: What happens when you let an LLM critique its own work
Writers revise drafts. Programmers refactor code. Students re-read their essays before submitting. Iterative self-improvement is so fundamental to human work that we rarely think about it — and yet, we typically ask language models to get everything right on the first try. (Madaan, A., Tandon, N., Gupta, P., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.)
Self-Refine tests a simple idea: what if you let the same model that generated the output critique it and try again? No additional training. No external reward signal. No separate critic model. Just three prompts and a loop — generate, get feedback, refine — all from the same LLM.
The results hold across every task they tested — seven in total, spanning dialogue, code, math, and creative generation. In every case, humans prefer the Self-Refine output to the base model’s first attempt.
The simplest idea that works
The entire mechanism is three prompts and a while loop. Given an input, the model generates an initial output. Then, using a different prompt, the same model critiques that output — identifying specific problems and suggesting improvements. Finally, using a third prompt, the model revises its original output based on the critique. Repeat until the feedback says “it looks good” or you hit a maximum iteration count.

The key architectural decision: all three prompts go to the same model. Prior work on iterative refinement typically trained separate “critic” or “editor” models. Self-Refine shows that a single strong LLM can play all three roles — generator, critic, and editor — using only few-shot prompting.
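The loop described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the `llm` helper, the prompt strings, and the "no issues found" stopping string are all assumptions standing in for real model calls and the paper's few-shot templates.

```python
def self_refine(x, llm, gen_prompt, fb_prompt, refine_prompt, max_iters=4):
    """Sketch of the Self-Refine loop: generate, critique, refine, repeat."""
    # Step 1: initial generation from the generation prompt.
    y = llm(gen_prompt + x)
    history = []
    for t in range(max_iters):
        # Step 2: the same model critiques its own output.
        fb = llm(fb_prompt + x + y)
        if "no issues found" in fb.lower():  # assumed stopping signal
            break
        history.append((y, fb))
        # Step 3: refine, conditioned on all prior attempts and critiques.
        past = "".join(yi + fi for yi, fi in history)
        y = llm(refine_prompt + x + past)
    return y
```

The only state is the growing history of (output, feedback) pairs; everything else is three prompt templates routed to the same model.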
Step through the visualization below to see both iterations. After the first refinement, the loop circles back to get feedback on the improved output. This time the feedback says “no further changes needed” and the loop stops. Toggle between dialogue and code to see the same cycle in both domains.
The important moment is iteration 2. The loop doesn’t just refine and stop — it goes back to the feedback step and asks again. Only when the critique finds nothing to fix does the stopping condition trigger. That second pass is what makes this a loop, not a pipeline: the model must judge its own improvement before it’s allowed to finish.

Both examples are drawn directly from the paper’s Figure 2. In practice, Self-Refine runs up to 4 iterations. The stopping condition is either a maximum iteration count or the model generating a “no issues found” signal in its feedback. The paper also supports extracting a scalar stop score from the feedback text.
Given an input \(x\) and a model \(\mathcal{M}\), Self-Refine first generates an initial output:

\[ y_0 = \mathcal{M}(p_{\text{gen}} \parallel x) \]
The model then critiques its own output using a feedback prompt:

\[ fb_t = \mathcal{M}(p_{\text{fb}} \parallel x \parallel y_t) \]
Finally, the model refines its output given the full history of prior attempts and feedback:

\[ y_{t+1} = \mathcal{M}(p_{\text{refine}} \parallel x \parallel y_0 \parallel fb_0 \parallel \ldots \parallel y_t \parallel fb_t) \]
Steps 2 and 3 repeat: after each refinement, the model generates new feedback \(fb_{t+1}\) on \(y_{t+1}\). The loop terminates when \(\mathrm{stop}(fb_t, t)\) holds — either the feedback says “looks good” or a maximum iteration count is reached.

Compare this to Reflexion, which also accumulates history, but across separate task attempts. Self-Refine accumulates within a single generation. Both use natural language as the “gradient” — but Reflexion optimizes across trials while Self-Refine optimizes within one.
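The stopping condition amounts to scanning the feedback text. A minimal sketch, assuming a hypothetical "Score: n/10" convention for the scalar stop score the paper mentions (the exact phrasing and threshold here are illustrative):

```python
import re

def should_stop(feedback: str, t: int, max_iters: int = 4) -> bool:
    """Halt on the iteration cap, a 'done' phrase, or a high scalar score."""
    if t >= max_iters:
        return True
    if "no further changes needed" in feedback.lower():
        return True
    # Optionally extract a scalar stop score like "Score: 9/10".
    m = re.search(r"score:\s*(\d+)\s*/\s*10", feedback, re.IGNORECASE)
    return bool(m) and int(m.group(1)) >= 9
```

In practice the "done" phrase and any score format must match whatever the few-shot feedback examples teach the model to emit.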
The math makes it look clean, but the mechanism lives in the prompts. Each of the three calls — generate, feedback, refine — uses a different few-shot prompt that teaches the model its role. The feedback prompt is the most interesting: it shows the model examples of multi-aspect critiques, and the model learns to produce the same structured analysis for new outputs.

The prompts are task-specific — each of the seven tasks has its own set of few-shot examples for generation, feedback, and refinement. But the three-step structure is identical across all tasks. The full prompt templates are available in the paper’s Appendix S and at selfrefine.info.
Review the response and provide specific feedback on each of the following aspects: Relevant, Engaging, User understanding, Content richness.
Context: "What's the best way to cook pasta?"
Response: "The best way to cook pasta is to..."
Feedback:
Relevant: Addresses the question but lacks detail on technique.
Engaging: Dry and list-like. Doesn't invite further conversation.
User understanding: Doesn't consider whether the user is a beginner or experienced cook.
Content richness: Missing tips on timing, salt ratios, or sauce pairing.
Context: "I am interested in playing Table tennis."
Response: "I'm sure it's a great way to socialize, stay active."
Feedback:
Click through the three tabs above. The feedback prompt is the one doing the heavy lifting: by showing the model examples of structured, multi-aspect critiques, it learns to decompose “what’s wrong” into specific, addressable dimensions. The refine prompt then gives it examples of how to act on that structured feedback. This is why generic feedback (“improve this”) fails — it bypasses the structured decomposition that the few-shot examples teach.
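Assembling a feedback prompt of this shape is mostly string concatenation. A sketch in Python, with illustrative wording — the aspect names mirror the dialogue task, but the exact templates are the paper's (Appendix S, selfrefine.info), not this code:

```python
# Aspect names from the dialogue feedback prompt; wording is illustrative.
ASPECTS = ["Relevant", "Engaging", "User understanding", "Content richness"]

FEWSHOT_EXAMPLE = (
    'Context: "What\'s the best way to cook pasta?"\n'
    'Response: "The best way to cook pasta is to..."\n'
    "Feedback:\n"
    "Relevant: Addresses the question but lacks detail on technique.\n"
    "Engaging: Dry and list-like.\n"
)

def feedback_prompt(context: str, response: str) -> str:
    """Build a multi-aspect critique prompt: instruction, example, new query."""
    header = ("Review the response and provide specific feedback on each "
              "of the following aspects: " + ", ".join(ASPECTS) + ".\n\n")
    query = f'Context: "{context}"\nResponse: "{response}"\nFeedback:'
    return header + FEWSHOT_EXAMPLE + "\n" + query
```

The prompt ends at "Feedback:" so the model's continuation is the structured critique itself.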
Seven tasks, one loop
The evaluation covers seven tasks designed to stress-test the loop on different kinds of output: dialogue, sentiment transfer, acronym generation, code optimization, code readability, math reasoning, and constrained generation (building a sentence from up to 30 given keywords). The question isn’t whether Self-Refine helps on one task — it’s whether the same loop structure generalizes.

Prior iterative refinement work (PEER, Self-Correction) trained separate refiners per task. Self-Refine uses the same loop for all seven, changing only the few-shot examples in each prompt.
The gap between base GPT-4 and +Self-Refine varies wildly by task — and the pattern tells you where self-critique has traction.
Dialogue Response is the most dramatic: base GPT-4 scores 25.4% preference, +Self-Refine reaches 74.6% — a 49-point gain. Constrained Generation jumps from 15.0% to 45.0%. These are the tasks where the initial output has the most room for improvement and where the model can identify specific shortcomings.

Is this just because Self-Refine generates more output? The paper tests this: they compare Self-Refine against generating k=4 independent samples and picking the best one. Humans still prefer Self-Refine’s output over all four samples. Feedback-guided revision beats undirected sampling.

(Lin, B. Y. et al. (2020). CommonGen: A Constrained Text Generation Challenge. Findings of EMNLP 2020.) Constrained Generation requires including up to 30 given keywords in a coherent sentence. The model frequently misses concepts on the first try. Self-Refine’s feedback catches the missing keywords and the refinement incorporates them.
Math Reasoning is the outlier: 92.9% → 93.1%, a gain of just 0.2 points. That’s not noise — it reflects a real limitation. The model’s own feedback on math is often useless. ChatGPT generates “everything looks good” for 94% of math instances, even when the answer is wrong.

(Cobbe, K. et al. (2021). Training Verifiers to Solve Math Word Problems. GSM8K.) The math benchmark is GSM8K — 1,319 grade-school math problems. The paper shows that when an external oracle identifies incorrect answers, Self-Refine gains jump to 5%+. The bottleneck isn’t revision ability — it’s self-evaluation.
The pattern holds across model sizes — GPT-3.5, ChatGPT, and GPT-4 all improve — but the striking finding is that GPT-4 + Self-Refine beats ChatGPT + Self-Refine even on tasks where base GPT-4 scored lower than base ChatGPT. Stronger models don’t just generate better first drafts. They generate better critiques.
Specific feedback is the mechanism
Is it the iteration that helps, or the feedback itself? The paper answers this with a clean ablation: replace Self-Refine’s specific, multi-aspect feedback with generic instructions like “Improve the efficiency of the code” and keep everything else — the loop, the prompts, the model — exactly the same.
The results are stark. In Sentiment Reversal, specific feedback scores 43.2, generic feedback drops to 31.2, and no feedback at all scores zero — the model can’t reverse sentiment without being told what’s wrong. Code Optimization shows a smaller but consistent gradient: 27.5 → 26.0 → 24.8.

This parallels what the Reflexion paper found with ALFWorld vs WebShop. In ALFWorld, the failure signal was specific enough to produce actionable reflections. In WebShop, the signal was too vague, and the reflections were useless platitudes. Both papers converge on the same insight: the quality of the critique determines the quality of the improvement.
Actionable feedback means naming the problem. “Avoid repeated calculations in the for loop” beats “Improve the efficiency of the code” because it gives the refiner a specific target. The model already knows how to fix things — it needs to be told what’s broken.
Diminishing returns, not zero returns
Most improvement comes from the first feedback-refine iteration. But every subsequent iteration still helps. The iteration curve tells you something the aggregate results don’t: where the model’s ability to critique itself runs out.
Constrained Generation gains 20.7 points total across three iterations, but 11.3 of that comes from y₀ → y₁ alone. Code Optimization follows the same curve: 5.0 points in the first iteration, 1.8 more across the next two.

The performance doesn’t always increase monotonically. In multi-aspect tasks like Acronym Generation, improving one quality dimension (pronounceability) can degrade another (relevance to the title). The paper handles this by generating numerical scores for each aspect and selecting the best output across iterations, not just the last one.
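That best-across-iterations selection can be sketched as a simple argmax over per-aspect scores. The aspect names and scoring scheme below are placeholders, not the paper's exact format:

```python
def select_best(outputs, scores):
    """Pick the iterate with the highest total score, not the last one.

    outputs: list of candidate texts, one per iteration.
    scores:  per-output dicts of aspect -> numeric score, as parsed
             from the model's feedback (format is an assumption here).
    """
    totals = [sum(s.values()) for s in scores]
    best = max(range(len(outputs)), key=totals.__getitem__)
    return outputs[best]
```

This is why a non-monotonic iteration curve isn't fatal: a later iteration that trades pronounceability for relevance can simply lose the final selection.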
The diminishing returns suggest the model is converging on its quality ceiling for the task. The first iteration catches the most obvious flaws. Each subsequent iteration finds subtler issues. Eventually the model’s ability to critique plateaus — it can’t identify problems it doesn’t recognize.
Where self-critique breaks down
Two failure modes emerge clearly. The first is mathematical reasoning, where the model can’t reliably identify errors in its own chains. A consistent-looking derivation can deceive even the critic. When the paper adds an external oracle that flags incorrect answers, Self-Refine on math jumps by 5%+ — the model can fix errors once told where they are. The problem isn’t revision ability. It’s self-evaluation.

(Kamoi, R. et al. (2024). When Can LLMs Actually Correct Their Own Mistakes? TACL 2024.) This comprehensive survey confirmed the pattern: no prior work demonstrates successful self-correction with prompted feedback alone, except in tasks “exceptionally suited” for it — like code, where outputs can be verified. Self-Refine’s math failure foreshadowed a finding the field spent two years confirming.
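The oracle ablation amounts to swapping the feedback step for an external check. A hypothetical sketch, with `is_correct` standing in for the oracle and a canned feedback string replacing self-critique:

```python
def refine_with_oracle(x, y, llm, refine_prompt, is_correct, max_iters=4):
    """Refinement loop driven by an external oracle instead of self-feedback."""
    for _ in range(max_iters):
        if is_correct(x, y):  # external verification, not self-critique
            return y
        # The oracle only flags that the answer is wrong; the model
        # still performs the revision itself.
        fb = "The answer is incorrect; re-check the reasoning."
        y = llm(refine_prompt + x + y + fb)
    return y
```

The revision machinery is unchanged; only the error signal is different, which is exactly why the math gains jump once the oracle is added.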
The comparison below makes this concrete. On the left, code optimization: the feedback names the problem (nested loops), suggests the fix (dynamic programming), and the refiner implements it. On the right, math: the model gets the answer wrong, the feedback says “looks correct,” and nothing changes.
The second failure mode is model size. The main experiments use GPT-3.5 and GPT-4 — large, instruction-tuned models. When the paper tries Self-Refine with Vicuna-13B, a much smaller open-source model (13 billion parameters vs GPT-4’s estimated ~1.8 trillion), it breaks down. Vicuna can generate initial outputs, but it can’t reliably produce structured feedback. Instead of critiquing, it generates generic assistant-like responses or simply repeats its original output. Self-Refine needs the model to be capable enough to be its own teacher — and at 13B parameters, it isn’t.

An interesting hybrid: the paper experiments with using Vicuna-13B for initial generation and ChatGPT for feedback and refinement. On Math Reasoning, Vicuna alone reaches 24.18%. With ChatGPT handling the critique, it improves to 40.5%. The bottleneck is the weakest role, not the average.
The bigger picture
Self-Refine appeared in March 2023 — the same month as Reflexion. The two papers discovered the same fundamental insight from opposite directions. Reflexion showed that an agent can improve across complete task attempts by reflecting on failures in natural language. Self-Refine showed that a model can improve within a single generation by critiquing and revising its own draft. Both proved that natural language can serve as a learning signal — no gradients, no fine-tuning, just text.

The comparison to Self-Correction (Welleck et al., 2022) is instructive. Self-Correction trains a separate refiner model per task. Self-Refine uses the same base model for everything, with only the prompts changing. On GSM8K with the same base model (GPT-3), Self-Correction reaches 45.9%; Self-Refine reaches 55.7%. Prompting beats training here because it accesses the full model’s capabilities rather than a task-specific fine-tune.
Self-Refine’s real contribution isn’t the refinement loop — it’s the proof that a model can meaningfully evaluate and improve its own work, using nothing but prompting. Its real limitation is that this only works when the model can actually tell what’s wrong.
The distinction between Self-Refine and Reflexion maps onto a distinction that runs through this whole series. Self-Refine is about output quality within a turn — making the draft better. Reflexion is about learning across turns — getting smarter about the task itself. One polishes; the other learns. Together, they define the space between single-pass generation and traditional reinforcement learning.
The arc from CoT through ReAct and Reflexion to Self-Refine traces a progression in what language models do with their own outputs. CoT let models think before answering. ReAct let them act on the world mid-thought. Reflexion let them learn from failure across attempts. Self-Refine adds the final piece: the ability to critique and revise in real time.
The writing process, as a prompting strategy.

(Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback.) Anthropic’s Constitutional AI uses a similar self-critique loop for alignment — the model evaluates its own responses against a set of principles and revises them. Where Self-Refine optimizes task quality, RLHF + CAI optimizes for safety and helpfulness. Same mechanism, different objective function.
Open question
The math results expose the ceiling: a model can only improve what it can evaluate. When the task requires identifying subtle errors in reasoning chains, self-critique fails — the same blind spots that produced the error also prevent the model from seeing it. The 94% false-positive rate on ChatGPT math feedback isn’t an implementation detail. It’s the fundamental constraint.
The question Self-Refine leaves open is whether this constraint is permanent or temporary. As models get stronger, does self-evaluation improve proportionally? Or is there something inherently circular about asking a system to judge the limits of its own knowledge?

Subsequent work has sharpened this question. Huang et al. (ICLR 2024) argued LLMs “cannot self-correct reasoning yet.” Kamoi et al.’s TACL 2024 survey confirmed it systematically: self-correction only works with reliable external feedback or large-scale fine-tuning. But Self-Refine’s code results show the boundary isn’t “can vs. can’t” — it’s whether the model can verify the output in its head. Code: yes. Math: no.
The math results suggest the second. The code results suggest the first — models are excellent at critiquing code because they can mentally execute it and check against the spec. The boundary between these cases — where self-critique works and where it doesn’t — remains the central open question in this line of work.