Self-Refine: What happens when you let an LLM critique its own work


Writers revise drafts. Programmers refactor code. Students re-read their essays before submitting. Iterative self-improvement is so fundamental to human work that we rarely think about it — and yet, we typically ask language models to get everything right on the first try.

Madaan, A., Tandon, N., Gupta, P., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.

Self-Refine tests a simple idea: what if you let the same model that generated the output critique it and try again? No additional training. No external reward signal. No separate critic model. Just three prompts and a loop — generate, get feedback, refine — all from the same LLM.

The results hold across every task they tested — seven in total, spanning dialogue, code, math, and creative generation. In every case, humans prefer the Self-Refine output to the base model’s first attempt.


The simplest idea that works

The entire mechanism is three prompts and a while loop. Given an input, the model generates an initial output. Then, using a different prompt, the same model critiques that output — identifying specific problems and suggesting improvements. Finally, using a third prompt, the model revises its original output based on the critique. Repeat until the feedback says “it looks good” or you hit a maximum iteration count.

The key architectural decision: all three prompts go to the same model. Prior work on iterative refinement typically trained separate “critic” or “editor” models. Self-Refine shows that a single strong LLM can play all three roles — generator, critic, and editor — using only few-shot prompting.
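As a minimal sketch in Python, assuming an `llm(prompt, *context)` callable; `P_GEN`, `P_FB`, and `P_REFINE` are stand-ins for the paper's three few-shot prompts, not its actual code:

```python
P_GEN, P_FB, P_REFINE = "generate", "feedback", "refine"  # placeholders for few-shot prompts

def self_refine(llm, x, max_iters=4):
    """One model plays generator, critic, and editor (illustrative sketch)."""
    y = llm(P_GEN, x)                          # initial draft y0
    history = [y]
    for _ in range(max_iters):
        fb = llm(P_FB, x, y)                   # same model critiques its own output
        if "no issues" in fb.lower():          # stop signal embedded in the feedback
            break
        y = llm(P_REFINE, x, y, history, fb)   # revise, conditioned on prior attempts
        history += [fb, y]
    return y
```

The paper caps the loop at four iterations; the crude string match here stands in for the paper's "no issues found" signal.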

The figure below walks through both iterations. After the first refinement, the loop circles back to get feedback on the improved output. This time the feedback says “no further changes needed” and the loop stops. The same cycle applies in both the dialogue and code domains.

[Figure: The Self-Refine loop — Input → Generate → Feedback → Refine → Feedback → stop? Example input: “I am interested in playing Table tennis.”]

The important moment is iteration 2. The loop doesn’t just refine and stop — it goes back to the feedback step and asks again. Only when the critique finds nothing to fix does the stopping condition trigger. That second pass is what makes this a loop, not a pipeline: the model must judge its own improvement before it’s allowed to finish.

Both examples are drawn directly from the paper’s Figure 2. In practice, Self-Refine runs up to 4 iterations. The stopping condition is either a maximum iteration count or the model generating a “no issues found” signal in its feedback. The paper also supports extracting a scalar stop score from the feedback text.
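The scalar stop score can be a simple parse over the feedback text. A sketch, assuming (hypothetically) that the feedback prompt asks the model to end its critique with a line like `Score: 8/10`:

```python
import re

def stop_score(feedback: str) -> float:
    """Map feedback text to a 0-1 stop score (assumed 'Score: n/10' convention)."""
    m = re.search(r"score:\s*(\d+(?:\.\d+)?)\s*/\s*10", feedback, re.IGNORECASE)
    if m:
        return float(m.group(1)) / 10.0
    # fall back to a textual stop signal
    return 1.0 if "no further changes needed" in feedback.lower() else 0.0
```

The loop would then stop when the score crosses a threshold, e.g. `stop_score(fb) >= 0.9`.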

The formal intuition

Given input x and a model M, Self-Refine first generates an initial output:

y₀ = M(p_gen ‖ x)

The model then critiques its own output using a feedback prompt:

fbₜ = M(p_fb ‖ x ‖ yₜ)

Finally, the model refines its output given the full history of prior attempts and feedback:

yₜ₊₁ = M(p_refine ‖ x ‖ y₀ ‖ fb₀ ‖ … ‖ yₜ ‖ fbₜ)

Steps 2 and 3 repeat: after each refinement, the model generates new feedback fbₜ on the latest output yₜ. The loop terminates when fbₜ signals that no further changes are needed or t reaches the iteration budget — either the feedback says “looks good” or a maximum iteration count is reached.

Compare this to Reflexion, which also accumulates history but across separate task attempts. Self-Refine accumulates within a single generation. Both use natural language as the “gradient” — but Reflexion optimizes across trials while Self-Refine optimizes within one.

The math makes it look clean, but the mechanism lives in the prompts. Each of the three calls — generate, feedback, refine — uses a different few-shot prompt that teaches the model its role. The feedback prompt is the most interesting: it shows the model examples of multi-aspect critiques, and the model learns to produce the same structured analysis for new outputs.

The prompts are task-specific — each of the seven tasks has its own set of few-shot examples for generation, feedback, and refinement. But the three-step structure is identical across all tasks. The full prompt templates are available in the paper’s Appendix S and at selfrefine.info.

Inside the prompts
How each step is instructed · Dialogue Response task
p_fb — Feedback. The same model critiques its own output across multiple quality dimensions.
Instruction
Review the response and provide specific feedback on each of the following aspects: Relevant, Engaging, User understanding, Content richness.
Few-shot example (input + output + critique)
Context: "What's the best way to cook pasta?"
Response: "The best way to cook pasta is to..."

Feedback:
  Relevant: Addresses the question but lacks detail on technique.
  Engaging: Dry and list-like. Doesn't invite further conversation.
  User understanding: Doesn't consider whether the user is a beginner or experienced cook.
  Content richness: Missing tips on timing, salt ratios, or sauce pairing.
Actual task → model critiques its own y₀
Context: "I am interested in playing Table tennis."
Response: "I'm sure it's a great way to socialize, stay active."

Feedback:

The feedback prompt is the one doing the heavy lifting: by showing the model examples of structured, multi-aspect critiques, it learns to decompose “what’s wrong” into specific, addressable dimensions. The refine prompt then gives it examples of how to act on that structured feedback. This is why generic feedback (“improve this”) fails — it bypasses the structured decomposition that the few-shot examples teach.
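Assembling a prompt like the one above is mechanical: instruction, few-shot exemplars, then the model's own output as the new query. A sketch (the instruction string is abbreviated from the panel above; the function name is ours):

```python
def build_feedback_prompt(examples, context, response):
    """Instruction + few-shot critiques + the model's own output to critique."""
    instruction = (
        "Review the response and provide specific feedback on each of the following "
        "aspects: Relevant, Engaging, User understanding, Content richness."
    )
    # each exemplar shows a context, a response, and a multi-aspect critique
    shots = "\n\n".join(
        f'Context: "{ex["context"]}"\n'
        f'Response: "{ex["response"]}"\n\n'
        f'Feedback:\n{ex["feedback"]}'
        for ex in examples
    )
    # the actual task: the model's own y is what gets critiqued
    query = f'Context: "{context}"\nResponse: "{response}"\n\nFeedback:'
    return "\n\n".join([instruction, shots, query])
```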


Seven tasks, one loop

The evaluation covers seven tasks designed to stress-test the loop on different kinds of output: dialogue, sentiment transfer, acronym generation, code optimization, code readability, math reasoning, and constrained generation (building a sentence from up to 30 given keywords). The question isn’t whether Self-Refine helps on one task — it’s whether the same loop structure generalizes.

Prior iterative refinement work (PEER, Self-Correction) trained separate refiners per task. Self-Refine uses the same loop for all seven, changing only the few-shot examples in each prompt.

The gap between base GPT-4 and +Self-Refine varies wildly by task — and the pattern tells you where self-critique has traction.

Self-Refine improves every task (base GPT-4 vs GPT-4 + Self-Refine, 7 tasks):

Dialogue Response:    25.4 → 74.6  (+49.2)
Sentiment Reversal:    3.8 → 36.2  (+32.4)
Constrained Gen.:     15.0 → 45.0  (+30.0)
Code Readability:     27.4 → 56.2  (+28.8)
Acronym Gen.:         30.4 → 56.0  (+25.6)
Code Optimization:    27.3 → 36.0  (+8.7)
Math Reasoning:       92.9 → 93.1  (+0.2)

Dialogue Response is the most dramatic: base GPT-4 scores 25.4% preference, +Self-Refine reaches 74.6% — a 49-point gain. Constrained Generation jumps from 15.0% to 45.0%. These are the tasks where the initial output has the most room for improvement and where the model can identify specific shortcomings.

Is this just because Self-Refine generates more output? The paper tests this: they compare Self-Refine against generating k=4 independent samples and picking the best one. Humans still prefer Self-Refine’s output over all four samples. Feedback-guided revision beats undirected sampling.

Lin, B. Y. et al. (2020). CommonGen: A Constrained Text Generation Challenge. Findings of EMNLP 2020. Constrained Generation requires including up to 30 given keywords in a coherent sentence. The model frequently misses concepts on the first try. Self-Refine’s feedback catches the missing keywords and the refinement incorporates them.
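The k-samples control is easy to sketch next to the loop: draw k independent outputs and keep the judged-best, with no feedback involved. `judge` here is a hypothetical stand-in for the human (or scoring) preference:

```python
def best_of_k(llm, x, judge, k=4):
    """Baseline: k independent samples, pick the best. No critique, no revision."""
    samples = [llm("generate", x) for _ in range(k)]
    return max(samples, key=judge)
```

The paper's finding is that one feedback-guided revision beats four such undirected draws in human preference.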

Math Reasoning is the outlier: 92.9% → 93.1%, a gain of just 0.2 points. That’s not noise — it reflects a real limitation. The model’s own feedback on math is often useless. ChatGPT generates “everything looks good” for 94% of math instances, even when the answer is wrong.

Cobbe, K. et al. (2021). Training Verifiers to Solve Math Word Problems. The math benchmark is GSM8K — 1,319 grade-school math problems. The paper shows that when an external oracle identifies incorrect answers, Self-Refine’s gains on math jump to 5%+. The bottleneck isn’t revision ability — it’s self-evaluation.

The pattern holds across model sizes — GPT-3.5, ChatGPT, and GPT-4 all improve — but the striking finding is that GPT-4 + Self-Refine beats ChatGPT + Self-Refine even on tasks where base GPT-4 scored lower than base ChatGPT. Stronger models don’t just generate better first drafts. They generate better critiques.


Specific feedback is the mechanism

Is it the iteration that helps, or the feedback itself? The paper answers this with a clean ablation: replace Self-Refine’s specific, multi-aspect feedback with generic instructions like “Improve the efficiency of the code” and keep everything else — the loop, the prompts, the model — exactly the same.

Specific feedback is the mechanism (same refinement loop, three levels of feedback quality):

Task                   No feedback   Generic (“Improve the output”)   Self-Refine (specific, actionable)
Code Optimization          24.8               26.0                            27.5
Sentiment Reversal          0.0               31.2                            43.2
Acronym Generation         48.0               54.0                            56.4

The results are stark. In Sentiment Reversal, specific feedback scores 43.2, generic feedback drops to 31.2, and no feedback at all scores zero — the model can’t reverse sentiment without being told what’s wrong. Code Optimization shows a smaller but consistent gradient: 27.5 → 26.0 → 24.8.

This parallels what the Reflexion paper found with ALFWorld vs WebShop. In ALFWorld, the failure signal was specific enough to produce actionable reflections. In WebShop, the signal was too vague, and the reflections were useless platitudes. Both papers converge on the same insight: the quality of the critique determines the quality of the improvement.

Actionable feedback means naming the problem. “Avoid repeated calculations in the for loop” beats “Improve the efficiency of the code” because it gives the refiner a specific target. The model already knows how to fix things — it needs to be told what’s broken.


Diminishing returns, not zero returns

Most improvement comes from the first feedback-refine iteration. But every subsequent iteration still helps. The iteration curve tells you something the aggregate results don’t: where the model’s ability to critique itself runs out.

Diminishing returns, not zero returns (score after each feedback-refine iteration, averaged across GPT-3.5, ChatGPT, and GPT-4):

Constrained Generation:  29.0 (y₀) → 49.7 (y₃)
Sentiment Reversal:      33.9 (y₀) → 36.8 (y₃)
Code Optimization:       22.0 (y₀) → 28.8 (y₃)

The biggest jump is always y₀ → y₁. Each subsequent iteration adds less. Constrained Generation gains +20.7 total across three iterations, but +11.3 of that comes from the first.

Constrained Generation gains 20.7 points total across three iterations, but 11.3 of that comes from y₀ → y₁ alone. Code Optimization follows the same curve: 5.0 points in the first iteration, 1.8 more across the next two.

The performance doesn’t always increase monotonically. In multi-aspect tasks like Acronym Generation, improving one quality dimension (pronounceability) can degrade another (relevance to the title). The paper handles this by generating numerical scores for each aspect and selecting the best output across iterations, not just the last one.
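That per-aspect selection can be sketched in a few lines: score every iterate on each aspect and return the best overall, not the last. Aspect names and `score_fn` are illustrative, not the paper's code:

```python
def select_best(outputs, score_fn, aspects=("relevance", "pronounceability")):
    """Keep the iterate with the highest summed aspect score across the run."""
    return max(outputs, key=lambda y: sum(score_fn(y, a) for a in aspects))
```

This guards against a later iteration that improves one aspect while regressing another.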

The diminishing returns suggest the model is converging on its quality ceiling for the task. The first iteration catches the most obvious flaws. Each subsequent iteration finds subtler issues. Eventually the model’s ability to critique plateaus — it can’t identify problems it doesn’t recognize.


Where self-critique breaks down

Two failure modes emerge clearly. The first is mathematical reasoning, where the model can’t reliably identify errors in its own chains. A consistent-looking derivation can deceive even the critic. When the paper adds an external oracle that flags incorrect answers, Self-Refine on math jumps by 5%+ — the model can fix errors once told where they are. The problem isn’t revision ability. It’s self-evaluation.

Kamoi, R. et al. (2024). When Can LLMs Actually Correct Their Own Mistakes? TACL 2024. This comprehensive survey confirmed the pattern: no prior work demonstrates successful self-correction with prompted feedback alone, except in tasks “exceptionally suited” for it — like code, where outputs can be verified. Self-Refine’s math failure foreshadowed a finding the field spent two years confirming.
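The oracle variant swaps only the stop check: instead of trusting the model's "looks correct," an external answer verifier decides whether to keep refining. A sketch with hypothetical prompt strings and an `extract_answer` helper of our own:

```python
def refine_with_oracle(llm, problem, gold_answer, extract_answer, max_iters=4):
    """Replace self-generated feedback with an external correctness oracle."""
    y = llm("solve", problem)
    for _ in range(max_iters):
        if extract_answer(y) == gold_answer:   # oracle: the reliable stop signal
            break
        # the oracle only says "wrong"; the model still does the revising
        y = llm("refine", problem, y, "The answer is incorrect; re-check each step.")
    return y
```

Nothing else in the loop changes, which is why the paper can attribute the 5%+ jump to evaluation, not revision.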

The comparison below makes this concrete. On the left, code optimization: the feedback names the problem (nested loops), suggests the fix (dynamic programming), and the refiner implements it. On the right, math: the model gets the answer wrong, the feedback says “looks correct,” and nothing changes.

Why self-feedback fails on math (same loop, same model; code example from the paper’s Figure 5):

Code Optimization — feedback works. Task: Find the minimum cost to pay a given amount using coins of 200 and 300, priced at 380 and 550.

Math Reasoning — feedback fails. Task: Janet has 3 times as many marbles as Tom. Tom has 12 marbles. Janet gives half her marbles to Tom. How many does Tom have now?

The second failure mode is model size. The main experiments use GPT-3.5 and GPT-4 — large, instruction-tuned models. When the paper tries Self-Refine with Vicuna-13B, a much smaller open-source model (13 billion parameters vs GPT-4’s estimated ~1.8 trillion), it breaks down. Vicuna can generate initial outputs, but it can’t reliably produce structured feedback. Instead of critiquing, it generates generic assistant-like responses or simply repeats its original output. Self-Refine needs the model to be capable enough to be its own teacher — and at 13B parameters, it isn’t.

An interesting hybrid: the paper experiments with using Vicuna-13B for initial generation and ChatGPT for feedback and refinement. On Math Reasoning, Vicuna alone reaches 24.18%. With ChatGPT handling the critique, it improves to 40.5%. The bottleneck is the weakest role, not the average.


The bigger picture

Self-Refine appeared in March 2023 — the same month as Reflexion. The two papers discovered the same fundamental insight from opposite directions. Reflexion showed that an agent can improve across complete task attempts by reflecting on failures in natural language. Self-Refine showed that a model can improve within a single generation by critiquing and revising its own draft. Both proved that natural language can serve as a learning signal — no gradients, no fine-tuning, just text.

The comparison to Welleck et al. (2022), Self-Correction, is instructive. Self-Correction trains a separate refiner model per task. Self-Refine uses the same base model for everything, with only the prompts changing. On GSM8K with the same base model (GPT-3), Self-Correction reaches 45.9%; Self-Refine reaches 55.7%. Prompting beats training here because it accesses the full model’s capabilities rather than a task-specific fine-tune.

Self-Refine’s real contribution isn’t the refinement loop — it’s the proof that a model can meaningfully evaluate and improve its own work, using nothing but prompting. Its real limitation is that this only works when the model can actually tell what’s wrong.

The distinction between Self-Refine and Reflexion maps onto a distinction that runs through this whole series. Self-Refine is about output quality within a turn — making the draft better. Reflexion is about learning across turns — getting smarter about the task itself. One polishes; the other learns. Together, they define the space between single-pass generation and traditional reinforcement learning.

The arc from CoT through ReAct and Reflexion to Self-Refine traces a progression in what language models do with their own outputs. CoT let models think before answering. ReAct let them act on the world mid-thought. Reflexion let them learn from failure across attempts. Self-Refine adds the final piece: the ability to critique and revise in real time.

The writing process, as a prompting strategy.

Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic’s Constitutional AI uses a similar self-critique loop for alignment — the model evaluates its own responses against a set of principles and revises them. Where Self-Refine optimizes task quality, RLHF + CAI optimizes for safety and helpfulness. Same mechanism, different objective function.


Open question

The math results expose the ceiling: a model can only improve what it can evaluate. When the task requires identifying subtle errors in reasoning chains, self-critique fails — the same blind spots that produced the error also prevent the model from seeing it. The 94% false-positive rate on ChatGPT math feedback isn’t an implementation detail. It’s the fundamental constraint.

The question Self-Refine leaves open is whether this constraint is permanent or temporary. As models get stronger, does self-evaluation improve proportionally? Or is there something inherently circular about asking a system to judge the limits of its own knowledge?

Subsequent work has sharpened this question. Huang et al. (ICLR 2024) argued LLMs “cannot self-correct reasoning yet.” Kamoi et al.’s TACL 2024 survey confirmed it systematically: self-correction only works with reliable external feedback or large-scale fine-tuning. But Self-Refine’s code results show the boundary isn’t “can vs. can’t” — it’s whether the model can verify the output in its head. Code: yes. Math: no.

The math results suggest the second. The code results suggest the first — models are excellent at critiquing code because they can mentally execute it and check against the spec. The boundary between these cases — where self-critique works and where it doesn’t — remains the central open question in this line of work.

References
Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023. arXiv:2303.17651.

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., Schulman, J. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.

Lin, B. Y., Zhou, W., Shen, M., Zhou, P., Bhagavatula, C., Choi, Y., Ren, X. (2020). CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning. Findings of EMNLP 2020. arXiv:1911.03705.

Madaan, A., Shypula, A., Alon, U., Hashemi, M., Ranganathan, P., Yang, Y., Neubig, G., Yazdanbakhsh, A. (2023). Learning Performance-Improving Code Edits. arXiv:2302.07867.

Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., Choi, Y. (2022). Generating Sequences by Learning to Self-Correct. arXiv:2211.00053.

Kamoi, R., Zhang, Y., Zhang, N., Han, J., Zhang, R. (2024). When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs. TACL 2024. arXiv:2406.01297.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.