Is That Improvement Real? A visual guide to eval statistics
You’re building an agent — tweaking the prompt, swapping tools, adjusting the harness. You run your eval and the score goes from 71% to 74%. That looks like progress — but without the right analysis, you can’t tell whether that 3-point jump is a real improvement or just a lucky draw.
If you’re making product decisions based on eval numbers, you need to know how much of that score is signal and how much is noise. And when you’re comparing two setups, the way you handle that uncertainty can flip the conclusion entirely.
This interactive guide is a companion to Anthropic’s eval guide. We walk through the statistics behind comparing two agents on a concrete example — with less assumed stats background than the original paper.
A future post will cover eval design itself — what to measure, how to build question sets, and how to tie evals to product decisions. This one focuses on the statistics.
The problem
Let's say you're comparing two travel planning agents — Atlas and Breeze. You run your eval suite and see the results on the right: a table of average scores across three benchmarks. This is how performance is typically reported — whether in a research paper, a blog post, or on your internal dashboard.
Breeze wins two out of three. Should you switch? The problem: every time you run these evals, you get slightly different numbers. Individual scores and averages shift between runs. A table like this doesn't tell you whether the difference is real — or just noise from this particular run.
We'll tackle this in two parts. First, we'll learn how to quantify the uncertainty in a single agent's score — turning a bare number into a range. Then we'll use that to compare two agents and figure out if the difference between them is real.
Part 1 — Measuring one agent
One task, many outcomes
When we talk about an eval score in this guide, we mean a number between 0 and 1 — where 0 is a total fail and 1 is a perfect pass. We write it as $x_i$, where the subscript $i$ refers to the task number — so $x_1$ is the score on task 1, $x_2$ on task 2, and so on.
This number is not absolute. LLMs always have some randomness to them — sampling temperature, non-deterministic decoding — and on agentic tasks with long multi-step trajectories, that randomness compounds. A slightly different first step can lead to a completely different path. So the same agent on the same task can produce a different score each time.
Take the task “Find the cheapest round-trip flight to NYC under $500.” We run it 12 times; each dot in the figure is one run. The agent has some true ability on this task — call it $\mu$: the score it would average over thousands of runs. But you never get to see that number — you can only estimate it. Each actual run lands somewhere nearby:

$$x = \mu + \varepsilon$$

Here $x$ is what you observe, and $\varepsilon$ is the random noise. Crucially, this noise can go either way — positive or negative. Sometimes the agent gets lucky, sometimes it doesn't. On average the noise cancels out, but on any single run, it shifts your score.
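To see this in action, here's a toy simulation in plain Python. The true ability (0.78) and the noise level (0.15) are made-up numbers for illustration, not measurements of any real agent:

```python
# Toy simulation of "observed score = true ability + noise".
# MU and NOISE_SD are hypothetical values chosen for illustration.
import random

random.seed(0)
MU = 0.78          # hypothetical true ability on this task
NOISE_SD = 0.15    # hypothetical run-to-run noise

def run_once() -> float:
    """One eval run: the true ability plus symmetric random noise, clipped to [0, 1]."""
    return min(1.0, max(0.0, random.gauss(MU, NOISE_SD)))

few = [run_once() for _ in range(12)]       # a typical eval budget
many = [run_once() for _ in range(10_000)]  # what you could never afford

# With 12 runs the average can easily sit a few points away from mu;
# with 10,000 runs the noise largely cancels out.
print(sum(few) / len(few))
print(sum(many) / len(many))
```

Run it a few times with different seeds: the 12-run average jumps around, while the 10,000-run average barely moves — that's the noise cancelling out.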
A sample, not the whole picture
A real eval doesn't run one task — it runs many. But those tasks aren't all possible tasks you could test. They're a sample. A different set of tasks would give a different score — so your final number carries uncertainty from both which tasks you picked and which random outcomes the agent produced on them.
One agent's score — Agent Atlas
We ran Agent Atlas on travel-related tasks and obtained these scores:
| Q | Task | $x_i$ |
|---|---|---|
| Q1 | Book a round-trip flight to NYC under $500 | 0.83 |
| Q2 | Find a pet-friendly hotel in downtown Tokyo | 0.45 |
| Q3 | Build a 3-day itinerary for Rome with kids | 0.72 |
| Q4 | Compare train vs. flight for London → Paris | 0.69 |
| Q5 | Cancel and rebook a delayed connection | 0.91 |
| Q6 | Find restaurants near the Colosseum open Sunday | 0.30 |
| Q7 | Convert a multi-city trip into loyalty points | 0.67 |
| Q8 | Suggest packing list for a week in Iceland | 0.51 |
| Q9 | Find the cheapest rental car in Lisbon | 0.80 |
| Q10 | Book an airport transfer for a 6 AM arrival | 0.48 |
Remember: each time we run this eval, we'd get slightly different scores. Each score appears as a dot in the figure. Try it — hit the button below and watch the numbers and dots update:
Now let's describe this agent's performance statistically — and build toward a confidence interval.
The average
The most common way to summarize an eval: add up all the scores and divide by the number of tasks. That's the average — written as $\bar{x}$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{0.83 + 0.45 + \dots + 0.48}{10} = 0.636$$

When someone says “Atlas scored 63.6% on our eval,” this is the number they mean — the line marked $\bar{x}$ in the figure. It's the single most reported statistic in any benchmark result. Useful — but by itself, it hides something important.
Why the average isn't enough
The average looks like a solid number. But watch what happens in the figure when we rerun the exact same eval — same tasks, fresh randomness. The dots shift, the average moves. Every run gives Atlas a different score, even though the agent hasn't changed at all.
| Run | $\bar{x}$ |
|---|---|
| Run 1 | 0.636 |
This wobble is the uncertainty we need to measure. In theory you could run the eval hundreds of times and observe it directly. In practice, you might run it a handful of times — but never enough to pin it down precisely. Every run costs time, compute, and API calls.
What we really want is not just a single number, but a range — something that says “Atlas's true performance probably falls between X% and Y%.” In statistics, that's called a confidence interval. To calculate one, we need three building blocks, each building on the last:
- Variance — how spread out the individual scores are
- Standard deviation — the same thing, in readable units
- Standard error — how uncertain the average is
These are standard statistical formulas. Let's walk through them one at a time.
Variance
To measure how much the scores spread out, we need a single number that captures “how far are the scores from the average, on the whole?” That's variance: take each score's gap from the average, square it (so negatives don't cancel and big deviations count more), and average those squared gaps:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The figure shows each squared deviation as a box — bigger boxes mean scores that are further from the mean.
| Q | |||
|---|---|---|---|
| Q1 | 0.83 | +0.194 | 0.0376 |
| Q2 | 0.45 | -0.186 | 0.0346 |
| Q3 | 0.72 | +0.084 | 0.0071 |
| Q4 | 0.69 | +0.054 | 0.0029 |
| Q5 | 0.91 | +0.274 | 0.0751 |
| Q6 | 0.30 | -0.336 | 0.1129 |
| Q7 | 0.67 | +0.034 | 0.0012 |
| Q8 | 0.51 | -0.126 | 0.0159 |
| Q9 | 0.80 | +0.164 | 0.0269 |
| Q10 | 0.48 | -0.156 | 0.0243 |
| $s^2$ | | | 0.0376 |
We divide by $n-1$ instead of $n$ — this is called Bessel's correction. Because we measure deviations from the sample mean rather than the true mean, they come out systematically a little too small; dividing by $n-1$ corrects that bias. A mathematical formality — don't get stuck on it.
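A quick sketch of the calculation, using Atlas's scores from the table above and only the Python standard library:

```python
# Variance of Atlas's 10 scores, by hand and via the stdlib.
import statistics

scores = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
n = len(scores)
mean = sum(scores) / n   # 0.636

# Bessel's correction: divide the summed squared deviations by n - 1, not n.
var = sum((x - mean) ** 2 for x in scores) / (n - 1)

print(round(mean, 3))    # 0.636
print(round(var, 4))     # 0.0376
# statistics.variance already applies Bessel's correction:
print(round(statistics.variance(scores), 4))   # 0.0376
```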
Standard deviation
Variance tells us the spread, but it's in squared units — hard to interpret alongside your actual scores. The standard deviation (SD) is simply the square root of the variance, which puts it back on the same scale:

$$s = \sqrt{s^2} = \sqrt{0.0376} \approx 0.194$$
Think of SD as “on average, how far is a typical score from the mean?” For Atlas, that's about 19 percentage points. The shaded band in the figure shows one SD on either side of the mean — roughly two thirds of individual scores will fall inside that band. Scores outside it are unusually high or low for this agent.
Standard error
The SD showed how spread out individual scores are. But what we actually care about is: how much does the average wobble between runs? That's the standard error (SE):

$$\text{SE} = \frac{s}{\sqrt{n}} = \frac{0.194}{\sqrt{10}} \approx 0.061$$
Notice we divide by $\sqrt{n}$. The more tasks in your eval, the smaller the SE becomes — more data means less uncertainty in the average. Compare the SE band in the figure to the wider SD band from the previous step — it's noticeably tighter. With our 10 tasks, it's still fairly large. In a real eval you'd typically have more, but we're keeping the numbers manageable here.
SE describes your estimate — how uncertain the average is. It shrinks as you add more tasks, because each new data point helps cancel out noise.
Confidence interval
Now we can build the range we set out to find. By convention, most evaluations use a 95% confidence interval (you may also see this written as $\alpha = 0.05$). This means: if we repeated this eval 20 times, roughly 19 of those 20 confidence intervals would contain the agent's true performance.
To calculate it, we take the average and add or subtract 1.96 times the standard error. The interval goes in both directions because the true score could be higher or lower than what we observed:

$$\bar{x} \pm 1.96 \times \text{SE} = 0.636 \pm 1.96 \times 0.061 = [0.516,\ 0.756]$$
So Atlas scored 63.6%, but its true performance likely falls somewhere between 51.6% and 75.6%. That's a 24-percentage-point window — much less certain than a bare number suggests. This range is what should appear alongside every eval score.
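The whole Part 1 pipeline fits in a few lines of stdlib Python; the scores are Atlas's from the table above:

```python
# Mean, SD, SE, and 95% CI for Atlas's scores -- the full Part 1 pipeline.
import math
import statistics

scores = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
n = len(scores)

mean = statistics.mean(scores)    # 0.636
sd = statistics.stdev(scores)     # ~0.194 (sample SD, uses n - 1)
se = sd / math.sqrt(n)            # ~0.061
lo, hi = mean - 1.96 * se, mean + 1.96 * se

print(f"{mean:.3f} ± {1.96 * se:.3f}")    # 0.636 ± 0.120
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")    # 95% CI: [0.516, 0.756]
```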
Part 2 — Which agent should you ship?
Comparing two agents
We can now quantify the uncertainty in a single agent's score, as we did in Part 1. But the real question is: is one agent better than the other? Let's bring in Agent Breeze (B) and compare it against Atlas (A) on the same 10 tasks. Here are both agents' scores side by side:
| Q | Atlas | Breeze |
|---|---|---|
| Q1 | 0.83 | 0.89 |
| Q2 | 0.45 | 0.58 |
| Q3 | 0.72 | 0.74 |
| Q4 | 0.69 | 0.56 |
| Q5 | 0.91 | 0.94 |
| Q6 | 0.30 | 0.35 |
| Q7 | 0.67 | 0.80 |
| Q8 | 0.51 | 0.59 |
| Q9 | 0.80 | 0.88 |
| Q10 | 0.48 | 0.53 |
| Mean | 0.636 | 0.686 |
The chart plots both agents on the same axis — you can already see the scores overlap heavily.
When is one agent “significantly” better?
Breeze scored 68.6% vs Atlas's 63.6%. But both numbers have uncertainty. To decide if the gap is real, we first compute the difference in means:

$$\Delta = \bar{x}_A - \bar{x}_B = 0.636 - 0.686 = -0.050$$
If $\Delta$ is negative, B scored higher; if positive, A scored higher. But because both averages have noise, this difference has noise too — so we need a confidence interval around it. The question is whether that CI falls entirely on one side of zero:
- CI entirely below 0 → B is genuinely better (the gap favors B by more than noise can explain)
- CI entirely above 0 → A is genuinely better
- CI crosses 0 → could go either way — the difference might just be noise
This is the real value of a confidence interval: when it doesn't cross zero, rerunning the eval tomorrow with fresh data would be unlikely to flip your conclusion. It's not a guarantee — by construction, a 95% interval is still wrong about 1 time in 20 — but it's a principled bound on how much the noise can mislead you.
The question is how we compute that CI. Let's start with the most straightforward way.
The naive approach
The figure shows each agent's mean with its 95% confidence interval. Notice how the bars overlap — either agent's true score could be anywhere in its range.
When you subtract two uncertain numbers, the uncertainties don't cancel — they add up. Each agent's score has noise, and the gap inherits noise from both. If we treat the two uncertainties as independent (a key assumption we'll revisit), the combined SE is:

$$\text{SE}_\Delta = \sqrt{\text{SE}_A^2 + \text{SE}_B^2} = \sqrt{0.061^2 + 0.061^2} \approx 0.086$$

That's 8.6% — larger than either agent's individual SE (about 6.1% each). The observed difference is $\Delta = -0.050$. Building the 95% CI the same way as before:

$$-0.050 \pm 1.96 \times 0.086 = [-0.219,\ +0.119]$$
The range goes from -21.9% to +11.9%. It crosses zero — so by the rule we just established, this difference is not significant.
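Here's the naive comparison as stdlib Python, using both agents' scores from the table:

```python
# The naive (unpaired) comparison: treat the two SEs as independent.
import math
import statistics

atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]
n = len(atlas)

se_a = statistics.stdev(atlas) / math.sqrt(n)    # ~0.061
se_b = statistics.stdev(breeze) / math.sqrt(n)   # ~0.061
se_naive = math.sqrt(se_a**2 + se_b**2)          # ~0.086 -- uncertainties add

delta = statistics.mean(atlas) - statistics.mean(breeze)   # -0.050
lo, hi = delta - 1.96 * se_naive, delta + 1.96 * se_naive

print(f"[{lo:.3f}, {hi:.3f}]")   # [-0.219, 0.119]
print(lo < 0 < hi)               # True: the CI crosses zero, not significant
```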
But look at the questions
Before giving up, look at the raw scores task by task. Q6 was hard for both agents. Q5 was easy for both. They tend to agree on which tasks are difficult and which are easy.
This isn't a coincidence. As Miller (2024) points out, this is a common pattern with LLMs on benchmarks: agents tend to struggle on the same tasks and excel on the same tasks. The shared difficulty of the questions creates a correlation between the two agents' scores.
The naive test threw this correlation away by treating each agent's uncertainty independently. But this shared structure is exactly the information we can exploit to get a more precise estimate of the difference.
The difference trick
Instead of comparing the averages, compare the agents task by task. For each task, compute the difference:

$$d_i = x_{A,i} - x_{B,i}$$

Each $d_i$ tells you: on this specific task, how much better did Atlas do than Breeze? Negative means Breeze won that task.
| Q | Atlas | Breeze | Diff |
|---|---|---|---|
| Q1 | 0.83 | 0.89 | -0.06 |
| Q2 | 0.45 | 0.58 | -0.13 |
| Q3 | 0.72 | 0.74 | -0.02 |
| Q4 | 0.69 | 0.56 | +0.13 |
| Q5 | 0.91 | 0.94 | -0.03 |
| Q6 | 0.30 | 0.35 | -0.05 |
| Q7 | 0.67 | 0.80 | -0.13 |
| Q8 | 0.51 | 0.59 | -0.08 |
| Q9 | 0.80 | 0.88 | -0.08 |
| Q10 | 0.48 | 0.53 | -0.05 |
| $\bar{d}$ | | | -0.050 |
Notice in the figure how tightly clustered the $d_i$ values are compared to the original scores. The shared difficulty cancelled out during subtraction. What's left is the difference in agent ability, plus a small amount of noise.
Applying what we know
We now have 10 differences — one per task. This is just a dataset of 10 numbers, exactly like Atlas's 10 scores in Part 1. So we can walk through the same four steps: average, spread, standard error, confidence interval.
Step 1: The average difference. We already computed this: $\bar{d} = -0.050$. On average, Atlas scored 5.0 percentage points lower than Breeze.
Step 2: How spread out are the differences? The SD of the differences is only $s_d \approx 0.073$ — much smaller than the SD of either agent's raw scores (about 0.19). The shared difficulty cancelled out when we subtracted.
Step 3: How uncertain is the average difference?

$$\text{SE}_d = \frac{s_d}{\sqrt{n}} = \frac{0.073}{\sqrt{10}} \approx 0.0232$$

Compare that to the naive SE of 0.0864 — the paired SE is 73% smaller. Less noise means a tighter CI.
Step 4: Build the confidence interval.

$$-0.050 \pm 1.96 \times 0.0232 = [-0.095,\ -0.005]$$

Does it cross zero? No — the entire interval sits below zero.
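The four steps above can be sketched in a few lines, reusing the scores from the table:

```python
# The paired test: work with per-task differences d_i = atlas_i - breeze_i.
import math
import statistics

atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]

diffs = [a - b for a, b in zip(atlas, breeze)]
n = len(diffs)

d_bar = statistics.mean(diffs)                  # -0.050
se_d = statistics.stdev(diffs) / math.sqrt(n)   # ~0.023 -- shared noise is gone
lo, hi = d_bar - 1.96 * se_d, d_bar + 1.96 * se_d

print(f"[{lo:.3f}, {hi:.3f}]")   # [-0.095, -0.005]
print(hi < 0)                    # True: the whole CI is below zero
```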
Why did that work?
The paired test gave a tighter CI because subtracting per task removed the shared noise. But how much tighter, and why? To answer that, we need to understand how the two agents' scores relate — which brings us to covariance and correlation.
The scatter plot
Plot each task as a point: Atlas's score on the x-axis, Breeze's on the y-axis. Points above the diagonal mean B did better; below means A did better.
Notice how the points roughly follow the diagonal — both agents find the same tasks easy or hard. This correlation is the key to everything that follows.
From variance to covariance
Remember how we computed variance? We measured each score's deviation from the mean and squared it. That squaring is actually multiplying the deviation by itself. On the plot, both axes show the same variable — so every product is a perfect square.
Variance is just covariance of a variable with itself. Now watch what happens when we swap the y-axis to a different variable…
Covariance
The y-axis becomes Breeze. The same operation — multiply the deviations — but now using a different variable for each axis. The dots slide off the diagonal. The squares become rectangles. And crucially: some products are now negative (when the agents disagree on a task).
| Q | product | ||
|---|---|---|---|
| Q1 | +0.194 | +0.204 | +0.0396 |
| Q2 | -0.186 | -0.106 | +0.0197 |
| Q3 | +0.084 | +0.054 | +0.0045 |
| Q4 | +0.054 | -0.126 | -0.0068 |
| Q5 | +0.274 | +0.254 | +0.0696 |
| Q6 | -0.336 | -0.336 | +0.1129 |
| Q7 | +0.034 | +0.114 | +0.0039 |
| Q8 | -0.126 | -0.096 | +0.0121 |
| Q9 | +0.164 | +0.194 | +0.0318 |
| Q10 | -0.156 | -0.156 | +0.0243 |
| Cov | | | 0.0346 |
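The table's bottom line comes straight from the definition — average the paired products of deviations, with the same $n-1$ denominator as the variance:

```python
# Covariance of the two agents' scores: average product of paired deviations.
atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]
n = len(atlas)
mean_a = sum(atlas) / n    # 0.636
mean_b = sum(breeze) / n   # 0.686

# Multiply each task's deviation for A by its deviation for B, then
# average the products using the Bessel-corrected n - 1 denominator.
cov = sum((a - mean_a) * (b - mean_b)
          for a, b in zip(atlas, breeze)) / (n - 1)

print(round(cov, 4))   # 0.0346 -- positive: the agents move together
```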
Correlation
Covariance tells us the agents' scores move together, but its value depends on the scale of the scores. To get a number we can interpret universally, we normalize by the standard deviations. The result is the correlation coefficient $r$, which always falls between −1 and +1:

$$r = \frac{\text{Cov}(A, B)}{s_A \, s_B}$$
What does the scale mean? $r = +1$ means the agents move in perfect lockstep — when one scores high, so does the other. $r = 0$ means no relationship at all. $r = -1$ means they're perfectly opposite (rare in practice).
Our agents have $r \approx 0.93$ — a strong positive correlation, visible in the tight clustering along the diagonal of the scatter plot. They agree heavily on which tasks are easy and which are hard.
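Dividing the covariance by the two SDs gives the correlation; the scores are the ones from the table:

```python
# Correlation: covariance normalized by the two standard deviations.
import statistics

atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]
n = len(atlas)
mean_a, mean_b = statistics.mean(atlas), statistics.mean(breeze)

cov = sum((a - mean_a) * (b - mean_b)
          for a, b in zip(atlas, breeze)) / (n - 1)
r = cov / (statistics.stdev(atlas) * statistics.stdev(breeze))

print(round(r, 2))   # 0.93 -- strong agreement on task difficulty
```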
Why pairing works
The naive approach treats each agent's error independently. The total uncertainty is the sum of the two variances:

$$\text{SE}_\Delta^2 = \text{SE}_A^2 + \text{SE}_B^2$$
But we found that $r \approx 0.93$ — the agents strongly agree on which tasks are hard. That shared difficulty is double-counted in the sum above. The paired formula subtracts it out:

$$\text{SE}_d^2 = \text{SE}_A^2 + \text{SE}_B^2 - \frac{2\,\text{Cov}(A, B)}{n}$$
Plug in the numbers:

$$\text{SE}_d^2 = 0.00376 + 0.00370 - \frac{2 \times 0.0346}{10} = 0.00746 - 0.00692 = 0.00054$$
The covariance term wipes out 93% of the total variance — gone, because the shared task difficulty accounted for it. What remains:

$$\text{SE}_d = \sqrt{0.00054} \approx 0.0232$$
That's the same 0.0232 we got earlier by computing directly from the differences. Two completely different paths — one through the raw values, one through the correlation — same answer. The correlation formula just explains why the differences had such a small SE: the shared difficulty cancelled out.
Down from 0.0864 to 0.0232 — a 73% reduction in standard error. The figure puts both CIs side by side — the paired CI shrinks from spanning both sides of zero to landing entirely on one side:
| Method | SE | 95% CI | Sig? |
|---|---|---|---|
| Unpaired | 0.0864 | [-21.9%, 11.9%] | No |
| Paired | 0.0232 | [-9.5%, -0.5%] | Yes |
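As a sanity check, both roads to the paired SE can be computed side by side — directly from the differences, and via the covariance formula. They should agree to floating-point precision:

```python
# Two roads to the same paired SE: directly from the differences,
# and via the formula SE_d^2 = SE_A^2 + SE_B^2 - 2*Cov/n.
import math
import statistics

atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]
n = len(atlas)
mean_a, mean_b = statistics.mean(atlas), statistics.mean(breeze)

# Road 1: SE of the per-task differences.
diffs = [a - b for a, b in zip(atlas, breeze)]
se_direct = statistics.stdev(diffs) / math.sqrt(n)

# Road 2: the two agents' SEs minus twice the covariance term.
se_a = statistics.stdev(atlas) / math.sqrt(n)
se_b = statistics.stdev(breeze) / math.sqrt(n)
cov = sum((a - mean_a) * (b - mean_b)
          for a, b in zip(atlas, breeze)) / (n - 1)
se_formula = math.sqrt(se_a**2 + se_b**2 - 2 * cov / n)

print(round(se_direct, 4), round(se_formula, 4))   # 0.0232 0.0232
```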
The punchline
The same data, analyzed two different ways, gave two different answers. The naive approach said we couldn't tell Atlas and Breeze apart. The paired approach — which exploits the fact that both agents face the same tasks — revealed a clear winner.
This isn't a contrived example. Any time you compare two setups on the same eval set, a paired comparison will give you a more precise answer than treating them independently.
What to do in practice
- Always report error bars. A bare score like “74%” is meaningless without knowing how much it would wobble on a different run.
- Use paired comparisons when your setups are tested on the same tasks. The shared difficulty is free precision — don't throw it away.
- Check if the CI crosses zero. That's the only question that matters for deciding whether a difference is real.
You started this guide staring at a score that went from 71% to 74% with no way to know if it meant anything. Now you have the framework: measure the uncertainty, build a confidence interval, and — when comparing two setups — use a paired test to get the most precise answer the data can give you.
For implementation details, see Miller (2024) and Anthropic's eval guide. A future post will cover the other half of the problem: designing the eval itself — what to measure, how to build task sets, and how to connect eval results to product decisions.
If you found this useful, have questions, or spotted something that could be clearer — send me a DM on X. And if you think someone on your team would benefit from this, share it: