Is That Improvement Real? A visual guide to eval statistics
You’re building an agent — tweaking the prompt, swapping tools, adjusting the harness. You run your eval and the score goes from 71% to 74%. That looks like progress — but without the right analysis, you can’t tell whether that 3-point jump is a real improvement or just a lucky draw.
If you’re making product decisions based on eval numbers, you need to know how much of that score is signal and how much is noise. And when you’re comparing two setups, the way you handle that uncertainty can flip the conclusion entirely.
This interactive guide is a companion to Anthropic’s eval guide. We walk through the statistics behind comparing two agents on a concrete example — with less assumed stats background than the original paper.
A future post will cover eval design itself — what to measure, how to build question sets, and how to tie evals to product decisions. This one focuses on the statistics.
The problem
Let's say you're comparing two travel planning agents — Atlas and Breeze. You run your eval suite and see the results on the right: a table of average scores across three benchmarks. This is how performance is typically reported — whether in a research paper, a blog post, or on your internal dashboard.
Breeze wins two out of three. Should you switch? The problem: every time you run these evals, you get slightly different numbers. Individual scores and averages shift between runs. A table like this doesn't tell you whether the difference is real — or just noise from this particular run.
We'll tackle this in two parts. First, we'll learn how to quantify the uncertainty in a single agent's score — turning a bare number into a range. Then we'll use that to compare two agents and figure out if the difference between them is real.
Part 1 — Measuring one agent
One task, many outcomes
When we talk about an eval score in this guide, we mean a number between 0 and 1 — where 0 is a total fail and 1 is a perfect pass. We write it as $x_i$, where the subscript $i$ refers to the task number — so $x_1$ is the score on task 1, $x_2$ on task 2, and so on.
This number is not absolute. LLMs always have some randomness to them — sampling temperature, non-deterministic decoding — and on agentic tasks with long multi-step trajectories, that randomness compounds. A slightly different first step can lead to a completely different path. So the same agent on the same task can produce a different score each time.
Take the task “Find the cheapest round-trip flight to NYC under $500.” We run it 12 times; each dot in the figure is one run. The agent has some true ability on this task — call it $\mu$: the score it would average over thousands of runs. But you never get to see that number — you can only estimate it. Each actual run lands somewhere nearby:

$$x = \mu + \varepsilon$$

Here $x$ is what you observe, and $\varepsilon$ is the random noise. Crucially, this noise can go either way — positive or negative. Sometimes the agent gets lucky, sometimes it doesn't. On average the noise cancels out, but on any single run, it shifts your score.
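To see this in action, here's a toy simulation in plain Python. The true ability (0.78) and the noise level (0.15) are made-up numbers for illustration, not measurements of any real agent:

```python
# Toy simulation of "observed score = true ability + noise".
# MU and NOISE_SD are hypothetical values chosen for illustration.
import random

random.seed(0)
MU = 0.78          # hypothetical true ability on this task
NOISE_SD = 0.15    # hypothetical run-to-run noise

def run_once() -> float:
    """One eval run: the true ability plus symmetric random noise, clipped to [0, 1]."""
    return min(1.0, max(0.0, random.gauss(MU, NOISE_SD)))

few = [run_once() for _ in range(12)]       # a typical eval budget
many = [run_once() for _ in range(10_000)]  # what you could never afford

# With 12 runs the average can easily sit a few points away from mu;
# with 10,000 runs the noise largely cancels out.
print(sum(few) / len(few))
print(sum(many) / len(many))
```

Run it a few times with different seeds: the 12-run average jumps around, while the 10,000-run average barely moves — that's the noise cancelling out.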
A sample, not the whole picture
A real eval doesn't run one task — it runs many. But those tasks aren't all possible tasks you could test. They're a sample. A different set of tasks would give a different score — so your final number carries uncertainty from both which tasks you picked and which random outcomes the agent produced on them.
One agent's score — Agent Atlas
We ran Agent Atlas on travel-related tasks and obtained these scores:
| Q | Task | $x_i$ |
|---|---|---|
| Q1 | Book a round-trip flight to NYC under $500 | 0.83 |
| Q2 | Find a pet-friendly hotel in downtown Tokyo | 0.45 |
| Q3 | Build a 3-day itinerary for Rome with kids | 0.72 |
| Q4 | Compare train vs. flight for London → Paris | 0.69 |
| Q5 | Cancel and rebook a delayed connection | 0.91 |
| Q6 | Find restaurants near the Colosseum open Sunday | 0.30 |
| Q7 | Convert a multi-city trip into loyalty points | 0.67 |
| Q8 | Suggest packing list for a week in Iceland | 0.51 |
| Q9 | Find the cheapest rental car in Lisbon | 0.80 |
| Q10 | Book an airport transfer for a 6 AM arrival | 0.48 |
Remember: each time we run this eval, we'd get slightly different scores. Each score appears as a dot in the figure. Try it — hit the button below and watch the numbers and dots update:
Now let's describe this agent's performance statistically — and build toward a confidence interval.
The average
The most common way to summarize an eval: add up all the scores and divide by the number of tasks. That's the average — written as $\bar{x}$:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{0.83 + 0.45 + \dots + 0.48}{10} = 0.636$$

When someone says “Atlas scored 63.6% on our eval,” this is the number they mean — the line marked $\bar{x}$ in the figure. It's the single most reported statistic in any benchmark result. Useful — but by itself, it hides something important.
Why the average isn't enough
The average looks like a solid number. But watch what happens in the figure when we rerun the exact same eval — same tasks, fresh randomness. The dots shift, the average moves. Every run gives Atlas a different score, even though the agent hasn't changed at all.
| Run | $\bar{x}$ |
|---|---|
| Run 1 | 0.636 |
This wobble is the uncertainty we need to measure. In theory you could run the eval hundreds of times and observe it directly. In practice, you might run it a handful of times — but never enough to pin it down precisely. Every run costs time, compute, and API calls.
What we really want is not just a single number, but a range — something that says “Atlas's true performance probably falls between X% and Y%.” In statistics, that's called a confidence interval. To calculate one, we need three building blocks, each building on the last:
- Variance — how spread out the individual scores are
- Standard deviation — the same thing, in readable units
- Standard error — how uncertain the average is
These are standard statistical formulas. Let's walk through them one at a time.
Variance
To measure how much the scores spread out, we need a single number that captures “how far are the scores from the average, on the whole?” That's variance: take each score's gap from the average, square it (so negatives don't cancel and big deviations count more), and average those squared gaps:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The figure shows each squared deviation as a box — bigger boxes mean scores that are further from the mean.
| Q | |||
|---|---|---|---|
| Q1 | 0.83 | +0.194 | 0.0376 |
| Q2 | 0.45 | -0.186 | 0.0346 |
| Q3 | 0.72 | +0.084 | 0.0071 |
| Q4 | 0.69 | +0.054 | 0.0029 |
| Q5 | 0.91 | +0.274 | 0.0751 |
| Q6 | 0.30 | -0.336 | 0.1129 |
| Q7 | 0.67 | +0.034 | 0.0012 |
| Q8 | 0.51 | -0.126 | 0.0159 |
| Q9 | 0.80 | +0.164 | 0.0269 |
| Q10 | 0.48 | -0.156 | 0.0243 |
| $s^2$ | | | 0.0376 |
We divide by $n-1$ instead of $n$ — this is called Bessel's correction. Because we measure deviations from the sample mean rather than the true mean, they come out systematically a little too small; dividing by $n-1$ corrects that bias. A mathematical formality — don't get stuck on it.
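A quick sketch of the calculation, using Atlas's scores from the table above and only the Python standard library:

```python
# Variance of Atlas's 10 scores, by hand and via the stdlib.
import statistics

scores = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
n = len(scores)
mean = sum(scores) / n   # 0.636

# Bessel's correction: divide the summed squared deviations by n - 1, not n.
var = sum((x - mean) ** 2 for x in scores) / (n - 1)

print(round(mean, 3))    # 0.636
print(round(var, 4))     # 0.0376
# statistics.variance already applies Bessel's correction:
print(round(statistics.variance(scores), 4))   # 0.0376
```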
Standard deviation
Variance tells us the spread, but it's in squared units — hard to interpret alongside your actual scores. The standard deviation (SD) is simply the square root of the variance, which puts it back on the same scale:

$$s = \sqrt{s^2} = \sqrt{0.0376} \approx 0.194$$
Think of SD as “on average, how far is a typical score from the mean?” For Atlas, that's about 19 percentage points. The shaded band in the figure shows one SD on either side of the mean — roughly two thirds of individual scores will fall inside that band. Scores outside it are unusually high or low for this agent.
Standard error
The SD showed how spread out individual scores are. But what we actually care about is: how much does the average wobble between runs? That's the standard error (SE):

$$\text{SE} = \frac{s}{\sqrt{n}} = \frac{0.194}{\sqrt{10}} \approx 0.061$$
Notice we divide by $\sqrt{n}$. The more tasks in your eval, the smaller the SE becomes — more data means less uncertainty in the average. Compare the SE band in the figure to the wider SD band from the previous step — it's noticeably tighter. With our 10 tasks, it's still fairly large. In a real eval you'd typically have more, but we're keeping the numbers manageable here.
SE describes your estimate — how uncertain the average is. It shrinks as you add more tasks, because each new data point helps cancel out noise.
Confidence interval
Now we can build the range we set out to find. By convention, most evaluations use a 95% confidence interval (you may also see this written as $\alpha = 0.05$). This means: if we repeated this eval 20 times, roughly 19 of those 20 confidence intervals would contain the agent's true performance.
To calculate it, we take the average and add or subtract 1.96 times the standard error. The interval goes in both directions because the true score could be higher or lower than what we observed:

$$\bar{x} \pm 1.96 \times \text{SE} = 0.636 \pm 1.96 \times 0.061 = [0.516,\ 0.756]$$
So Atlas scored 63.6%, but its true performance likely falls somewhere between 51.6% and 75.6%. That's a 24-percentage-point window — much less certain than a bare number suggests. This range is what should appear alongside every eval score.
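The whole Part 1 pipeline fits in a few lines of stdlib Python; the scores are Atlas's from the table above:

```python
# Mean, SD, SE, and 95% CI for Atlas's scores -- the full Part 1 pipeline.
import math
import statistics

scores = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
n = len(scores)

mean = statistics.mean(scores)    # 0.636
sd = statistics.stdev(scores)     # ~0.194 (sample SD, uses n - 1)
se = sd / math.sqrt(n)            # ~0.061
lo, hi = mean - 1.96 * se, mean + 1.96 * se

print(f"{mean:.3f} ± {1.96 * se:.3f}")    # 0.636 ± 0.120
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")    # 95% CI: [0.516, 0.756]
```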
Part 2 — Which agent should you ship?
Comparing two agents
We can now quantify the uncertainty in a single agent's score, as we did in Part 1. But the real question is: is one agent better than the other? Let's bring in Agent Breeze (B) and compare it against Atlas (A) on the same 10 tasks. Here are both agents' scores side by side:
| Q | Atlas | Breeze |
|---|---|---|
| Q1 | 0.83 | 0.89 |
| Q2 | 0.45 | 0.58 |
| Q3 | 0.72 | 0.74 |
| Q4 | 0.69 | 0.56 |
| Q5 | 0.91 | 0.94 |
| Q6 | 0.30 | 0.35 |
| Q7 | 0.67 | 0.80 |
| Q8 | 0.51 | 0.59 |
| Q9 | 0.80 | 0.88 |
| Q10 | 0.48 | 0.53 |
| Mean | 0.636 | 0.686 |
The chart plots both agents on the same axis — you can already see the scores overlap heavily.
When is one agent “significantly” better?
Breeze scored 68.6% vs Atlas's 63.6%. But both numbers have uncertainty. To decide if the gap is real, we first compute the difference in means:

$$\Delta = \bar{x}_A - \bar{x}_B = 0.636 - 0.686 = -0.050$$
If $\Delta$ is negative, B scored higher; if positive, A scored higher. But because both averages have noise, this difference has noise too — so we need a confidence interval around it. The question is whether that CI falls entirely on one side of zero:
- CI entirely below 0 → B is genuinely better (the gap favors B by more than noise can explain)
- CI entirely above 0 → A is genuinely better
- CI crosses 0 → could go either way — the difference might just be noise
This is the real value of a confidence interval: when it doesn't cross zero, rerunning the eval tomorrow with fresh data would be unlikely to flip your conclusion. It's not a guarantee — by construction, a 95% interval is still wrong about 1 time in 20 — but it's a principled bound on how much the noise can mislead you.
The question is how we compute that CI. Let's start with the most straightforward way.
The naive approach
The figure shows each agent's mean with its 95% confidence interval. Notice how the bars overlap — either agent's true score could be anywhere in its range.
When you subtract two uncertain numbers, the uncertainties don't cancel — they add up. Each agent's score has noise, and the gap inherits noise from both. If we treat the two uncertainties as independent (a key assumption we'll revisit), the combined SE is:

$$\text{SE}_\Delta = \sqrt{\text{SE}_A^2 + \text{SE}_B^2} = \sqrt{0.061^2 + 0.061^2} \approx 0.086$$

That's 8.6% — larger than either agent's individual SE (about 6.1% each). The observed difference is $\Delta = -0.050$. Building the 95% CI the same way as before:

$$-0.050 \pm 1.96 \times 0.086 = [-0.219,\ +0.119]$$
The range goes from -21.9% to +11.9%. It crosses zero — so by the rule we just established, this difference is not significant.
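Here's the naive comparison as stdlib Python, using both agents' scores from the table:

```python
# The naive (unpaired) comparison: treat the two SEs as independent.
import math
import statistics

atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]
n = len(atlas)

se_a = statistics.stdev(atlas) / math.sqrt(n)    # ~0.061
se_b = statistics.stdev(breeze) / math.sqrt(n)   # ~0.061
se_naive = math.sqrt(se_a**2 + se_b**2)          # ~0.086 -- uncertainties add

delta = statistics.mean(atlas) - statistics.mean(breeze)   # -0.050
lo, hi = delta - 1.96 * se_naive, delta + 1.96 * se_naive

print(f"[{lo:.3f}, {hi:.3f}]")   # [-0.219, 0.119]
print(lo < 0 < hi)               # True: the CI crosses zero, not significant
```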
But look at the questions
Before giving up, look at the raw scores task by task. Q6 was hard for both agents. Q5 was easy for both. They tend to agree on which tasks are difficult and which are easy.
This isn't a coincidence. As Miller (2024) points out, this is a common pattern with LLMs on benchmarks: agents tend to struggle on the same tasks and excel on the same tasks. The shared difficulty of the questions creates a correlation between the two agents' scores.
The naive test threw this correlation away by treating each agent's uncertainty independently. But this shared structure is exactly the information we can exploit to get a more precise estimate of the difference.
The difference trick
Instead of comparing the averages, compare the agents task by task. For each task, compute the difference:

$$d_i = x_{A,i} - x_{B,i}$$

Each $d_i$ tells you: on this specific task, how much better did Atlas do than Breeze? Negative means Breeze won that task.
| Q | Atlas | Breeze | Diff |
|---|---|---|---|
| Q1 | 0.83 | 0.89 | -0.06 |
| Q2 | 0.45 | 0.58 | -0.13 |
| Q3 | 0.72 | 0.74 | -0.02 |
| Q4 | 0.69 | 0.56 | +0.13 |
| Q5 | 0.91 | 0.94 | -0.03 |
| Q6 | 0.30 | 0.35 | -0.05 |
| Q7 | 0.67 | 0.80 | -0.13 |
| Q8 | 0.51 | 0.59 | -0.08 |
| Q9 | 0.80 | 0.88 | -0.08 |
| Q10 | 0.48 | 0.53 | -0.05 |
| $\bar{d}$ | | | -0.050 |
Notice in the figure how tightly clustered the $d_i$ values are compared to the original scores. The shared difficulty cancelled out during subtraction. What's left is the difference in agent ability, plus a small amount of noise.
Applying what we know
We now have 10 differences — one per task. This is just a dataset of 10 numbers, exactly like Atlas's 10 scores in Part 1. So we can walk through the same four steps: average, spread, standard error, confidence interval.
Step 1: The average difference. We already computed this: $\bar{d} = -0.050$. On average, Atlas scored 5.0 percentage points lower than Breeze.
Step 2: How spread out are the differences? The SD of the differences is only $s_d \approx 0.073$ — much smaller than the SD of either agent's raw scores (about 0.19). The shared difficulty cancelled out when we subtracted.
Step 3: How uncertain is the average difference?

$$\text{SE}_d = \frac{s_d}{\sqrt{n}} = \frac{0.073}{\sqrt{10}} \approx 0.0232$$

Compare that to the naive SE of 0.0864 — the paired SE is 73% smaller. Less noise means a tighter CI.
Step 4: Build the confidence interval.

$$-0.050 \pm 1.96 \times 0.0232 = [-0.095,\ -0.005]$$

Does it cross zero? No — the entire interval sits below zero.
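The four steps above can be sketched in a few lines, reusing the scores from the table:

```python
# The paired test: work with per-task differences d_i = atlas_i - breeze_i.
import math
import statistics

atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]

diffs = [a - b for a, b in zip(atlas, breeze)]
n = len(diffs)

d_bar = statistics.mean(diffs)                  # -0.050
se_d = statistics.stdev(diffs) / math.sqrt(n)   # ~0.023 -- shared noise is gone
lo, hi = d_bar - 1.96 * se_d, d_bar + 1.96 * se_d

print(f"[{lo:.3f}, {hi:.3f}]")   # [-0.095, -0.005]
print(hi < 0)                    # True: the whole CI is below zero
```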
Why did that work?
The paired test gave a tighter CI because subtracting per task removed the shared noise. But how much tighter, and why? To answer that, we need to understand how the two agents' scores relate — which brings us to covariance and correlation.
The scatter plot
Plot each task as a point: Atlas's score on the x-axis, Breeze's on the y-axis. Points above the diagonal mean B did better; below means A did better.
Notice how the points roughly follow the diagonal — both agents find the same tasks easy or hard. This correlation is the key to everything that follows.
From variance to covariance
Remember how we computed variance? We measured each score's deviation from the mean and squared it. That squaring is actually multiplying the deviation by itself. On the plot, both axes show the same variable — so every product is a perfect square.
Variance is just covariance of a variable with itself. Now watch what happens when we swap the y-axis to a different variable…
Covariance
The y-axis becomes Breeze. The same operation — multiply the deviations — but now using a different variable for each axis. The dots slide off the diagonal. The squares become rectangles. And crucially: some products are now negative (when the agents disagree on a task).
| Q | product | ||
|---|---|---|---|
| Q1 | +0.194 | +0.204 | +0.0396 |
| Q2 | -0.186 | -0.106 | +0.0197 |
| Q3 | +0.084 | +0.054 | +0.0045 |
| Q4 | +0.054 | -0.126 | -0.0068 |
| Q5 | +0.274 | +0.254 | +0.0696 |
| Q6 | -0.336 | -0.336 | +0.1129 |
| Q7 | +0.034 | +0.114 | +0.0039 |
| Q8 | -0.126 | -0.096 | +0.0121 |
| Q9 | +0.164 | +0.194 | +0.0318 |
| Q10 | -0.156 | -0.156 | +0.0243 |
| Cov | | | 0.0346 |
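The table's bottom line comes straight from the definition — average the paired products of deviations, with the same $n-1$ denominator as the variance:

```python
# Covariance of the two agents' scores: average product of paired deviations.
atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]
n = len(atlas)
mean_a = sum(atlas) / n    # 0.636
mean_b = sum(breeze) / n   # 0.686

# Multiply each task's deviation for A by its deviation for B, then
# average the products using the Bessel-corrected n - 1 denominator.
cov = sum((a - mean_a) * (b - mean_b)
          for a, b in zip(atlas, breeze)) / (n - 1)

print(round(cov, 4))   # 0.0346 -- positive: the agents move together
```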
Correlation
Covariance tells us the agents' scores move together, but its value depends on the scale of the scores. To get a number we can interpret universally, we normalize by the standard deviations. The result is the correlation coefficient $r$, which always falls between −1 and +1:

$$r = \frac{\text{Cov}(A, B)}{s_A \, s_B}$$
What does the scale mean? $r = +1$ means the agents move in perfect lockstep — when one scores high, so does the other. $r = 0$ means no relationship at all. $r = -1$ means they're perfectly opposite (rare in practice).
Our agents have $r \approx 0.93$ — a strong positive correlation, visible in the tight clustering along the diagonal of the scatter plot. They agree heavily on which tasks are easy and which are hard.
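Dividing the covariance by the two SDs gives the correlation; the scores are the ones from the table:

```python
# Correlation: covariance normalized by the two standard deviations.
import statistics

atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]
n = len(atlas)
mean_a, mean_b = statistics.mean(atlas), statistics.mean(breeze)

cov = sum((a - mean_a) * (b - mean_b)
          for a, b in zip(atlas, breeze)) / (n - 1)
r = cov / (statistics.stdev(atlas) * statistics.stdev(breeze))

print(round(r, 2))   # 0.93 -- strong agreement on task difficulty
```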
Why pairing works
The naive approach treats each agent's error independently. The total uncertainty is the sum of the two variances:

$$\text{SE}_\Delta^2 = \text{SE}_A^2 + \text{SE}_B^2$$
But we found that $r \approx 0.93$ — the agents strongly agree on which tasks are hard. That shared difficulty is double-counted in the sum above. The paired formula subtracts it out:

$$\text{SE}_d^2 = \text{SE}_A^2 + \text{SE}_B^2 - \frac{2\,\text{Cov}(A, B)}{n}$$
Plug in the numbers:

$$\text{SE}_d^2 = 0.00376 + 0.00370 - \frac{2 \times 0.0346}{10} = 0.00746 - 0.00692 = 0.00054$$
The covariance term wipes out 93% of the total variance — gone, because the shared task difficulty accounted for it. What remains:

$$\text{SE}_d = \sqrt{0.00054} \approx 0.0232$$
That's the same 0.0232 we got earlier by computing directly from the differences. Two completely different paths — one through the raw values, one through the correlation — same answer. The correlation formula just explains why the differences had such a small SE: the shared difficulty cancelled out.
Down from 0.0864 to 0.0232 — a 73% reduction in standard error. The figure puts both CIs side by side — the paired CI shrinks from spanning both sides of zero to landing entirely on one side:
| Method | SE | 95% CI | Sig? |
|---|---|---|---|
| Unpaired | 0.0864 | [-21.9%, 11.9%] | No |
| Paired | 0.0232 | [-9.5%, -0.5%] | Yes |
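As a sanity check, both roads to the paired SE can be computed side by side — directly from the differences, and via the covariance formula. They should agree to floating-point precision:

```python
# Two roads to the same paired SE: directly from the differences,
# and via the formula SE_d^2 = SE_A^2 + SE_B^2 - 2*Cov/n.
import math
import statistics

atlas  = [0.83, 0.45, 0.72, 0.69, 0.91, 0.30, 0.67, 0.51, 0.80, 0.48]
breeze = [0.89, 0.58, 0.74, 0.56, 0.94, 0.35, 0.80, 0.59, 0.88, 0.53]
n = len(atlas)
mean_a, mean_b = statistics.mean(atlas), statistics.mean(breeze)

# Road 1: SE of the per-task differences.
diffs = [a - b for a, b in zip(atlas, breeze)]
se_direct = statistics.stdev(diffs) / math.sqrt(n)

# Road 2: the two agents' SEs minus twice the covariance term.
se_a = statistics.stdev(atlas) / math.sqrt(n)
se_b = statistics.stdev(breeze) / math.sqrt(n)
cov = sum((a - mean_a) * (b - mean_b)
          for a, b in zip(atlas, breeze)) / (n - 1)
se_formula = math.sqrt(se_a**2 + se_b**2 - 2 * cov / n)

print(round(se_direct, 4), round(se_formula, 4))   # 0.0232 0.0232
```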
The punchline
The same data, analyzed two different ways, gave two different answers. The naive approach said we couldn't tell Atlas and Breeze apart. The paired approach — which exploits the fact that both agents face the same tasks — revealed a clear winner.
This isn't a contrived example. Any time you compare two setups on the same eval set, a paired comparison will give you a more precise answer than treating them independently.
What to do in practice
- Always report error bars. A bare score like “74%” is meaningless without knowing how much it would wobble on a different run.
- Use paired comparisons when your setups are tested on the same tasks. The shared difficulty is free precision — don't throw it away.
- Check if the CI crosses zero. That's the only question that matters for deciding whether a difference is real.
You started this guide staring at a score that went from 71% to 74% with no way to know if it meant anything. Now you have the framework: measure the uncertainty, build a confidence interval, and — when comparing two setups — use a paired test to get the most precise answer the data can give you.
For implementation details, see Miller (2024) and Anthropic's eval guide. A future post will cover the other half of the problem: designing the eval itself — what to measure, how to build task sets, and how to connect eval results to product decisions.
If you found this useful, have questions, or spotted something that could be clearer — send me a DM on X. And if you think someone on your team would benefit from this, share it: