Analyzing JMeter Results: Why Percentiles Beat Averages

How to properly analyze JMeter result data using percentiles instead of averages, with a worked example showing how averages hide real problems.

· By perf-test.com Editorial · AI-assisted
jmeter, analysis, percentiles

A JMeter Aggregate Report shows average, min, max, and percentile columns side by side — and the average is usually the least useful number on the page, despite being the one people quote first.

A worked example

Imagine 100 requests: 95 complete in 200ms, and 5 complete in 10 seconds (a downstream timeout-and-retry path). The average is (95×200 + 5×10000) / 100 = 690ms, a number that doesn’t describe either group of users, since nobody actually waited 690ms. The 95th percentile is 200ms, correctly showing that most users are fine. The 99th percentile lands in the slow group at 10 seconds, correctly flagging that the tail (here the slowest 5% of requests) has a serious problem the average completely hides.
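The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch that assumes the nearest-rank definition of a percentile; JMeter and other tools may use a slightly different method, so their figures can differ at the margins. The helper name `percentile` is illustrative, not part of any JMeter API.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# 95 fast requests at 200 ms and 5 slow requests at 10 s (all values in milliseconds)
elapsed_ms = [200] * 95 + [10_000] * 5

avg = mean(elapsed_ms)
p95 = percentile(elapsed_ms, 95)
p99 = percentile(elapsed_ms, 99)
print(f"avg={avg} ms  p95={p95} ms  p99={p99} ms")
```

With these inputs the sketch reports an average of 690 ms, a p95 of 200 ms, and a p99 of 10,000 ms, matching the figures in the example.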

Why this matters more at scale

At low request volume, a handful of slow outliers might genuinely be noise. At production scale, with millions of requests, a 1% tail-latency problem means tens of thousands of slow requests, each one experienced by a real, frustrated user, even while the average looks perfectly healthy. This is precisely why SRE practice defines SLOs in percentile terms (“99% of requests under 300ms”), not average terms. See this site’s article on SLOs and error budgets for how that connects to actual operational decisions.
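A percentile-style SLO such as “99% of requests under 300ms” can be checked directly against raw samples. The sketch below is illustrative only: the function name `meets_slo` and the sample data are assumptions, and it treats “under” as “at or under” the threshold.

```python
def meets_slo(elapsed_ms, threshold_ms, target_fraction):
    """Return (ok, fraction), where fraction is the share of samples at or under threshold_ms."""
    if not elapsed_ms:
        raise ValueError("no samples to evaluate")
    within = sum(1 for ms in elapsed_ms if ms <= threshold_ms)
    fraction = within / len(elapsed_ms)
    return fraction >= target_fraction, fraction

# Hypothetical data: 95 requests at 200 ms, 5 requests at 10 s
samples = [200] * 95 + [10_000] * 5
ok, fraction = meets_slo(samples, threshold_ms=300, target_fraction=0.99)
print(f"SLO met: {ok} ({fraction:.1%} of requests at or under 300 ms)")
```

Here only 95% of requests fall within the threshold, so the SLO is missed even though a 690 ms average might look tolerable on its own.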

Reading JMeter’s percentile columns correctly

The Aggregate Report’s 90%, 95%, and 99% Line columns are calculated per sampler/transaction across the full run, including any ramp-up period before the system reached steady state, unless you’ve excluded it. A percentile calculated across a run that includes ramp-up will be polluted by artificially slow early samples (connections still warming up, JIT not yet optimized) that don’t represent steady-state behavior. For analysis that matters, filter the raw .jtl data to exclude the ramp-up window before computing percentiles yourself, rather than trusting the report’s full-run figure uncritically.
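One way to do that filtering is sketched below. It assumes the .jtl file was saved as CSV with a header row containing the `timeStamp` (epoch milliseconds), `elapsed` (milliseconds), and `label` columns, and that ramp-up is a fixed window measured from the earliest sample. The function name, the 60-second window, and the in-memory demo data are hypothetical, and the nearest-rank percentile may not match JMeter’s own calculation exactly.

```python
import csv
import io
import math
from collections import defaultdict

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def steady_state_percentiles(jtl_file, ramp_up_ms, percentiles=(90, 95, 99)):
    """Per-label percentiles of 'elapsed', skipping samples inside the ramp-up window."""
    rows = list(csv.DictReader(jtl_file))
    if not rows:
        return {}
    run_start = min(int(row["timeStamp"]) for row in rows)
    by_label = defaultdict(list)
    for row in rows:
        # Keep only samples recorded after the ramp-up window has passed
        if int(row["timeStamp"]) - run_start >= ramp_up_ms:
            by_label[row["label"]].append(int(row["elapsed"]))
    return {
        label: {p: percentile(values, p) for p in percentiles}
        for label, values in by_label.items()
    }

# Tiny in-memory stand-in for a real results file (assumed column subset)
demo = io.StringIO(
    "timeStamp,elapsed,label\n"
    "1000,900,GET /home\n"
    "2000,850,GET /home\n"
    "61000,210,GET /home\n"
    "62000,190,GET /home\n"
    "63000,205,GET /home\n"
)
result = steady_state_percentiles(demo, ramp_up_ms=60_000)
print(result)
```

For a real run, the `io.StringIO` stand-in would be replaced with something like `open("results.jtl", newline="")`, where the file name is whatever your test plan wrote. Reading all rows into memory keeps the sketch short; very large result files may call for a streaming approach instead.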

Percentile-of-percentiles is a trap in distributed testing

If you run distributed across multiple load generator engines and naively average each engine’s own 95th percentile together, you get a mathematically meaningless number. A percentile is a rank within one sorted dataset, so per-engine percentiles cannot be averaged into a global one, especially when engines contribute different sample counts or see different latency distributions. The correct approach is combining the raw per-sample data from all engines first, then computing the percentile once across the full combined dataset.
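The difference is easy to demonstrate with made-up numbers. In this sketch the two engines and their sample counts are hypothetical, and the percentile uses the nearest-rank definition.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical engines with unequal sample counts (values in milliseconds)
engine_a = [100] * 900                 # 900 samples, all fast
engine_b = [100] * 80 + [2000] * 20    # 100 samples, 20% slow

# Naive approach: average the per-engine p95 values
averaged_p95 = mean([percentile(engine_a, 95), percentile(engine_b, 95)])

# Correct approach: merge the raw samples, then compute p95 once
combined_p95 = percentile(engine_a + engine_b, 95)

print(f"average of per-engine p95: {averaged_p95} ms")
print(f"p95 of combined samples:   {combined_p95} ms")
```

The averaged figure comes out at 1050 ms, a latency no request in either dataset experienced, while the p95 of the merged samples is 100 ms because the slow requests make up only 2% of the combined total.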

Visualizing the distribution, not just one number

A single percentile value, however correctly computed, still flattens a distribution into one point. Where possible, look at (or build) a histogram or response-time-over-time scatter plot of the raw data — it often reveals bimodal behavior (two distinct populations of fast and slow requests, perhaps a cache-hit/cache-miss split) that no single percentile number communicates on its own.
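When no plotting tool is at hand, even a crude text histogram of the raw elapsed times can expose a bimodal shape. The sketch below is one possible approach using only the standard library; the function name, bucket width, and sample data are illustrative assumptions. It lists only non-empty buckets, so gaps between populations appear as jumps in the bucket labels.

```python
from collections import Counter

def text_histogram(elapsed_ms, bucket_ms=500, width=40):
    """Return text lines showing sample counts per fixed-width latency bucket."""
    if not elapsed_ms:
        return []
    # Map each sample to the lower bound of its bucket
    counts = Counter((ms // bucket_ms) * bucket_ms for ms in elapsed_ms)
    peak = max(counts.values())
    lines = []
    for low in sorted(counts):
        bar = "#" * max(1, round(counts[low] * width / peak))
        lines.append(f"{low:>6}-{low + bucket_ms - 1:<6} ms | {bar} {counts[low]}")
    return lines

# Hypothetical bimodal data: cache hits near 50 ms, cache misses near 1200 ms
samples = [40, 55, 60, 45, 50, 52, 48, 1150, 1200, 1250, 1180]
lines = text_histogram(samples, bucket_ms=250)
for line in lines:
    print(line)
```

Two clusters separated by a wide run of empty buckets are the signature of two distinct request populations, which is worth investigating separately rather than summarizing with one number.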

A practical takeaway for reporting

When sharing results with stakeholders, lead with p95/p99, not the average — and if you must include the average for context, pair it with the max or p99 so the spread is visible, not just a single deceptively reassuring central number.

Takeaway: averages and percentiles answer different questions — “what’s typical” versus “how bad does it get for the unlucky tail” — and performance decisions almost always depend on the second question, not the first.
