Frontier evals are converging. That tells us less than you think.

Every model on the leaderboard is within a few points of every other. Here's what that does — and doesn't — mean for which one you should ship.

If you stare at MMLU, GPQA, SWE-Bench, and any of the agent harnesses long enough, the same picture emerges: the top six labs are within a few percentage points of each other across almost every public benchmark.

The natural read is that we have hit a capabilities plateau. The actual read is more interesting.

The plateau is partly a measurement artifact

Benchmarks have only so much headroom before they saturate. Once a frontier model hits 92% on MMLU, the difference between 92% and 93% is mostly variance in the remaining hard tasks, most of which are ambiguous, have wrong gold answers, or measure something other than what they were named for.

The gap between models is real, but the visible gap is shrinking faster than the underlying gap because the rulers we use can't measure past a certain point. We are at the wall of the easy-to-measure.
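
To put rough numbers on that, here is a back-of-envelope confidence interval showing how little resolution a saturated benchmark has left. The question counts are ballpark figures, and the sketch assumes independent questions and clean labels, which is generous:

```python
import math

def accuracy_ci(acc: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a benchmark accuracy."""
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

# ~14,000 test questions is a ballpark figure for MMLU.
for acc in (0.92, 0.93):
    lo, hi = accuracy_ci(acc, n=14_000)
    print(f"{acc:.0%}: [{lo:.3f}, {hi:.3f}]")  # each interval is ~±0.45 points wide

# But the comparison really rides on the ~1,100 questions the 92% model
# still misses. At roughly coin-flip performance on that residue:
lo, hi = accuracy_ci(0.5, n=1_100)
print(f"hard-subset resolution: ±{(hi - lo) / 2 * 100:.1f} points")  # ~±3 points
```

At that resolution, a one-point move at the top sits inside the noise floor of the questions that actually distinguish models, before you even count the mislabeled ones.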

The expensive evals are doing the talking

The evals where labs differentiate now are the ones that cost real money to run: long-horizon software engineering tasks (multi-hour traces), agentic browsing, multi-document research, and domain-specific deep work like proof verification or formal protocol design.

These are the evals where claims like "Claude is the best at long-context coding" or "GPT is the best at math research" get made. They are also the evals that almost no individual user looks at. Most people pick a model based on a vibes check against their own tasks. Which is fine, since that is actually a reasonable signal, but it should not be confused with a benchmark.

What this means for which model to use

A few practical rules of thumb:

For chat and short-form generation: the top six are effectively interchangeable. Pick on price, latency, or whichever API you already trust.

For coding agents (multi-step, multi-file): the gap is real and it matters. The leader changes every few months. Test on your repo, not on a leaderboard; a minimal harness sketch follows this list.

For long-context retrieval: model-by-model differences in needle-in-haystack are small. Differences in behavior (does it hallucinate quietly? does it cite?) are large. Test for behavior, not just accuracy.

For domain tasks (legal, medical, finance): most of the variance is in your prompt and grounding setup. Picking the "right" model is a 10% factor; everything else is 90%.
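
Here is what "test on your repo" can look like in practice. This is a minimal sketch, not a framework: the model IDs, the call_model() helper, and the evals/tasks.jsonl path are all placeholders to wire up to your own providers and tasks. The point is that the checks are behavioral (does it cite?), not just exact-match:

```python
# Minimal "test on your repo" harness. Everything named here is a
# placeholder: the model IDs, call_model(), and the tasks file path.
import json
from pathlib import Path

CANDIDATES = ["model-a", "model-b"]  # hypothetical model IDs

def call_model(model: str, prompt: str) -> str:
    """Placeholder: wire this to the provider SDK for each candidate."""
    raise NotImplementedError

def passes(output: str, checks: dict) -> bool:
    """Behavioral checks, not just exact-match accuracy."""
    needle = checks.get("must_contain")
    if needle and needle not in output:
        return False
    if checks.get("must_cite") and "http" not in output:
        return False  # crude citation check; adapt to your own citation format
    return True

def run(tasks_path: str = "evals/tasks.jsonl") -> None:
    # One task per line: {"prompt": "...", "checks": {"must_contain": "...", "must_cite": true}}
    lines = Path(tasks_path).read_text().splitlines()
    tasks = [json.loads(line) for line in lines if line.strip()]
    for model in CANDIDATES:
        score = sum(passes(call_model(model, t["prompt"]), t["checks"]) for t in tasks)
        print(f"{model}: {score}/{len(tasks)}")
```

Twenty tasks pulled from your own tickets will tell you more than any leaderboard delta, and the must_cite check is where "test for behavior, not just accuracy" bites.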

The real frontier is application-level

The implication of converging benchmarks is not that progress has stopped. It is that the differentiation has moved up the stack.

The interesting questions for the next year are at the application layer: how good is your retrieval? How fast is your tool-call fan-out? How sane is your eval set? Are your prompts cached? Have you actually pinned a model?
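
On that last question, pinning deserves to be boring and explicit. A minimal sketch, assuming nothing about any particular provider: keep one frozen config, use a dated snapshot ID rather than an auto-updating alias (the ID below is made up), and gate any change on a rerun of your eval set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """The single source of truth for which model the app is on."""
    model_id: str           # a dated snapshot, never a "latest" alias
    max_output_tokens: int
    temperature: float

# Hypothetical snapshot name; substitute your provider's dated ID.
PINNED = ModelConfig(
    model_id="example-model-2025-06-01",
    max_output_tokens=2048,
    temperature=0.2,
)
```

Swapping the snapshot then becomes a deliberate event, with an eval rerun attached, rather than something a provider-side alias does to you silently.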

Pick a frontier model that fits your latency and price budget, treat that choice as a 6-month commitment, and put your engineering effort into the things where you can actually move the needle.

The leaderboard will keep moving. Your needle can too, at the application layer, where it counts.
