Frontier evals are converging. That tells us less than you think.
Every model on the leaderboard is within a few points of every other. Here's what that does — and doesn't — mean for which one you should ship.
Articles
Llama and Mistral and DeepSeek give us weights. They do not give us reproducibility. The next decade of AI policy turns on the difference.