Benchmarks Are Becoming Storytelling Tools
Having trained more neural nets than I can count, I treat benchmarks with the same reverence and suspicion as lab equipment. They’re indispensable, yet anyone who controls the data split knows exactly how to make a model look better than it is. Over the past three years the public leaderboard races have stopped feeling like discovery and started feeling like predictably heroic marketing. Every new model appears a few points up on the same suite of benchmarks, give or take an outlier, and the coincidence is getting hard to swallow.
Visual ideas
- Minimal chart showing benchmark inflation with annotations like “training on the test.”
- Split illustration of a lab bench versus a PR podium sharing the same data line.
- Flow diagram of data pipelines (train/val/test) with contamination callouts.
We know how the sausage is made
Anyone who has trained production models carries muscle memory about clean splits: train to learn, validation to select, test to measure true performance. That discipline keeps you honest. It also teaches you exactly which knobs to turn if you ever wanted to cheat. Curate a validation set that mirrors a benchmark, overweight those samples in RLHF, or run extra finetuning passes on leaked solutions and you can claim progress that doesn’t exist in the wild.
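For concreteness, here is a minimal sketch of that three-way discipline using scikit-learn. The dataset and split proportions are placeholders I made up for illustration, not a prescription.

```python
# A minimal sketch of the clean-split discipline: carve off the test set first
# and never touch it during development, then split the rest into train and
# validation. The data and proportions below are placeholders.
from sklearn.model_selection import train_test_split

examples = [{"prompt": f"q{i}", "answer": f"a{i}"} for i in range(1000)]  # stand-in dataset

train_val, test = train_test_split(examples, test_size=0.1, random_state=0)  # test: measure once, at the end
train, val = train_test_split(train_val, test_size=0.1, random_state=0)      # val: pick checkpoints and hyperparameters

print(len(train), len(val), len(test))  # 810, 90, 100
```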
Benchmarks like MMLU or GSM8K have been scraped, shared, and remixed for years. Pretending teams don’t have these questions in their pretraining mix is disingenuous. I don’t think everyone is acting maliciously; it’s just too easy, especially when investors want an “up and to the right” graphic every quarter. The excitement fades when every release feels pre-ordained.
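To make “these questions are in the pretraining mix” concrete, here is a hedged sketch of the simplest contamination check I know of: n-gram overlap between benchmark items and a corpus. The function names, threshold, and toy strings are my own assumptions, not any lab’s actual decontamination tooling.

```python
# Rough sketch of an n-gram overlap contamination check. Everything below
# (names, threshold, toy data) is illustrative, not a real pipeline.

def ngrams(text, n=5):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n=5, threshold=0.3):
    """Fraction of benchmark items whose n-grams largely appear in the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = 0
    for item in benchmark_items:
        grams = ngrams(item, n)
        if grams and len(grams & corpus_grams) / len(grams) >= threshold:
            flagged += 1
    return flagged / max(len(benchmark_items), 1)

# Toy usage; a real check would stream the corpus rather than hold it in memory.
corpus = ["a train leaves the station at 9 am traveling 60 mph toward the city"]
benchmark = ["A train leaves the station at 9 am traveling 60 mph, when does it arrive?"]
print(contamination_rate(benchmark, corpus))  # 1.0: the item is flagged as contaminated
```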
RL makes benchmark worship even easier
Ilya Sutskever recently acknowledged the elephant in the room during his conversation with Dwarkesh Patel. Once reinforcement learning objectives are trained directly against known tasks, you can ratchet up scores without teaching the model anything general. The rules of the benchmark, the solution style, even the exact prompts are public. Reward models learn to mirror them, and suddenly the benchmark turns into a reinforcement curriculum rather than a neutral yardstick.
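To make the mechanism concrete, here is a deliberately silly toy sketch, entirely synthetic and not a description of any lab’s setup: policy-gradient updates against a public answer key drive the benchmark score to 100% purely by memorization.

```python
# Toy illustration: when the RL reward is exact match against a fixed, public
# answer key, a policy can max out the benchmark without learning anything
# that transfers. All names and numbers are invented for this sketch.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical public benchmark: 5 questions, one known answer each,
# drawn from a 10-token answer vocabulary.
n_questions, n_answers = 5, 10
answer_key = rng.integers(0, n_answers, size=n_questions)

# "Policy": independent logits per question, i.e. one free parameter per
# public test item (the degenerate case of fitting the benchmark itself).
logits = np.zeros((n_questions, n_answers))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# REINFORCE with exact-match reward against the public answer key.
for _ in range(2000):
    probs = softmax(logits)
    sampled = np.array([rng.choice(n_answers, p=p) for p in probs])
    reward = (sampled == answer_key).astype(float)
    for q in range(n_questions):
        grad = -probs[q]                 # d log pi(a) / d logits = onehot(a) - probs
        grad[sampled[q]] += 1.0
        logits[q] += 0.5 * reward[q] * grad

score = (softmax(logits).argmax(axis=1) == answer_key).mean()
print(f"benchmark accuracy after RL: {score:.0%}")  # climbs to 100% on this toy
# A question outside the answer key would need brand-new logits: nothing transferred.
```

The algorithm is beside the point; once the reward and the yardstick are the same object, the yardstick stops measuring anything.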
This is where the “numbers lie” feeling comes from. On paper the frontier systems appear miles ahead, but when you actually work with them they still hallucinate citations, misread PDFs, or crumble on a new domain. I keep an eye on projects like the [[AI Omniscience index]] because fresh tasks temporarily reset the clock: nobody has overfit yet, so the spread between labs says something real—at least until the next round of fine-tuning closes the gap.
Chase fresh problems, not stale leaderboards
None of this means progress has stalled. It just means the metrics we obsess over now describe an environment the models have already memorized. The only antidote is relentless curiosity about new, genuinely unseen problems.
When a new benchmark drops, I treat the first two release cycles as the golden window: calibrate expectations, capture observations, and then move on before the scores converge. The rest of the time I focus on qualitative probes—can the model operate a shell, refactor a codebase, or plan an experiment without spoon-fed cues? Those capabilities still lag well behind what the shiny leaderboard deltas imply.