Interpretability is having a moment. Here's why it matters now.

Years of patient mech-interp work just turned a corner. The result is the most concrete reason for optimism in alignment research in years.

For most of the last five years, "interpretability" has been the polite word AI labs used for "we are extremely smart and patient and we are trying."

Sparse autoencoders, dictionary learning, circuit discovery — these were respectable lines of research that produced fascinating papers that did not, in any operational sense, change how anyone deployed a model. You could not point at a frontier system and say "we know why it does what it does."

That sentence is starting to shift, and the shift happened in a hurry over the past nine months.

What changed

Three things happened roughly at once:

Feature-level steering became reliable. The Anthropic and DeepMind papers on activation editing, building on the SAE work from 2024, crossed a threshold where you can identify a feature, intervene on it, and predict the behavior change with reasonable confidence. The famous Golden Gate Claude demo was a parlor trick; the new generation of work is closer to "edit this circuit and you predictably reduce sycophancy by 22% on this eval, with measurable cost on this other eval."
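
As a rough illustration of what "identify a feature and intervene on it" means in code, here is a minimal steering sketch using a PyTorch forward hook. Everything in it is hypothetical: `feature_dir` stands in for a decoder direction taken from a trained SAE, and the model attribute, layer index, and coefficient are made-up placeholders, not values from any published paper.

```python
import torch

# Minimal sketch of feature-level steering via a forward hook. Not any
# lab's actual pipeline. Assumes `model` is a GPT-2-style HuggingFace
# transformer, `inputs` is a tokenized batch, and `feature_dir` is a
# unit-norm SAE decoder direction. Layer index and coeff are illustrative.

def make_steering_hook(feature_dir: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * feature_dir  # nudge the residual stream
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

layer = model.transformer.h[20]  # hypothetical hook point
handle = layer.register_forward_hook(make_steering_hook(feature_dir, coeff=4.0))
try:
    out = model.generate(**inputs, max_new_tokens=50)
finally:
    handle.remove()  # always undo the intervention
```

The whole trick is one added vector in the residual stream; the hard part, and the part the recent work advances, is finding directions for which the behavioral effect is predictable.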

Causal scrubbing got cheap enough to be a tool. What used to be a multi-week research project is now a workflow. Causal scrubbing — the technique of replacing parts of a model's computation with shuffled versions to see what actually matters — has been turned into a library researchers can run on any released model.
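
To give a feel for the workflow, here is a heavily simplified, single-layer version of the idea: resample-ablate one component and see whether a behavioral metric survives. Real causal scrubbing tests a full hypothesis over the computation graph; `model`, `batch`, and `metric` below are stand-ins, not any library's API.

```python
import torch

# Toy resampling ablation in the spirit of causal scrubbing: replace one
# component's activations with its activations on *different* inputs and
# measure how much the behavior changes.

def resample_ablate(model, layer, batch, metric):
    perm = torch.randperm(batch["input_ids"].shape[0])
    cache = {}

    def record(module, inp, out):
        cache["act"] = (out[0] if isinstance(out, tuple) else out).detach()

    def patch(module, inp, out):
        act = cache["act"][perm]  # activations drawn from shuffled inputs
        return (act, *out[1:]) if isinstance(out, tuple) else act

    handle = layer.register_forward_hook(record)
    with torch.no_grad():
        clean = metric(model(**batch))
    handle.remove()

    handle = layer.register_forward_hook(patch)
    with torch.no_grad():
        scrubbed = metric(model(**batch))
    handle.remove()

    # A gap near zero is consistent with "this layer's specific computation
    # doesn't matter for the behavior"; a large gap falsifies that claim.
    return (scrubbed - clean).item()
```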

Labs started publishing real interventions, not just observations. This is the biggest shift. We now have multiple production examples of "we found this circuit, we modified the training procedure, and here is the eval delta." That is interpretability that matters.

Why this is different from the earlier hype

Previous waves of interpretability optimism, going back to the original Anthropic transformer-circuits work, ran into the same two walls: the techniques generalized poorly to large models, and the insights were hard to operationalize.

Both walls are getting climbed, slowly. Larger SAEs train better than the small ones did, and frontier-scale dictionaries are now public. The "operationalize" wall is being climbed by tooling — debugger-style interfaces over models that turn what used to be PhD-level investigation into something a senior engineer can do in an afternoon.
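
For readers who have not trained one, the objective behind those dictionaries is small enough to fit in a few lines. Here is a minimal sparse autoencoder sketch; the dimensions and L1 coefficient are placeholders, not anyone's production settings.

```python
import torch
import torch.nn as nn

# Minimal sparse autoencoder for dictionary learning over residual-stream
# activations. Dimensions and the L1 coefficient are illustrative only.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 4096, d_dict: int = 65536):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

def sae_loss(x, x_hat, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that buys sparsity.
    return ((x - x_hat) ** 2).mean() + l1_coeff * feats.abs().mean()
```

Each column of `dec.weight` is one dictionary entry; the steering direction in the earlier sketch would be one of those columns, normalized.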

What it doesn't yet do

It is worth being calibrated. Interpretability is not yet:

  • A safety guarantee. No one is claiming "we have proven this model is aligned."
  • A complete picture. We can identify some features. We cannot enumerate all of them.
  • Independent of the training lab. Most of what we know about a model's circuits comes from labs that cooperate by publishing weights and tooling. Closed-model interpretability is still embryonic.

But the trajectory is right. For the first time, "we want to ship a model with provable properties" feels like an engineering problem with known intermediate steps, not a research direction with unknown unknowns.

What to watch over the next year

Three concrete signposts:

  1. Production-grade SAE-based dashboards. When labs ship internal tooling that lets non-researchers diagnose model failures via feature activations, the technique has crossed the chasm.
  2. Public interpretability commitments from regulators. Watch for "model providers shall publish interpretability artifacts for systems above N FLOPs" — this is not far off in the EU.
  3. First credible "we caught this misalignment via interpretability" announcement. Inevitable; the question is when, and whether it's a real finding or theatre.

Years of careful, unglamorous work just compounded. This is what the slow, steady progress in alignment research looks like when it actually pays off.
