If you ask a developer "what does open-source AI mean?" you'll get an answer about weights — usually some variation of "you can download it and run it locally."
This is correct, and it is also a substantial flattening of the actual question.
We have unprecedented availability of frontier-class model weights. Llama, Mistral, DeepSeek, Qwen, the steady cycle of releases from Chinese labs — anyone with a GPU can run a model within striking distance of GPT-4-class performance. That is genuinely good. It changes the geometry of who can build with these systems.
But "open weights" does not mean "open model." It means "the artifact is free." The thing the artifact was made from — the data — is, with one or two notable exceptions, locked.
Why this matters
A weight file is the result of a recipe. Without the recipe, you can use the result, but you cannot:
- Audit what the model has been trained on (and therefore cannot audit for bias, contamination, or copyright concerns)
- Reproduce the model from scratch (and therefore cannot verify any of the lab's claims)
- Train a comparable model with different choices (and therefore cannot compete with the lab on equal footing)
- Train a better model by improving the recipe (and therefore cannot move the field forward in the way open source has moved every other field)
These are not academic concerns. The largest open-source efforts of the past decade — Linux, Postgres, Kubernetes — all rely on the recipe being open. Imagine Linux where the source was secret but the binaries were free. We would not call that open source. We would call it freeware, which is exactly what weight-only releases are.
The current landscape
A small number of efforts are doing the right thing here:
- OLMo (AI2) publishes weights, training data, training code, and intermediate checkpoints. The closest thing to a fully open frontier model.
- Pythia (EleutherAI) is older and smaller but exemplary in the same way.
- The Pile (EleutherAI) is a curated, openly documented training corpus.
- DataComp-LM (DCLM) publishes both a curated corpus and baseline models, demonstrating competitive scores from a fully open recipe.
These efforts get a fraction of the press of the next Llama release, and they are doing more for the long-term health of the ecosystem than all of the weight-only releases combined.
What the policy fight actually looks like
The interesting policy question of the next two years is not "should frontier models be open?" It is "what does open mean?"
Watch for:
- Reporting requirements that include data provenance, not just model cards. The EU AI Act's general-purpose model regime nibbles at this. Expect the next iteration to bite harder.
- Commercial-use restrictions in "open" licenses that make the weights effectively closed for the entities most likely to build on them. Llama's license is an early indicator; Meta's posture has shifted twice in eighteen months.
- Data laundering through synthetic generation. If your model's training data is "outputs of a closed model," the openness story collapses. We will see lawsuits about this within the year.
- Training-data carbon disclosure. Coming, slowly, but coming.
What developers should do about it
Two practical things:
- Distinguish between "open weights" and "open recipe" in how you describe the systems you build on. Words shape policy.
- Cite OLMo, Pythia, and the open-recipe community when you have the choice. Their continued funding depends on being visibly used. Closed data is the default; openness is a small, contingent culture that has to be defended.
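The distinction between "open weights" and "open recipe" can be made mechanical: ask which artifacts a release actually publishes. A minimal sketch in Python — the `Release` fields and the three labels are illustrative, not an established taxonomy:

```python
from dataclasses import dataclass

@dataclass
class Release:
    """Which artifacts a lab publishes alongside a model.

    Fields are a hypothetical checklist, not a standard schema.
    """
    weights: bool = False
    training_data: bool = False
    training_code: bool = False
    checkpoints: bool = False

def describe(r: Release) -> str:
    """Label a release: 'closed', 'open weights', or 'open recipe'."""
    if not r.weights:
        return "closed"
    # "Open recipe" requires the inputs, not just the artifact:
    # the data and the code that turned it into weights.
    if r.training_data and r.training_code:
        return "open recipe"
    return "open weights"

# A weights-only drop vs. an OLMo-style full release.
print(describe(Release(weights=True)))                 # open weights
print(describe(Release(weights=True, training_data=True,
                       training_code=True, checkpoints=True)))  # open recipe
```

The point of the exercise is that most "open" releases today land in the middle bucket, and precise wording in blog posts, papers, and policy comments is what keeps the middle bucket from being mistaken for the third.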
The thing about openness in software is that it has always been a fight, never an inevitability. Linux was a long-shot bet that lost for years before it won. The "open AI" question is being decided right now, on a faster clock, with more money at stake. The fight worth having is not over weights. It's over what they were made from.
