While I'm always happy to see more people using open models, I was hoping the "playing" would be a bit more about actually interacting with the models themselves, rather than just running them.
For anyone interested in playing around with the internals of LLMs without needing to worry about having the hardware to train locally, a couple of projects I've found really fun and educational:
- Implement speculative decoding for two different sized models that share a tokenizer [0]
- Enforce structured outputs through constrained decoding (a great way to dive deeper into regex parsing as well).
- Create a novel sampler using entropy or other information about token probabilities (a rough sketch of this one follows below)
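Not part of the original list, but here's a minimal sketch of what that third project could look like, assuming PyTorch and a [vocab_size] logits vector from whatever local model you're running:

    import torch

    def entropy_aware_sample(logits: torch.Tensor,
                             base_temperature: float = 1.0,
                             max_damping: float = 0.5) -> int:
        # Toy idea: when the next-token distribution has high entropy (the
        # model is unsure), cool the temperature so we don't sample deep
        # into the tail of the distribution.
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum()
        max_entropy = torch.log(torch.tensor(float(probs.numel())))
        uncertainty = (entropy / max_entropy).item()  # normalised to [0, 1]

        temperature = base_temperature * (1.0 - max_damping * uncertainty)
        scaled = torch.softmax(logits / max(temperature, 1e-3), dim=-1)
        return torch.multinomial(scaled, num_samples=1).item()

Wiring something like this into a generation loop and comparing it against plain temperature sampling is exactly the kind of poking at internals this list is about.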
The real value of open LLMs, at least for me, has been that they aren't black boxes: you can open them up and take a look inside. For all the AI hype, it's a bit of a shame that so few people seem to really be messing around with the insides of LLMs.
0. https://arxiv.org/pdf/2211.17192
> On paper, this looks like a success. In practice, the time spent crafting a prompt, waiting for the AI to run and fixing the small issue that came up immensely exceeds the 10 minutes it would have taken me to edit the file myself. I don’t think coding that way would lead me to a massive performance improvement for now.
The models used in this experiment - deepseek-r1:8b, mistral:7b, qwen3:8b - are tiny. It's honestly a miracle that they produce anything that looks like working code at all!
I'm not surprised that the conclusion was that writing without LLM assistance would be more productive in this case.
Weird how this story came out a ~few hours later~ at about the same time: https://news.ycombinator.com/item?id=44723316
That isn't an open source model, but a quantized version of GLM-4.5, an open-weight model. I'd say there's hope yet for small, powerful open models.
Yeah, those small models can work for SillyTavern or some basic rubber ducking, but they're not nearly large enough for coding. I have had no luck coding with anything smaller than 30B models. I've found 13B models to be not terrible for boilerplate and code completion, but 8B seems way too dumb for the task.
Yeah, the truth is that avoiding the big players is silly right now. It's not that small models won't eventually work either; we have no idea how far they can be compressed in future, especially with people trying to get the mixture-of-experts approach working.
Right now, you need the bigger models for good responses, but in a year's time?
So the whole exercise was a bit of a waste of his time; the target simply moves too quickly at present. This isn't a time to be clutching your pearls about running your own models unless you want to do something shady with AI.
And just as video streaming was pushed forward by the porn industry, a lot of people are watching the, um, "thirsty" AI enthusiasts for the big advances in small models.
That's too simplified IMHO. Local models can do a lot: sorting text, annotating images, text-to-speech, speech-to-text. It's much cheaper when it works. Software development is not on that list because the quality of the output determines the time developers spend prompting and fixing; it's just faster and cheaper to use a big model.
Those LLM influencers don't know what a distill is. DeepSeek R1 8B IS a distilled Qwen2. You should be using Qwen3 8B-14B instead; it's a lot better.
That's literally what I ended up doing in the article tho?
"Deepseek R1" sounds cooler, everybody heard about it.
This is definitely a useful exercise worth going through for the educational value before eventually surrendering and just using the big models owned by "unprofitable companies."
Never accepted the terms of service of any proprietary models-as-a-service providers and never will.
Be one of the few humans still pretty good at using their own brains for those problems LLMs can't solve, and you will be very employable.
If you don't find out what the models can do, how can you know what problems they can't solve?
I am plenty able to keep up with the experiments others are doing and their results, but as for me, my time is best spent building things that have never existed before, for which models have no prior art to draw on.
How can you tell that what you are building has never existed before? Especially without interacting with an LLM about it? You should assume it stores your chats though, so be careful about revealing it to ChatGPT and friends.
I build confidence that this is true by using search engines that link to potentially related projects when I query them. It's the same data LLMs trained on in the first place, except more up to date.
Nice writeup and status update on the use of FOSS ML tools. Saves me a lot of time!
What kind of hardware do we need to run those models?
“I used all the top tier ‘Open Source LLMs’ and they suck. I was right, like always. LLMs suck so hard, my job is safe!”
DeepSeek R1 8B isn't famous for anything (except maybe being confused for DeepSeek R1) and isn't by DeepSeek any more than me finetuning Llama makes me the creator of Llama.
Can we all please stop confusing freeware with Open Source?
If something cannot be reproduced from sources which are all distributed under an OSI license, it is not Open Source.
Non public sources of unknown license -> Closed source / Proprietary
No training code, no training sources -> Closed source / Proprietary
OSI public source code -> Open Source / Free Software
These terms are very well defined. https://opensource.org/osd
The Open Source Initiative themselves decided last year to relax their standards for AI models: they don't require the training data to be released. https://opensource.org/ai
They do continue to require the core freedoms, most importantly "Use the system for any purpose and without having to ask for permission". That's why a lot of the custom licenses (Llama etc) don't fit the OSI definition.
> The Open Source Initiative themselves decided last year to relax their standards for AI models: they don't require the training data to be released.
Poor move IMO. Training data should be required to be released for a model to be considered open source. Without it, all I can do is set weights, etc. Without the training data I can't truly reproduce the model, inspect the data for biases, audit the model for fairness, or make improvements and redistribute them (a core open source ethos).
Keeping the training data closed means it's not truly open.
Their justification for this was that, for many consequential models, releasing the training data just isn't possible.
Obviously the biggest example here is all of that training data which was scraped from the public web (or worse) and cannot be relicensed because the model producers do not have permission to relicense it.
There are other factors too, though. A big one is things like health data: if you train a model that can, e.g., visually detect cancer cells, you want to be able to release that model without having to release the private health scans it was trained on.
See their FAQ item: Why do you allow the exclusion of some training data? https://opensource.org/ai/faq#why-do-you-allow-the-exclusion...
Wouldn't it be great, though, if it were public knowledge exactly what they were trained on and how, even though the data itself cannot be freely copied?
> Poor move IMO. Training data should be required to be released to be considered an open source model.
The actual poor move is trying to fit the term "open source" onto AI models at all, rather than new terms with names that actually match how models are developed.
This notably marks a schism with the FSF; it's the first time and context in which "open-source" and "free software" have not been synonymous, coextensive terms.
I think it greatly diminishes the value of the concept and label of open-source. And it's honestly a bit tragic.
I don't agree with that definition. For a given model I want to know what I can/cannot expect from it. To have a better understanding of that, I need to know what it was trained on.
For a (somewhat extreme) example, what if I use the model to write children's stories, and suddenly it regurgitates Mein Kampf? That would certainly ruin the day.
Are you going to examine a few petabytes of data for each model you want to run, to check if a random paragraph from Mein Kampf is in there? How?
We need better tools to examine the weights (what gets activated to which extent for which topics, for example). Getting the full training corpus, while nice, cannot be our only choice.
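For illustration only, a rough sketch of the kind of activation probing being asked for here, using Hugging Face transformers; the checkpoint name is a placeholder, not a recommendation:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "some-org/some-small-open-model"  # hypothetical checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
    model.eval()

    def layer_signature(text: str) -> torch.Tensor:
        # Mean absolute hidden-state activation per layer for a piece of text.
        with torch.no_grad():
            out = model(**tok(text, return_tensors="pt"))
        # out.hidden_states is a tuple of [1, seq_len, hidden] tensors, one per layer
        return torch.stack([h.abs().mean() for h in out.hidden_states])

    # Compare which layers "light up" for which topics.
    topics = {
        "cooking": "How do I caramelise onions without burning them?",
        "history": "Summarise the causes of the First World War.",
    }
    for topic, prompt in topics.items():
        print(topic, layer_signature(prompt)[:5])

It's crude, but it's the sort of tooling that would have to mature before weight inspection could substitute for seeing the corpus.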
> Are you going to examine a few petabytes of data for each model (...) How?
I can think of a few ways. Perhaps I'd use an LLM to find objectionable content. But anyway, it is the same argument as you can have against e.g. the Linux kernel. Are you going to read every line of code to see if it is secure? Maybe, or maybe not, but that is not the point.
The point is that, right now, a model is a black box. It might as well be a Trojan horse.
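Purely as a sketch of the "use an LLM/classifier to scan the corpus" idea; the model name and label are placeholders, and a real audit over petabytes would have to be sharded across many workers:

    from transformers import pipeline

    # Hypothetical objectionable-content classifier; swap in whatever you trust.
    classifier = pipeline("text-classification", model="some-org/content-screener")

    def audit(paths, chunk_chars=2000, threshold=0.9):
        # Yield (path, offset, score) for chunks the classifier flags.
        for path in paths:
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            for offset in range(0, len(text), chunk_chars):
                result = classifier(text[offset:offset + chunk_chars], truncation=True)[0]
                if result["label"] == "objectionable" and result["score"] >= threshold:
                    yield path, offset, result["score"]

    for hit in audit(["shard-000.txt"]):  # in reality: millions of shards
        print(hit)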
You would use an LLM to process a few petabytes of data to find a needle in the haystack?
Cheaper to train your own.
Let's pretend for a moment that the entire training corpus for Deepseek-R1 were released.
How would you download it?
Where would you store it?
I mean, many people I know have 100 TB+ of storage at home now. A large enough team of dedicated community members cooperating and sharing compute resources online should be able to reproduce any model.
Too bad. The OSI owns "open source".
Big tech has been abusing open source to cheaply capture most of the internet and e-commerce anyway, so perhaps it's time we walked away from the term altogether.
The OSI has abdicated the future of open machine learning. And that's fine. We don't need them.
"Free software" is still a thing and it means a very specific and narrow set of criteria. [1, 2]
There's also "Fair software" [3], which walks the line between CC BY-NC-SA and shareware, but also sticks it to big tech by preventing Redis/Elasticsearch capture by the hyperscalers. There's an open game engine [4] that has a pretty nice "Apache + NC" type license.
---
Back on the main topic of "open machine learning": since the OSI fucked up, I came up with a ten-point scale here [5] defining open AI models. It's just a draft, but if other people agree with the idea, I'll publish a website about it (so I'd appreciate your feedback!)
There are ten measures by which a model can/should be open:
1. The model code (pytorch, whatever)
2. The pre-training code
3. The fine-tuning code (which might be very different from the pre-training code)
4. The inference code
5. The raw training data (pre-training + fine-tuning)
6. The processed training data (which might vary across various stages of pre-training and fine-tuning: different sizes, features, batches, etc.)
7. The resultant weights blob(s)
8. The inference inputs and outputs (which also need a license; see also usage limits like O-RAIL)
9. The research paper(s) (hopefully the model is also described and characterized in the literature!)
10. The patents (or lack thereof)
A good open model will have nearly all of these made available. A fake "open" model might only give you two of ten.
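If it helps make the proposal concrete, here is one hypothetical way to turn the ten points into a checklist; the key names and the scoring are mine, not part of the draft:

    OPENNESS_CRITERIA = [
        "model_code", "pretraining_code", "finetuning_code", "inference_code",
        "raw_training_data", "processed_training_data", "weights",
        "inference_io_license", "papers", "patent_grant",
    ]

    def openness_score(released: set[str]) -> str:
        # Report how many of the ten criteria a release actually satisfies.
        hits = [c for c in OPENNESS_CRITERIA if c in released]
        return f"{len(hits)}/10: {', '.join(hits) or 'nothing released'}"

    # A typical "open weights only" release scores 2/10 on this scale.
    print(openness_score({"weights", "inference_code"}))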
---
[1] https://www.fsf.org/
[2] https://en.wikipedia.org/wiki/Free_software
[3] https://fair.io/
[4] https://defold.com/license/
[5] https://news.ycombinator.com/item?id=44438329
> These terms are very well defined. https://opensource.org/osd
Yes. And you're using them wrong.
From the OSD:
< The source code must be the preferred form in which a programmer would modify the program. >
So, what's the preferred way to modify a model? You get the weights and then run fine-tuning with a relatively small amount of data. Which is way cheaper than re-training the entire thing from scratch.
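A minimal sketch of that modification path, assuming the Hugging Face transformers/peft stack; the checkpoint and corpus names are placeholders:

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "some-org/some-open-weights-model"  # hypothetical checkpoint
    tok = AutoTokenizer.from_pretrained(base)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token  # some causal LMs ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Train small LoRA adapters instead of touching the full weights -- this is
    # what makes "modifying" a model far cheaper than pre-training it.
    model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

    ds = load_dataset("text", data_files={"train": "my_small_corpus.txt"})["train"]
    ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                               num_train_epochs=1),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()

    model.save_pretrained("out/lora-adapter")  # the adapter is megabytes, not gigabytes

The from-scratch equivalent would mean reproducing the entire pre-training run, which is the cost asymmetry the rest of the comment points at.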
---
The issue is that normal software doesn't have a way to modify the binary artifacts without completely recreating them, whereas AI models not only have such a way but also a large cost difference between the two. The development lifecycle has nodes that don't exist for normal software.
Which means that really, AI models need their own different terminology that matches that difference. Say, open-weights and open-data or something.
Kinda like how Creative Commons is a thing because software development lifecycle concepts don't map very well to literature or artwork either.
The worst part is that they even lack internal consistency, so they (someone at least) know it's wrong, yet they persist regardless.
> https://www.llama.com/ - "Industry Leading, Open-Source AI"
> https://www.llama.com/llama4/license/ - “Llama Materials” means, collectively, Meta’s proprietary Llama 4
Either the team that built the landing page (Marketing dept?) is wrong, or the legal department is wrong. I'm pretty sure I know who I'd bet on to be more correct.
Meta is indeed leading the gaslighting efforts to convince the press and masses that binary blobs provably built from pirated sources are actually Open Source.
The sad part is it is working. It is almost like Meta is especially skilled at mass public manipulation.
To be fair, Zuckerberg is not the worst; see "OpenAI".
To be honest, they're more or less the same. Zuckerberg constantly and proudly talks about his "open source models", while OpenAI at least doesn't claim to release open source models when they're not. Still, I agree that the name should change, but in my eyes they're equally shit at knowing what "Open" is supposed to mean.
No, because "open source" itself was never clear enough to carry that weight.
That's why we keep being annoying about "Free Software."
> OSI public source code -> Open Source / Free Software
Can we all please stop confusing Free/Libre Open Source with Open Source?
https://www.gnu.org/philosophy/open-source-misses-the-point....
Maybe if we'd focused on communicating the ethics, the world wouldn't be so unaware of the differences.
Are there any instances of OSI-licensed code that are not Free Software, making my statement here invalid?
I was attempting to convey that when software is called Open Source and actually is based on OSI-licensed sources, then people are likely talking about Free Software.
Since you didn't get a real answer yet: Absolutely. For example, MIT is an OSI license but it's not Free Software.
The last time I heard a comment along those lines I was attending a session by an Open Source person and up on screen they had a picture of RMS dressed as Che Guevara.
All those silly ethics, they get in the way of the real work!
Honestly, I think "the framers" got it right here.
Too much communicating of the ethics would have bogged down the useful legal work.
My take is, Free Software actually won and we're in a post-that world.
I'm not sure I fully understand - whilst I agree there's been useful legal work, we now have such a plethora of licenses that I ended up having to argue for what I'd call basic common sense when someone suggested using a highly restrictive "community" license with ridiculous intents, such as saying you can't use it in a particular industry because that industry is "bad".
The reason Free/Libre Open Source Software wins - and always will do in the long run - is because the four freedoms are super-simple and they reflect how the natural world works.
Oh, you and I 100% agree.
I meant the useful legal work too. AKA, get the four freedoms and law in line as much as possible and no more.
You're right, almost all the other licences range from "hmm, interesting experiment" to "well, that's just goofy."
Glad someone else said all this so I did not have to. Hats off to you
The older I get, the more I fear I am turning into Richard Stallman, but it is absolutely offensive to continually see corporate proprietary freeware binary blobs built from unlicensed sources confused with Free Open Source Software, which represents a dramatically higher bar of investment in the public good.
> The older I get the more I fear I am turning into Richard Stallman
You and me both. I always preferred and promoted FLOSS where possible but still had a bit of a pragmatic approach, but now the older I get the more I just want to rip out everything not free (as in freedom) from my life, and/or just go become a goat farmer.
Stallman was right from the beginning, and big tech have proven over and over again that they are incapable of being good citizens.
I'm probably a few more years away from "I'd like to interject for a moment..."
Don’t worry, until you’re chewing on your foot mid-lecture, you’re good.
I mean the article did mention some legitimately open source models.
Which would have been great to distinguish from the ones that certainly are not.