I recently discovered that ollama no longer uses llama.cpp as a library, and instead they link to the low level library (ggml) which requires them to reinvent a lot of wheel for absolutely no benefit (if there's some benefit I'm missing, please let me know).
Even using llama.cpp as a library seems like an overkill for most use cases. Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket, and forward requests to it.
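Concretely, something like this (untested sketch; the model path is a placeholder, and I'm not sure llama-server can bind a unix socket directly, so this uses a localhost port instead):

    # start llama-server as a child process
    llama-server -m ./models/some-model-q4_k_m.gguf --port 8081 &

    # forward OpenAI-style requests to it
    curl -s http://127.0.0.1:8081/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"hello"}]}'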
One thing I'm curious about: Does ollama support strict structured output or strict tool calls adhering to a json schema? Because it would be insane to rely on a server for agentic use unless your server can guarantee the model will only produce valid json. AFAIK this feature is implemented by llama.cpp, which they no longer use.
I got to speak with some of the leads at Ollama and asked more or less this same question. The reason they abandoned llama.cpp is because it does not align with their goals.
llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time (sometimes faster, sometimes slower) and things break really often. You can't hope to establish contracts with simultaneous releases if there is no guarantee the model will even function.
By reimplementing this layer, Ollama gets to enjoy a kind of LTS status that their partners rely on. It won't be as feature-rich, and definitely won't be as fast, but that's not their goal.
Georgi responded to some of the issues Ollama has in the linked thread [1]:
> Looking at ollama's modifications in ggml, they have too much branching in their MXFP4 kernels and the attention sinks implementation is really inefficient. Along with other inefficiencies, I expect the performance is going to be quite bad in ollama.
ollama responded to that
> Ollama has worked to correctly implement MXFP4, and for launch we've worked to validate correctness against the reference implementations against OpenAI's own.
> Will share more later, but here is some testing from the public (@ivanfioravanti) not done by us - and not paid or
leading to another response
> I am sure you worked hard and did your best.
> But, this ollama TG graph makes no sense - speed cannot increase at larger context. Do you by any chance limit the context to 8k tokens?
> Why is 16k total processing time less than 8k?
Whether or not Ollama's claim is right, I find this "we used your thing, but we know better, we'll share details later" behaviour a bit weird.
[1] https://x.com/ggerganov/status/1953088008816619637
ollama has always had a weird attitude towards upstream, and then they wonder why many in the community don't like them
> they wonder why many in the community don't like them
Do they? They probably care more about their "partners".
As GP said:
That's a dumb answer from them.
What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel? Like every Linux distro that's ever run into this issue?
Red Hat doesn't ship the latest build of the Linux kernel to production. And Red Hat didn't reinvent the Linux kernel for shits and giggles.
The Linux kernel does not break userspace.
> What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel?
Yeah, they tried this, this was the old setup as I understand it. But every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them. They said it happened more than once, but one particular case (wish I could remember what it was) was so bad they felt they had no choice but to reimplement. It's the lowest risk strategy.
> Yeah, they tried this, this was the old setup as I understand it. But every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them.
Shouldn't any such regressions be regarded as bugs in llama.cpp and fixed there? Surely the Ollama folks can test and benchmark the main models that people care about before shipping the update in a stable release. That would be a lot easier than trying to reimplement major parts of llama.cpp from scratch.
> every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them. They said it happened more than once, but one particular case (wish I could remember what it was) was so bad they felt they had no choice but to reimplement. It's the lowest risk strategy.
A much lower risk strategy would be using multiple versions of llama-server to keep supporting old models that would break on newer llama.cpp versions.
The Ollama distribution size is already pretty big (at least on Windows) due to all the GPU support libraries and whatnot. Having to multiply that by the number of llama.cpp versions supported would not be great.
    llamacpp> ls -l *llama*
    -rwxr-xr-x 1 root root 2505480 Aug 7 05:06 libllama.so
    -rwxr-xr-x 1 root root 5092024 Aug 7 05:23 llama-server
That's a terrible excuse; llama.cpp is just 7.5 megabytes. You can easily ship a couple of copies of that. The current Ollama for Windows download is 700MB.

I don't buy it. They're not willing to make a 700MB download a few megabytes bigger, to ~730MB, but they are willing to support a fork/rewrite indefinitely (and the fork is outside of their core competency, as seen by the current issues)? What kind of decision-making is that?
It's 700 MiB because they're likely redistributing the CUDA libraries so that users don't need to separately run that installer. Llama.cpp is a bit more “you are expected to know what you’re doing” on that front. But yeah, you could plausibly ship multiple versions of the inference engine, although from a maintenance perspective that sounds like hell for any number of reasons.
> llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time
Ironic that (according to the article) ollama rushed to implement GPT-OSS support, and thus broke the rest of the gguf quants (if I understand correctly).
This is a good handwave-y answer for them, but the truth is they've always been allergic to ever mentioning llama.cpp, even when legally required. They made a political decision instead of an engineering one, and now justify it to themselves and to you by handwaving about it somehow being less stable than the core of it, which they still depend on.
A lot of things happened to get to the point they're getting called out aggressively in public on their own repo by nice people, and I hope people don't misread a weak excuse made in conversation as solid rationale, based on innuendo. llama.cpp has been just fine for me, running on CI on every platform you can think of, for 2 years.
EDIT: I can't reply, but, see anoncareer2012's reply.
It's clear you have a better handle on the situation than I do, so it's a shame you weren't the one to talk to them face-to-face.
> llama.cpp has been just fine for me.
Of course, so you really shouldn't use Ollama then.
Ollama isn't a hobby project anymore, they were the only ones at the table with OpenAI many months before the release of GPT-OSS. I honestly don't think they care one bit about the community drama at this point. We don't have to like it, but I guess now they get to shape the narrative. That's their stance, and likely the stance of their industry partners too. I'm just the messenger.
> ...they were the only ones at the table with OpenAI many months before the release of GPT-OSS
In the spirit of TFA:
This isn't true, at all. I don't know where the idea comes from.
You've been repeating this claim frequently. You were corrected on this 2 hours ago. llama.cpp had early access to it just as well.
It's bizarre for several reasons:
1. It is a fantasy that engineering involves seats at tables and bands of brothers growing from a hobby to a ???, one I find appealing and romantic. But, fantasy nonetheless. Additionally, no one mentioned or implied anything about it being a hobby or unserious.
2. Even if it wasn't a fantasy, it's definitely not what happened here. That's what TFA is about, ffs.
No heroics: they got the ultimate embarrassing thing that can happen to a project piggybacking on FOSS. ollama can't work with the materials OpenAI put out to help ollama users, because llama.cpp and ollama had separate day 1 landings of code, and ollama has zero path to forking literally the entire community onto their format. They were working so loosely with OpenAI that OpenAI assumed they were being sane and weren't attempting to use it as an excuse to force a community fork of GGUF, and no one realized until after it shipped.
3. I've seen multiple comments from you this afternoon spinning out odd narratives about Ollama and llama.cpp that don't make sense on their face from the perspective of someone who also depends on llama.cpp. AFAICT you understood the GGML fork as some halcyon moment of freedom / not-hobbiness for a project you root for. That's fine. Unfortunately, reality is intruding, hence TFA. Given you're aware, it makes your humbleness re: knowing what's going on here sound very fake, especially when it precedes another rush of false claims.
4. I think at some point you owe it to even yourself, if not the community, to take a step back and slow down on the misleading claims. I'm seeing more of a gish-gallop than an attempt to recalibrate your technical understanding.
It's been almost 2 hours since you claimed you were sure there were multiple huge breakages due to bad code quality in llama.cpp, and here, we see you reframe that claim as a much weaker one someone else made to you vaguely.
Maybe a good first step to avoiding information pollution here would be to take the time you've spent repeating other people's technical claims you didn't understand, and use it to find some of those breakages you know for sure happened, as promised previously.
In general, I sense a passionate but youthful spirit, not an astro-turfer, and this isn't a group of professionals being disrespected because people still think they're a hobby project. Again, that's what the article is about.
Wow, I wasn't expecting this. These are fair critiques, as I am only somewhat informed about what is clearly a very sensitive topic.
For transparency, I attended ICML 2025, where Ollama had set up a booth, and had a casual conversation with the representatives there (one of whom turned out to lead the Ollama project) before they went to their 2nd birthday celebration. I'm repeating what I can remember from the conversation, about ten minutes or so. I am a researcher not affiliated with the development of llama.cpp or Ollama.
> for a project you root for
I don't use Ollama, and I certainly don't root for it. I'm a little disappointed that people would assume this. I also don't use llama.cpp and it seems that is the problem. I'm not really interested in the drama, I just want to understand what these projects want to do. I work in theory and try to stay up to date on how the general public can run LLMs locally.
> no one realized until after it shipped.
I'm not sensing that the devs at Ollama are particularly competent, especially when compared to the behemoths at llama.cpp. To me, this helps explain why their actions differ from their claimed motivation, but this is probably because I prefer to assume incompetence over something sinister.
> as promised previously...
I don't think I made any such promises. I can't cite those claimed breakages, because I do not remember further details from the conversation needed to find them. The guy made a strong point to claim they had happened and there was enough frustrated rambling there to believe him. If I had more, I would have cited them. I remember seeing news regarding the deprecation of multimodal support hence the "I could swear that" comment (although I regret this language, and wish I could edit the comment to tone it down a bit), but I do not think this was what the rep cited. I had hoped that someone could fill in the blanks there, but if knowledgeable folks claim this is impossible (which is hard to believe for a project of this size, but I digress), I defer to their expert opinion here.
> llama.cpp had early access to it just as well.
I knew this from the conversation, but was told Ollama had even earlier discussions with OpenAI as the initial point of contact. Again, this is what I was told, so feel free to critique it. At that time, the rep could not explicitly disclose that it was OpenAI, but it was pretty obvious from the timing due to the delay.
> avoiding information pollution
I'm a big believer in free speech and that the truth will always come out eventually.
> I'm seeing more of a gish-gallop than an attempt to recalibrate your technical understanding...
> I sense a passionate but youthful spirit, not an astro-turfer,
This is pretty humbling, and frankly comes off a little patronizing, but I suppose this is what happens when I step out of my lane. My objective was to stimulate further conversation and share a perspective I thought was unique. I can see this was not welcome, my apologies.
You should revisit this in a year, I think you'll understand how you came off a bit better. TL;DR: horrible idea to show up gossiping to cast aspersions, then disclaim responsibility because it's by proxy and you didn't understand, on a professional specialist site.
Your rationale for your choices doesn't really matter; you made the choice to cast aspersions, repeatedly, on multiple stories, in multiple comments.
Handwaving about how the truth will come out through discussion, while you repeatedly cast aspersions you disclaim understanding of, express surprise that you got a reply, and keep making up more stuff to justify the initial aspersions, is indicative of how coherent your approach seems to the rest of us.
In general, I understand why you feel patronized. Between the general incoherence, this thread, and the previous thread where you're applying not-even-wrong, in the Pauli sense, concepts like LTS and Linux kernel dev to this situation, the only real choice that lets anyone assume the best of you is you're 15-25 and don't have a great grasp of tech yet.
Otherwise, you're just some guy gossiping, getting smarmy about it, with enough of an IQ to explain back to yourself why you didn't mean to do it. 0 idea why someone older and technical would do all that on behalf of software they don't use and don't understand.
Ollama isn't anything anymore. Ollama had a value proposition a year ago, they don't have one now.
And also, Ollama is claiming not to use llama.cpp, despite continuing to use llama.cpp.
this looks ok on paper, but doesn't hold up in reality. ollama is full of bugs, problems and issues llama.cpp solved ages ago. this thread is a good example of that.
As someone who has participated in llama.cpp development: it's simple, Ollama doesn't want to give credit to llama.cpp. If llama.cpp went closed, Ollama would fall behind; they blatantly rip llama.cpp. Who cares, though? All they have to say is "powered by llama.cpp". It won't drive most users away from Ollama; most folks will prefer Ollama and power users will prefer llama.cpp. But their ego won't let them.
On llama.cpp breaking things: that's the pace of innovation. It feels like a new model with a new architecture is released every week. Guess what? It's the same thing we saw with drivers for Unix systems back in the day: no documentation. So the implementation is based on whatever can be figured out from the arXiv paper and other implementations like transformers/vLLM (Python -> C). Quite often the models released from labs are "broken", and Jinja templates ain't easy! Bad templates will break model generation, tool calling, agentic flow, etc. Folks will sometimes blame llama.cpp. Sometimes the implementation is correct, but since its main format is GGUF and anyone can generate a GGUF, quite often an experimental GGUF is generated and released by folks excited to be the first to try a new model. Then llama.cpp gets the blame.
Feels like BS. I guess wrapping 2 or even more versions should not be that much of a problem.
There was drama that ollama doesn't credit llama.cpp, and most likely crediting it was „not aligning with their goals”.
Thank you. This is genuinely a valid reason even from a simple consistency perspective.
(edit: I think -- after I read some of the links -- I understand why Ollama comes across as less of a hero. Still, I am giving them some benefit of the doubt since they made local models very accessible to plebs like me; and maybe I can graduate to no ollama )
I think this is the thing: if you can use llama.cpp, you probably shouldn't use Ollama. It's designed for the beginner.
You shouldn't use Ollama as a beginner either. It comes with crazy beginner-hostile defaults out of the box.
Hmm? I would argue against that line of argumentation. It is ridiculously easy to get it working out of the box. Once the user starts running up against the obvious restrictions resulting from the trade-offs in the defaults, they can move on to something more custom. Wouldn't that be the definition of beginner-friendly?
I am biased since I effectively started with Ollama as my main local LLM, so take this response for what it is.
Still, you got me curious. Which defaults do you consider hostile (not disagreeing; this is pure curiosity)?
> it does not align with their goals
Ollama is a scam trying to E-E-E the rising hype wave of local LLMs while the getting is still good.
Sorry, but somebody has to voice the elephant in the room here.
It'd be easy enough for ollama alternatives -- they just need to make a CLI front end that lets you run a model with reasonable efficiency without passing any flags. That's really ollama's value, as far as I can tell.
Ollama itself doesn't pass that test. (Broken context settings, non-standard formats and crazy model names.)
I haven't experienced this personally, but I have stuck with pretty mainstream models like llama, gemma, deepseek, etc.
> I recently discovered that ollama no longer uses llama.cpp as a library, and instead they link to the low level library (ggml) which requires them to reinvent a lot of wheel for absolutely no benefit (if there's some benefit I'm missing, please let me know).
Here is some relevant drama on the subject:
https://github.com/ollama/ollama/issues/11714#issuecomment-3...
It is not true that Ollama doesn't use llama.cpp anymore. They built their own library, which is the default, but also really far from being feature complete. If a model is not supported by their library, they fall back to llama.cpp. For example, there is a group of people trying to get the new IBM models working with Ollama [1]. Their quick/short term solution is to bump the version of llama.cpp included with Ollama to a newer version that has support. And then at a later time, add support in Ollama's library.
[1] https://github.com/ollama/ollama/issues/10557
> Does ollama support strict structured output or strict tool calls adhering to a json schema?
As far as I understand this is generally not possible at the model level. Best you can do is wrap the call in a (non-llm) json schema validator, and emit an error json in case the llm output does not match the schema, which is what some APIs do for you, but not very complicated to do yourself.
Someone correct me if I'm wrong
The inference engine (llama.CPP) has full control over the possible tokens during inference. It can "force" the llm to output only valid tokens so that it produces valid json
and in fact leverages that control to constrain outputs to those matching user-specified BNFs
https://github.com/ggml-org/llama.cpp/tree/master/grammars
Very cool!
Ahh, I stand corrected, very cool!
no that's incorrect - llama.cpp has support for providing a context free grammar while sampling and only samples tokens that would conform to the grammar, rather than sampling tokens that would violate the grammar
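For reference, the grammar format is GBNF. A toy grammar that only admits a yes/no JSON object looks something like this (a sketch modelled on the examples in llama.cpp's grammars/ directory; the file name below is made up):

    root ::= "{" ws "\"answer\"" ws ":" ws ("\"yes\"" | "\"no\"") ws "}"
    ws   ::= [ \t\n]*

You pass it at inference time with something like llama-cli -m model.gguf --grammar-file answer.gbnf, if I'm remembering the flag correctly.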
Very interesting, thank you!
This is misinformation. Ollama has supported structured outputs that conform to a given JSON schema for months. Here's a post about this from last year: https://ollama.com/blog/structured-outputs
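From memory, the request looks roughly like this (model name is just an example; the exact shape is in the linked blog post):

    curl http://localhost:11434/api/chat -d '{
      "model": "llama3.1",
      "messages": [{"role": "user", "content": "Tell me about Canada."}],
      "stream": false,
      "format": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "capital": {"type": "string"}
        },
        "required": ["name", "capital"]
      }
    }'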
This is absolutely possible to do at the model level via logit shaping. Llama-cpp’s functionality for this is called GBNF. It’s tightly integrated into the token sampling infrastructure, and is what ollama builds upon for their json schema functionality.
> It’s tightly integrated into the token sampling infrastructure, and is what ollama builds upon for their json schema functionality.
Do you mean that the functionality of generating an EBNF grammar from a JSON schema and using it for sampling is part of ggml, and all they have to do is use it?
I assumed this was part of llama.cpp, and another feature they would have to re-implement and maintain.
The whole point of GBNF is to serve as part of the API that lets downstream applications control token sampling in a high-level way without having to drop to raw logit distributions or pull model-specific tricks.
Ollama has a hardcoded GBNF grammar to force generic json output for example, the code is here: https://github.com/ollama/ollama/blob/da09488fbfc437c55a94bc...
Ollama can also turn user-passed json schema into a more tightly specified GBNF grammar, the code is here and is a bit harder to understand: https://github.com/ollama/ollama/blob/da09488fbfc437c55a94bc...
This thread was about doing structured generation in a model-agnostic way without wrapping try/except around json.parse(), and GBNF is _the_ way to do that.
>(if there's some benefit I'm missing, please let me know).
Makes their VCs think they're doing more, and have more ownership, rather than being a do-nothing wrapper with some analytics and S3 buckets that rehost models from HF.
> Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket, and forward requests to it
I'd recommend taking a look at https://github.com/containers/ramalama. It's more similar to what you're describing in the way it uses llama-server, and it is container-native by default, which is nice for portability.
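Usage mirrors the docker/podman CLI, roughly like this (from memory; the model name is just an example, check the README for the exact commands):

    ramalama pull tinyllama
    ramalama serve tinyllama   # spins up llama-server inside a container and exposes an API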
ggerganov explains the issue: https://github.com/ollama/ollama/issues/11714#issuecomment-3...
ggerganov is my hero, and... it's a good thing this got posted, because I saw in the comments that --flash-attn --cache-reuse 256 could help with my setup (M3 36GB + RPC to M1 16GB). Figuring out what params to set, and at what value, is a lot of trial and error; Gemini does help a bit to clarify what params like top-k are going to do in practice. Still, the whole load-balancing with RPC is something I think I'm going to have to read the source of llama.cpp to really understand (oops, I almost wrote grok, damn you Elon). Anyways, ollama is still not doing distributed load, and yeah, I guess using it is a stepping stone...
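For anyone else poking at RPC, the rough shape of my setup is something like this (flags from memory, double-check rpc-server --help; the IP and model path are placeholders):

    # on the M1 (the worker):
    rpc-server -p 50052

    # on the M3, point llama-server at the worker:
    llama-server -m model.gguf --rpc 192.168.1.20:50052 --flash-attn --cache-reuse 256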
This is the comment people should read. GG is amazing.
Ollama forked to get it working for day 1 compatibility. They need to get their system back in line with mainline because of that choice. That's kinda how open source works.
The uproar over this (mostly on reddit and x) seems unwarranted. New models regularly have compatibility issues for much longer than this.
GG clearly mentioned they did not contribute anything to upstream.
The named anchor in this URL doesn't work in Safari. Safari correctly scrolls down to the comment in question, but then some Javascript on the page throws you back up to the top again.
I noticed it the other way, llama.cpp failed to download the Ollama-downloaded gpt-oss 20b model. Thought it was odd given all the others I tried worked fine.
Figured it had to be Ollama doing Ollama things, seems that was indeed the case.
Oh wow...
> llama.cpp failed to download the Ollama-downloaded gpt-oss 20b model
That should of course read
"llama.cpp failed to load the Ollama-downloaded gpt-oss 20b model"
ggerganov is a treasure. the man deserves a medal.
Confusing title - thought this was about Ollama finally supporting sharded GGUF (i.e. the Hugging Face default for large GGUFs over 48GB).
https://github.com/ollama/ollama/issues/5245
Sadly it is not, and the issue remains open after over a year, meaning Ollama cannot run the latest SOTA open-source models unless they convert them to their proprietary format, which they do not consistently do.
No surprise I guess, given they've taken VC money, refuse to properly attribute their use of things like llama.cpp and ggml, have their own model format for... reasons? And have over 1800 open issues...
llama-server, ramalama, or whatever model switcher ggerganov is working on (he showed previews recently) feel like the way forward.
I want to add an inference engine to my product. I was hoping to use ollama because it really helps, I think, make sure you have a model with the right metadata that you can count on working (I've seen that with llama.cpp, it's easy to get the metadata wrong and start getting rubbish from the LLM because the "stop_token" was wrong or something). I'd thought ollama was a proponent of GGUF, which I really like as it standardizes metadata?!
What would be the best way to use llama.cpp and models that use GGUF these days? Is ramalama a good alternative (I guess it is, but it's not completely clear from your message)? Or just use llama.cpp directly, in which case how do I ensure I don't get rubbish (like the model asking and answering questions by itself without ever stopping)?
Meant to say llama-swap instead of llama-server. llama-swap adds a GUI and dynamic model switching on top of llama-server. It's somewhat tricky to set up, as it relies on a .yaml file that is poorly documented for use with Docker, but something like:
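(a rough sketch from memory; model name and path are placeholders, and the exact keys are in the llama-swap README)

    models:
      "qwen2.5-7b":
        cmd: >
          /app/llama-server --port ${PORT}
          -m /models/qwen2.5-7b-instruct-q4_k_m.gguf
        ttl: 300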
When run via Docker this gets you a similar setup to Ollama. The yaml file also needs a TTL set if you want it to unload models after an idle period.

Ollama's native models in their marketplace supposedly have these params set correctly, to save you having to do this config, but in practice this is hit or miss, and often these change from day 0 of the release.
This title makes no sense and it links nowhere helpful.
It's "Ollama's forked ggml is incompatible with other gpt-oss GGUFs"
and it should link to GG's comment[0]
[0] https://github.com/ollama/ollama/issues/11714#issuecomment-3...
HN strips any "?|#" etc at the end of URLs, and you cannot edit the URL after submissions (like you can the title).
As for the title, yeah, I didn't work too hard on making it good, sorry.
Why is anyone still using this? You can spin up a llama.cpp server and have a more optimized runtime. And if you insist on containers, you can go for ramalama https://ramalama.ai/
Ollama is a more out-of-the-box solution. I also prefer llama.cpp for the more FOSS aspects, but Ollama is a simpler install, model download (this is the biggest convenience IMO), and execution. That's why I believe it's still fairly popular as a solution.
By the way, you can download models straight from hugging face with llama.cpp. It might be a few characters longer than the command you would run on ollama, but still.
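For example, something like this (the repo name is just an example, and I might be misremembering the exact flag):

    # llama.cpp pulls the GGUF from Hugging Face and caches it locally
    llama-server -hf ggml-org/gemma-3-1b-it-GGUF

    # vs. the ollama spelling
    ollama run gemma3:1b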
Then you need to also provide appropriate metadata and format messages correctly according to the chat format, which I believe llama.cpp doesn't do by default, or can it? I had trouble formatting messages correctly using llama.cpp, due to a possible mismatch in metadata, which ollama seems to handle, but I would love to know if this is wrong.
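For reference, what I was trying was along these lines (flags from memory, and I may have the defaults wrong):

    # use the chat template embedded in the GGUF metadata, with the full Jinja engine
    llama-server -m model.gguf --jinja

    # or force one of the built-in templates
    llama-server -m model.gguf --chat-template llama3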
Plus a huggingface token to access models that require you to beg for approval. Ollama hosted models don't require that (which may not be legit but most users don't care).
You can, but you have to know where to look, and you have to have some idea of what you're doing. The benefit of Ollama is that the barrier to entry is really low, as long as you have the right hardware.
To me, one of the benefits of running a model locally is learning how all this stuff works, so Ollama never had any appeal. But most people just want stuff to work without putting in the effort to understand how it all fits together. Ollama meets that demand.
I disagree that Ollama is easier to install. I tried to enable Vulkan on Ollama and it is nightmarish, even though the underlying llama.cpp code supports it with a simple envar. Ollama was easy 2 years ago, but has been progressively getting worse over time.
I think people just don't know any better. I also used Ollama way longer than I should have. I didn't know that Ollama was just llama.cpp with a thin wrapper. My quality of life improved a lot after I discovered llama.cpp.
I think the title buries the lede? It's specific to GPT-OSS and exposes the shady stuff Ollama is doing to acquiesce/curry favor/partner with/get paid by corporate interests.
I think "shady" is a little too harsh - sounds like they forked an important upstream project, made incompatible changes that they didn't push upstream or even communicate with upstream about, and now have to deal with the consequences of that. If that's "shady" (despite being all out in the open) then nearly every company I've worked for has been "shady."
There's a reddit thread from a few months ago that sort of explains what people don't like about ollama, the "shadiness" the parent references:
https://www.reddit.com/r/LocalLLaMA/comments/1jzocoo/finally...
There's a GitHub issue, open since last year, about the missing license in ollama. They have not bothered to reply, which goes to show how much they care. Also, it's a YC company; I see more and more morally bankrupt companies making the cut recently. Why is that?
I think most of them were morally bankrupt, you might just be realizing now.
Just days ago ollama devs claimed [0] that ollama no longer relies on ggml / llama.cpp. Here is their pull request (+165,966 −47,980) to reimplement (copy) llama.cpp code in their repository.
https://github.com/ollama/ollama/pull/11823
[0] https://news.ycombinator.com/item?id=44802414#44805396
The PR you linked to says “thanks to the amazing work done by ggml-org” and doesn’t remove GGML code, it instead updates the vendored version and seems to throw away ollama’s custom changes. That’s the opposite of disentangling.
Here’s the maintainer of ggml explaining the context behind this change: https://github.com/ollama/ollama/issues/11714#issuecomment-3...
Not against the overall sentiment here, but to be fair, quoting the counterpoint from the linked HN comment:
> Ollama does not use llama.cpp anymore; we do still keep it and occasionally update it to remain compatible for older models for when we used it.
The linked PR is doing "occasionally update it" I guess? Note that "vendored" in the PR title often means to take a snapshot to pin a specific version.
gpt-oss is not an "older model"
ollama is a lost cause. they are going through a very aggressive phase of enshittification right now.
I disagree. Ollama’s reason to be is to make things simple, not to always be on the cutting edge. I use Ollama when I can because of this simplicity. Since I bought a 32G integrated memory Mac 18 months ago, I have run so many models using Ollama, with close to zero problems.
The simple thing to do is to just use the custom quantization that OpenAI used for gpt-oss and use GGUF for other models.
Using Huggingface, LM Studio, etc. is the Linux metaphor of flexibility. Using Ollama is sort of like using macOS
What is the currently favored alternative for simply running 1-2 models locally, exposed via an API? One big advantage of Ollama seems to be that they provide fully configured models, so I don't have to fiddle with stop words, etc.
llama-swap if you need more than 1 model. It wraps llama.cpp, and has a Docker container version that is pretty easy to work with.
I just use llama.cpp server. It works really well. Some people recommend llama-swap or kobold but I never tried them.
ollama is certain to become a rent-seeking wrapper in the future, following the Docker playbook.
classic Docker Hub playbook: spread the habit for free → capture workflows → charge for scale.
The moat isn't in inference speed; it's in controlling the distribution and default UX. Once they own that, they can start rent-gating it.
llama.cpp is a mess and ollama is right to move on from it
ollama is still using llama.cpp. they are just denying that they are :)
For folks wrestling with Ollama, llama.cpp, or local LLM versioning: did you check out Docker's new feature, Docker Model Runner?
Docker Model Runner makes it easy to manage, run, and deploy AI models using Docker. Designed for developers, Docker Model Runner streamlines the process of pulling, running, and serving large language models (LLMs) and other AI models directly from Docker Hub or any OCI-compliant registry.
Whether you're building generative AI applications, experimenting with machine learning workflows, or integrating AI into your software development lifecycle, Docker Model Runner provides a consistent, secure, and efficient way to work with AI models locally.
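From memory, the CLI looks roughly like this (the model name is the example from their docs; see the links below for the real syntax):

    docker model pull ai/smollm2
    docker model run ai/smollm2 "Give me a fact about whales."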
For more details check this out: https://docs.docker.com/ai/model-runner/
I have a video on using DMR https://youtu.be/3p2uWjFyI1U?si=dazN8rRRdIbAa8Sl