This was the perfect opportunity to share the evidence. I think undisclosed quantization is definitely a thing. We need benchmarks to be periodically re-evaluated to ward against this.
Providers should keep timestamped models fixed and assign modified versions a new timestamp (and a new price, if they want). The model with the "latest" tag could change over time, like a Docker image. Then we could make an informed decision about which version to use. Companies want to cost-optimize their cake and eat it too.
edit: I have the same complaint about my Google Home devices. The models they use today are indisputably worse than the ones they used five whole years ago. And features have been removed without notice. Qualitatively, the devices are no longer what I bought.
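Back to the API side: a hedged sketch of what that kind of pinning could look like with the OpenAI Python client, using a dated snapshot name versus the moving alias (the model names are just examples, not a claim about any particular provider's catalog):

```python
# Sketch: pin a dated snapshot instead of the moving "latest"-style alias.
# Model names are illustrative; check your provider's model list.
from openai import OpenAI

client = OpenAI()

PINNED_MODEL = "gpt-4o-2024-08-06"   # frozen snapshot, like pinning an image digest
LATEST_ALIAS = "gpt-4o"              # moving alias, like a ":latest" tag

resp = client.chat.completions.create(
    model=PINNED_MODEL,
    messages=[{"role": "user", "content": "Summarize this clause in one sentence."}],
)
print(resp.choices[0].message.content)
```

Of course, pinning the name only helps if the provider actually freezes everything behind it, which is the whole complaint.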
I commented on the forum asking Sarge whether they could share some of their test results.
If they do, I think that it will add a lot to this conversation. Hope it happens!
I guarantee you the weights are already versioned like you're describing. Each training run results in a static bundle of outputs and these are very much pinned (OpenAI has confirmed multiple times that they don't change the model weights once they issue a public release).
> “Not quantized. Weights are the same. If we did change the model, we’d release it as a new model with a new name in the API.”
- [Ted Sanders](https://news.ycombinator.com/item?id=44242198) (OpenAI)
The problem is that many of these issues stem from the broader serving infrastructure, like numerical instability at inference time. Since that affects the whole service pipeline, the logic can't really be encapsulated in a frozen environment like a Docker container. I suppose _technically_ they could maintain a separate inference cluster for each point release, but then previous models wouldn't benefit from common infrastructure improvements, load balancing would be harder to shard across GPUs, and the coordination overhead might be high enough to make it effectively impossible.
https://www.anthropic.com/engineering/a-postmortem-of-three-... https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
Sorry, but this makes no sense. Numerical instability would lead to random fluctuations in output quality, but not to a continuous slow decline like the OP described.
I've heard of similar experiences from real-life acquaintances: a prompt worked reliably for hundreds of requests per day over several months, and then, once a newer model was released, the existing model suddenly started making mistakes, ignoring parts of the prompt, and so on.
I agree, it doesn't have to be deliberate malice like intentionally nerfing a model to make people switch to the newer one. It might just be that fewer resources are allocated to the older model once the newer one is available, and so the inference parameters change. But some effect around the release of a newer model does seem to be there.
I'm responding to the parent comment who's suggesting we version control the "model" in Docker. There are infra reasons why companies don't do that. Numerical instability is one class of inference issues, but there can be other bugs in the stack separate from them intentionally changing the weights or switching to a quantized model.
As for the original forum post:
- Multiple numerical computation bugs can compound to make things worse (we saw this in the latest Anthropic post-mortem)
- OP didn't provide any details on eval methodology, so I don't think it's worth speculating on this anecdotal report until we see more data
That's a great point. However, I think we can treat the serving pipeline as part and parcel of the model, for practical purposes. So it is dishonest of companies to say they haven't changed the model while undertaking such cost optimizations that impair the models' effective intelligence.
In addition to quantization, I suspect the additions they make continually to their hidden system prompt for legal, business, and other reasons slowly degrade responses over time as well.
This is quite similar to all the mitigations Intel had to ship because of Spectre; I bet those system prompts have grown enormously.
I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e., they are expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is; just because a model produces nine excellent responses doesn't mean the tenth won't be garbage.
They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?
Also, from what I understand from the article, it's not a difficult task but an easily machine checkable one, i.e. whether the output conforms to a specific format.
With T=0 on the same model you should get the same exact output text. If they are not getting it, other environmental factors invalidate the test result.
If it was random luck, wouldn't you expect about half the answers to be better? Assuming the OP isn't lying I don't think there's much room for luck when you get all the questions wrong on a T/F test.
TFA is about someone running the same test suite, with temperature 0 and fixed inputs and fixtures, on the same model for months on end.
What's missing is the actual evidence, which I would of course love to see. But assuming they're not actively lying, this is not as subjective as you suggest.
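The kind of harness being described is simple enough to sketch. Something like the following, where the prompts, golden outputs, and model name are all hypothetical placeholders, would be enough to produce the missing graph:

```python
# Hypothetical drift check: replay fixed prompts at temperature 0 and
# diff the responses against previously recorded "golden" outputs.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-2024-08-06"  # illustrative pinned snapshot name

with open("golden_outputs.json") as f:   # {"prompt": "expected output", ...}
    golden = json.load(f)

mismatches = 0
for prompt, expected in golden.items():
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    if resp.choices[0].message.content.strip() != expected.strip():
        mismatches += 1

print(f"{mismatches}/{len(golden)} prompts drifted from their golden outputs")
```

Run daily and plotted over time, that would settle the subjectivity question one way or the other.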
Yes, exactly. My theory is that the novelty of a new generation of LLMs' performance tends to inflate people's perception of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of a model change.
Your theory does not hold up for this specific article: they carefully explained that they send identical inputs to the model each time and observe progressively worse results with all other variables unchanged. (Though, to be fair, others have noted they provided no replication details on how they arrived at these results.)
I see your point, but no, it's getting objectively worse. I have a similar experience casually using ChatGPT for various use cases: when 5 dropped, I noticed it was very fast but oddly got some details off. As time went on it became both slower and the output deteriorated.
fta: “I am glad I have proof of this with the test system”
I think they have receipts, but did not post them there
A lot of the people making these claims say they have proof, but the details are never shared.
Even a simple graph of the output would be better than nothing, but instead it’s just an empty claim.
That's been my experience too
but I use local models, sometimes the same ones for years already, and the consistency there (and how well it matches expectations) is noteworthy, while I do have doubts about the quality consistency I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.
so perhaps it's just a matter of transparency
but I think there is continual fine-tuning occurring, alongside filters being added and removed in an opaque way in front of the model
Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.
I read the article. No proof was included. Not even a graph of declining results.
I'm confused why this is addressed to Azure instead of OpenAI. Isn't Azure just offering a wrapper around ChatGPT?
That said, I would also love to see some examples or data, instead of just "it's getting worse".
I know that OpenAI has made computing deals with other companies, and as time goes on, the share of inference they run outside Microsoft Azure data centers will grow, but I doubt that much, if any, of it has moved yet, so that's not a reason for differences in model performance.
With that said, Microsoft has a different level of responsibility, both to its customers and to its stakeholders, to provide safety than OpenAI or any other frontier provider. That's not a criticism of OpenAI or Anthropic or anyone else, who I believe are all trying their best to provide safe usage. (Well, other than xAI and Grok, for which the lack of safety is a feature, not a bug.)
The risk to Microsoft of getting this wrong is simply higher than it is for other companies, and that's why they have a strong focus on Responsible AI (RAI) [1]. I don't know the details, but I have to assume there's a layer of RAI processing on models served through Azure OpenAI that isn't there when using OpenAI models directly through the OpenAI API. That layer is valuable to companies that choose to run their inference through Azure and also want to maximize safety.
I wonder if that's where some of the observed changes are coming from. I hope the commenter posts their proof for further inspection. It would help everyone.
[1]: https://www.microsoft.com/en-us/ai/responsible-ai
I don't remember where I saw it, but I recall a claim that Azure-hosted models performed worse than those hosted by OpenAI.
Explains why the enterprise Copilot ChatGPT wrapper that they shoehorn into every piece of Office 365 performs worse than a badly configured local LLM.
They most definitely do. They have been lobotomized in some way to be ultra corporate friendly. I can only use their M365 Copilot at work and it's absolute dogshit at writing code more than maybe 100 lines. It can barely write correct PowerShell. Luckily, I really only need it for quick and dirty short PS scripts.
I've been using Azure AI Foundry for an ongoing project, and have been extremely dissatisfied.
The first issue I ran into was their lack of support for LLaMA tool calls. Microsoft stated in February that they were working on it [0] and that they were closing the ticket because it was being tracked internally. I'm not sure why they've been unable to do, in over six months, what took me two hours, but I am sure they wouldn't be upset by me using the much more expensive OpenAI models instead.
There are also consistent performance issues, even on small models, as mentioned elsewhere, and this is with a request rate on the order of one per minute. You can solve that with provisioned throughput units, but the cheapest option is one of the GPT models at a minimum of about $10k/month (a bit under half the cost of just renting an A100 server). DeepSeek was a minimum of around $72k/month. I don't remember any other non-OpenAI models having a provisioned option.
Given that current usage without provisioning is approximately in the single dollars per month, I have some doubts as to whether we'd be getting our money's worth having to provision capacity.
Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?
Even with temperature 0, the LLM output will not be deterministic. It will just have less randomness (not defined precisely) than with temperature 1. There was a recent post on the frontpage about fully deterministic sampling, but it turns out to be quite difficult.
It's because batch size is dynamic. So a different batch size will change the output even on temp 0.
It could be that performance on temp zero has declined but performance on a normal temp is the same or better.
I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.
I don't think it's a valid measure across models but, as in the OP, it's a great measure for when they mess with "the same model" behind the scenes.
That being said, we do also keep a test suite to check that model updates don't produce worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because they stopped being able to complete tasks (on the same data) that earlier versions could. I don't blame Anthropic; I would be crazy to assume that new models are a strict improvement across all tasks and domains.
I do just wish they would stop deprecating old models; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, that's only possible with local models.
I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?
Pure sci-fi idea: what if actually nothing was changed, but RNGs were becoming less random as we extract more randomness out of the universe?
I bet they did both. If I'm reading the documentation right you have to supply a seed in order to get "best effort" determinism.
https://learn.microsoft.com/en-us/azure/ai-foundry/openai/re...
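A minimal sketch of that against Azure OpenAI, assuming the SDK's `seed` parameter and the returned `system_fingerprint`; the endpoint, API version, key, and deployment name below are placeholders:

```python
# Sketch: ask Azure OpenAI for "best effort" determinism.
# Endpoint, API version, key, and deployment name are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    api_version="2024-06-01",
    api_key="...",
)

resp = client.chat.completions.create(
    model="my-gpt4o-deployment",   # Azure deployment name, not a model family name
    messages=[{"role": "user", "content": "Return PASS or FAIL for: ..."}],
    temperature=0,
    seed=1234,                     # best-effort reproducibility only
)

# If system_fingerprint changes between runs, the backend configuration changed.
print(resp.system_fingerprint, resp.choices[0].message.content)
```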
I've noticed this with Claude Code recently. A few weeks ago, Claude was "amazing" in that I could feed it some context and a specification, and it could generate mostly correct code and refine it in a few prompts.
Now, I can try the same things, and Claude gets it terribly wrong and works itself into problems it can't find its way out of.
The cynical side of me thinks this is being done on purpose, not to save Anthropic money, but to make more money by burning tokens.
This brings up a point many will not be aware of. If you know the random seed, the prompt, and the hash of the model's binary file, the output is completely deterministic. You can use this to check whether they are in fact swapping your requests out to cheaper models than the one you're paying for. This level of auditability is a strong argument for using open-source, commoditized models, because you can easily check whether the vendor is ripping you off.
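A rough sketch of the kind of audit being described, assuming you can get a bitwise-deterministic reference run of the same open weights (the replies below explain why that assumption is shaky in practice):

```python
# Rough sketch: fingerprint a provider's output against a local reference run
# of the same open weights. Assumes greedy decoding and a fully deterministic
# stack, which batching and GPU parallelism make hard to guarantee.
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

provider_output = "...text returned by the hosted API..."              # placeholder
reference_output = "...text from your own run of the same weights..."  # placeholder

if fingerprint(provider_output) != fingerprint(reference_output):
    print("Divergence: different/quantized weights, or just nondeterminism in the stack.")
```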
Pretty sure this is wrong: requests are batched and the batch size can affect the output; also, GPUs are highly parallel, so there can be many race conditions.
Yup. Floating-point math turns those race conditions into numerical differences, reintroducing non-determinism regardless of the inputs used.
What's the conversation that you're looking to have here? There are fairly widespread claims that GPT-5 is worse than 4, and that's what the help article you've linked to says. I'm not sure how this furthers dialog about or understanding of LLMs, though, it reads to _me_ like this question just reinforces a notion that lots of people already agree with.
What's your aim here, sgt3v? I'd love to positively contribute, but I don't see how this link gets us anywhere.
Maybe to prompt more anecdotes on how gpt-$ is the money-making GPT, where they gut quality and hold prices steady to reduce losses?
I can tell you that what the post describes is exactly what I've seen too: degraded performance, and excruciatingly slow.
Could it be the result of caching of some sort? I suppose in the case of an LLM they can't use a direct cache, but they could group prompts by embeddings and return some most-common result, maybe? (This is just a theory.)
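To make the theory concrete, such a semantic cache might look roughly like this; the similarity threshold, embedding source, and in-memory store are all invented for the sketch:

```python
# Illustrative sketch of an embedding-based "semantic cache": if a new prompt
# embeds close enough to a cached one, serve the cached (possibly stale) answer.
import numpy as np

THRESHOLD = 0.95          # arbitrary similarity cutoff for the sketch
cache = []                # list of (embedding, cached_response) pairs

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(prompt_embedding: np.ndarray):
    best = max(cache, key=lambda item: cosine(item[0], prompt_embedding), default=None)
    if best is not None and cosine(best[0], prompt_embedding) >= THRESHOLD:
        return best[1]    # cache hit: skip the model entirely
    return None           # cache miss: fall through to real inference

def store(prompt_embedding: np.ndarray, response: str) -> None:
    cache.append((prompt_embedding, response))
```

The tradeoff is obvious: cache hits are cheap, but a near-miss can serve a subtly wrong answer to a slightly different question.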
At least on OpenRouter, you can often verify what quant a provider is using for a particular model.
This is why we have open source. I noticed this with cursor, it’s not just an azure problem.
I’m convinced all of the major LLM providers silently quantize their models. The absolute worst was Google’s transition from Gemini 2.5 Pro 3-25 checkpoint to the May checkpoint, but I’ve noticed this effect with Claude and GPT over the years too.
I couldn’t imagine relying on any closed models for a business because of this highly dishonest and deceptive practice.
You can be clever with language also. You can say “we never intentionally degrade model performance” and then claim you had no idea a quant would make perf worse because it was meant to make it better (faster).
It’s a good thing the author provided no data or examples. Otherwise, there might be something to actually talk about.
Since when did LLMs become deterministic?
LLMs are just software + data and can be made deterministic, in the same way a pseudo-random number generator can be made deterministic by using the same seed. For an LLM, you typically set the temperature to 0 or fix the random seed, run it on the same hardware (or an emulation of it), and otherwise ensure the floating-point calculations produce exactly the same results. I think that's it. In reality, yes, it's not that easy, but it's possible.
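For a local model, a minimal sketch of that is greedy decoding plus a fixed seed (the model name is just an example, and as the reply below notes, bitwise reproducibility still depends on the hardware and kernels):

```python
# Sketch: deterministic-ish local generation via greedy decoding and a fixed seed.
# Model name is an example; exact reproducibility still depends on hardware/kernels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(0)

name = "Qwen/Qwen2.5-0.5B-Instruct"   # illustrative small open-weights model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Explain what temperature 0 means.", return_tensors="pt")
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)  # greedy decoding
print(tok.decode(out[0], skip_special_tokens=True))
```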
Unfortunately, because floating-point addition isn't always associative, and because GPUs don't always perform calculations in the same order, you won't always get the same result even with a temperature of zero.
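A tiny float32 demonstration of why: summing the same values in a different order (which is effectively what a different batch size or reduction split does) can produce a slightly different result:

```python
# float32 addition is not associative: the same values summed in a different
# order (e.g. a different batch/reduction split) can differ in the last bits.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s1 = np.sum(x)                                      # one reduction order
s2 = np.sum(x.reshape(1000, 1000), axis=0).sum()    # chunked, different order
print(s1, s2, s1 == s2)   # typically differ in the low-order bits
```

Differences like that get amplified once they flip an argmax somewhere in a long generation.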
I used to think running your own local model is silly because it’s slow and expensive, but the nerfing of ChatGPT and Gemini is so aggressive it’s starting to make a lot more sense. I want the smartest model, and I don’t want to second guess some potentially quantized black box.
I'm sure MSFT will offer this person some upgraded API tier that somewhat improves the issues, though not terrifically, for only ten times the price.
Am I the only person who can sense the exact moment an LLM-written response kicked in? :) "sharing some of the test results/numbers you have would truly help cement this case!" - c'mon :)
I actually 100% wrote that comment myself haha!! See https://news.ycombinator.com/item?id=45316437
I think it would have sounded more reasonable in French, which is my actual native tongue. (i.e. I subconsciously translate from French when I'm writing in English)
((this comment was also written without AI!!)) :-)