They made this claim in a peer-reviewed paper submitted to Nature, but it’s not clear how reviewers could have evaluated whether it’s true.
If it is true, and the consensus is that we’re hitting the limits of how much these models can be improved, then the hypothesis that the entire market is in a bubble over-indexed on GPU costs [0] starts to look more credible.
At the very least, OpenAI and Anthropic look ridiculously inefficient. Mind you, given that the numbers on the Oracle deal don’t add up, this is all starting to sound insane already.
[0] https://www.wheresyoured.at/the-haters-gui/
Maybe, if you don't include the >$10m investment in H800 hardware. Still a lot cheaper than competitors though.
Yes, if we include a cost they didn't include, the cost would be different.
More like, if you exclude costs, things cost whatever you want to tell people they cost.
No, their calculation is based on a rental price of $2 per GPU-hour.
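For reference, the arithmetic behind the headline number is just GPU-hours times an assumed rental rate. A rough sketch, using the figures quoted in the DeepSeek-V3 technical report (roughly 2.788M H800 GPU-hours at $2 per GPU-hour), nothing else counted:

    # Rough sketch of how the headline figure is derived: GPU-hours for the
    # final training run multiplied by an assumed rental rate. Figures are
    # the ones quoted in the DeepSeek-V3 technical report.
    gpu_hours = 2.788e6       # H800 GPU-hours for the final V3 training run
    rental_rate = 2.0         # assumed rental price in USD per GPU-hour
    cost = gpu_hours * rental_rate
    print(f"${cost / 1e6:.3f}M")  # ~$5.576M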
Right, but they didn't use rented GPUs, so it's a purely notional figure. It's an appropriate value for comparison to other single training runs (e.g. it tells you that turning DeepSeek-V3 into DeepSeek-R1 cost much less than training DeepSeek-V3 from scratch) but not for the entire budget of a company training LLMs.
DeepSeek spent a large amount upfront to build a cluster that they can run lots of small experiments on over the course of several years. If you focus only on the successful runs, their costs look much lower than they actually were end-to-end.
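To make that concrete, here’s a toy comparison of the rental-priced cost of the one successful run versus what an end-to-end accounting would have to include. Everything beyond the notional run cost is a made-up number purely for illustration:

    # Toy illustration, hypothetical numbers: the quoted "training cost" only
    # prices the GPU-hours of the final successful run at a rental rate; an
    # end-to-end accounting would also cover exploratory/failed runs and the
    # upfront cluster purchase.
    rental_rate = 2.0                  # USD per GPU-hour, as in the paper
    final_run_gpu_hours = 2.788e6      # the one successful run (from the paper)
    exploratory_gpu_hours = 10e6       # hypothetical: ablations, failed runs, etc.
    cluster_capex = 500e6              # hypothetical upfront hardware spend

    notional = final_run_gpu_hours * rental_rate
    with_experiments = (final_run_gpu_hours + exploratory_gpu_hours) * rental_rate
    print(f"notional run cost:  ${notional / 1e6:.1f}M")         # ~$5.6M
    print(f"plus experiments:   ${with_experiments / 1e6:.1f}M")  # ~$25.6M
    print(f"plus cluster capex: ${cluster_capex / 1e6:.0f}M upfront, amortized over years")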
No, they’re saying that training a model, specifically DeepSeek-V3, costs X using N hours of Y GPU rental.