Doesn’t surprise me at all that people who know what they’re doing are building their own images with nix for ML. Tens of millions of dollars have been wasted in the past 2 years by teams who are too stupid to upgrade from buggy software bundled into their “golden” docker container, or too slow to upgrade their drivers/kernels/toolkits. It’s such a shame. It’s not that hard.
Edit: see also the horrors that exist when you mix nvidia software versions: https://developer.nvidia.com/blog/cuda-c-compiler-updates-im...
I use Nix and like it, but in terms of DX docker is still miles ahead. I liken it to Python vs Rust. Use the right tool.
Can you be explicit in how the dollars are being wasted? Maybe it's obvious to you, but how does an old kernel waste money?
The modern ML cards are much more raw than people realize. This isn’t a highly mature ecosystem with stable software, there are horrible bugs. It’s gotten better, but there are still big problems, and the biggest problem is that so many people are too stupid to use new releases with the fixes. They stick to the versions they already have because of nothing other than superstition.
Go look at the llama 3 whitepaper and look at how frequently their training jobs died and needed to be restarted. Quoting:
> During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions. Of these, 47 were planned interruptions due to automated maintenance operations such as firmware upgrades or operator-initiated operations like configuration or dataset updates. The remaining 419 were unexpected interruptions, which are classified in Table 5. Approximately 78% of the unexpected interruptions are attributed to confirmed hardware issues, such as GPU or host component failures, or suspected hardware-related issues like silent data corruption and unplanned individual host maintenance events.
[edit: to be clear, this is not meant to be a dig on the meta training team. They probably know what they’re doing. Rather, it’s meant to give an idea of how immature the nvidia ecosystem was when they trained llama 3 in early 2024. This is the kind of failure rates you can expect if you opt into using the same outdated software they were forced to use at the time.]
The firmware and driver quality is not what people think it is. There’s also a lot of low-level software, like NCCL and the toolkit, that exacerbates issues in specific driver and firmware versions. Grep for “workaround” in the NCCL source code and you’ll see some of them. It’s impossible to validate and test all permutations. It’s also worth mentioning that the drivers interact with a lot of other kernel subsystems. I’d point to HMM, the heterogeneous memory manager, which is hugely important for nvidia-uvm, was only introduced in v6.1, and still sees a lot of activity.
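(If you want to see the NCCL workarounds for yourself, this is all it takes; the repo below is just the public NCCL source, nothing specific to any setup:)

```bash
# Grab the public NCCL source and list the places where it papers over
# driver/firmware quirks. Output is illustrative, not exhaustive.
git clone --depth 1 https://github.com/NVIDIA/nccl.git
grep -rni "workaround" nccl/src/ | head -n 20
```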
Or go look at the number of patches being applied to mlx5. Not all of those patches get backported into stable trees. God help you if your network stack uses an out-of-tree driver.
It always cracks me up when people use the word "stupid" to insult other's intelligence. What a pathetically low-effort word to use.
When you’re responsible for supporting people who refuse to receive patches like this one [1], and those same people have the power to page your phone at 11pm on the weekend… you quickly learn how to call a spade, a spade.
[1]: https://patchwork.ozlabs.org/project/ubuntu-kernel/patch/202...
There is undoubtedly a better word than stupid. They're very likely not stupid. Careless, maybe. Inept, maybe. Irresponsible, maybe. Stubborn, maybe. More generously: overworked. Just probably not stupid.
What is the material difference here between inept and stupid?
A dictionary is an easy way to find out, but in the interest of good faith: stupid is a lack of intelligence, inept is a lack of skills.
To the point: I'd argue ineptitude is both more damning and accurate than stupidity in this particular case.
That wasn't what he used the word for. I understood his point perfectly: there are AI teams that are not knowledgeable or skilled enough to modify and enhance the docker images or toolkits that train/run the models. It takes some medium to advanced skills to get drivers to work properly. He used shorthand "too stupid to" instead of what I wrote above.
I fully understand. My issue is not with the point, my issue is being too lazy to articulate the point, and instead just saying "stupid."
Address the behavior, not the people.
New corollary: sometimes new tech gets built because you don't know how to correctly use existing tech.
Are you referring to this Nix effort or to Docker? Because that largely applies to most usages of Docker.
Saying that Docker could be replaced by a simple script that does chroot + ufw + nsenter is like saying that Dropbox could be a simple script using rsync and cron. That is, technically not wrong, but it completely misses the UX / product perspective.
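For concreteness, the chroot half of that comparison really is only a few lines, and that's exactly why it's not the interesting part. A rough sketch, assuming root and an unpacked root filesystem at ./rootfs (a placeholder path, e.g. from `docker export`):

```bash
#!/bin/sh
# "Docker in a simple script", minus everything that makes Docker a product:
# no image format, no layers, no registry, no networking or cgroup setup.
set -eu

ROOTFS=./rootfs

# New mount/UTS/IPC/net/PID namespaces, then chroot into the unpacked tree.
unshare --mount --uts --ipc --net --pid --fork \
    chroot "$ROOTFS" /bin/sh

# To join the running "container" later, find its PID and use nsenter:
#   nsenter --target <pid> --mount --uts --ipc --net --pid /bin/sh
```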
Saying that Nix is a simple script that does chroot + ufw + nsenter misses the point even more.
Great, can't wait for the systemd crew to come out with: Docker was Too Slow, So We Replaced It: Systemd in Production [asciinema]
No joke, it's already there, systemd-nspawn can run OCI containers.
Honestly I've been loving systemd-nspawn using mkosi to build containers, distroless ones too at that where sensible. Works a treat for building vms too.
Scales wonderfully, fine grained permissions and configuration are exactly how you'd hope coming from systemd services. I appreciate it leverages various linux-isms like btrfs snapshots for faster read only or ephemeral containers.
People still by and large have this weird assumption that you can only do OS containers with nspawn; never been too sure where that idea came from.
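A minimal sketch of the application-container case, since that's the part people don't expect (the image path and payload binary are placeholders; assumes a tree already built, e.g. with mkosi):

```bash
# Run a single service in a throwaway nspawn container (no --boot, no full OS).
#   --ephemeral   work on a disposable copy (cheap with btrfs snapshots)
#   --as-pid2     run the payload under nspawn's minimal stub init
systemd-nspawn \
    --directory=/var/lib/machines/app \
    --ephemeral \
    --as-pid2 \
    --bind-ro=/etc/resolv.conf \
    /usr/bin/my-service --config /etc/my-service.toml
```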
Building VMs?
funnily enough, I stopped using Docker and use NixOS-configured systemd services half a decade ago and never looked back
What does systemd have to do with the video?
Alternative: just produce relocatable builds that don’t require all of this unnecessary extra infrastructure
Please elaborate. How does one "just" do that?
Deploying computer programs isn't that hard. What you actually need to run is pretty straightforward. Depend on glibc, copypaste all your other shared lib dependencies and plop them in RPATH. Pretend `/lib` is locked at initial install. Remove `/usr/lib` from the path and include everything.
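Something like this rough sketch, where `app` and the dist/ layout are placeholders and glibc is deliberately left to the host:

```bash
# Bundle a binary with its non-glibc shared libraries and point RPATH at them.
mkdir -p dist/lib
cp app dist/

# Copy the shared libraries the binary actually resolves, minus glibc itself.
ldd app | awk '/=> \//{print $3}' | grep -v -e 'libc\.so' -e 'ld-linux' |
    xargs -I{} cp {} dist/lib/

# Make the binary look next to itself first, wherever dist/ ends up.
patchelf --set-rpath '$ORIGIN/lib' dist/app
```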
Docker was made because Linux sucks at running computer programs. Which is a very silly thing to be bad at. But here we are.
What has happened in more recent years is that CMake sucks ass so people have been abusing Docker and now Nix as build system. Blech.
The speaker does get it right at the end. A Bazel/Buck2 type solution is correct. An actual build system. They're abusing Nix and adding more layers to provide better caching. Sure, I guess.
If you step back and look at the final output of what needs to be produced and deployed it's not all that complicated. Especially if you get rid of the super complicated package dependency graph resolution step and simply vendor the versions of the packages you want. Which everyone should do, and a company like Anthropic should definitely do.
That you contrast Nix with Bazel and liken it to Docker tells me you don't have a great grasp of what Nix is. It is far more similar to Bazel than it is to Docker.
> Depend on glibc, copypaste all your other shared lib dependencies and plop them in RPATH. Pretend `/lib` is locked at initial install. Remove `/usr/lib` from the path and include everything.
You are not describing relocatable builds at all. You are describing... well, it kinda sounds like how Nix handles RPATH.
Maybe. Ask 10 devs "what is Nix" and you'll get 15 to 25 responses. Maybe more. Nix is a million different things.
There are certainly things within Nix that I like. But on the whole I think it's approximately two orders of magnitude more complicated than is necessary to efficiently build and deploy software.
Nix is 4 or 5 different things. I agree that the term is unfortunately quite overloaded.
> But on the whole I think it's approximately two orders of magnitude more complicated than is necessary to efficiently build and deploy software.
This might be true, if you dramatically constrain the problem space to something like "only build and deploy static Go binaries". If you have that luxury, by all means, simplify your life!
But in the general case, it is an inherently complex problem space, and tools that attempt to rise to the challenge, like Bazel or Nix, will be a lot more complex than a Dockerfile.
My core hypothesis - which is maybe wrong - is that a good Bazel-like doesn't have to be that complex.
I use Buck2 in my day job. For almost all projects it's an order of magnitude simpler than CMake. It's got a ton of cruft to support a decade's worth of targets that were badly made. But the overall shape is actually pretty darn good. I think Starlark (and NixLang) are huge mistakes. Thou shalt not mix data and logic. Thou shalt not ever ever ever force users to use a language that doesn't have a great debugger.
Build systems aren't actually that complicated. It's usually super duper trivial for me to write down the EXACT command I want to execute. It's "just" a matter of coaxing a wobbly Rube Goldberg machine that can't be debugged into invoking the god damn command I know I want. Grumble grumble.
> simply vendor the versions of the packages you want
That's really not how frontier research works.
The packages they want are nightlies with lots of necessary fresh fixes that their team probably even contributed to, and waiting for Red Hat to include it in the next distro is completely out of the question.
Objectively false.
There is a wealth of options between latest_master -> nightly -> ..... -> RedHat update.
And there's only a very small number of specific libraries that you'd even want to consider for nightly. The majority of the repo should absolutely be pinned and infrequently updated.
There have been so many vendor supply chain exploits covered on HN in 2025 that I'd consider it borderline malpractice to use packages less than a month old in almost all cases. Certainly by default.
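For the Python side of a repo, a sketch of what pinning by default can look like (pip-tools assumed; file names are placeholders):

```bash
# Resolve loose requirements.in into an exact, hash-locked requirements.txt,
# then make installs refuse anything that doesn't match those pins.
pip-compile --generate-hashes --output-file=requirements.txt requirements.in
pip install --require-hashes -r requirements.txt
```

Bumping a dependency then becomes a deliberate re-lock and review, not something that happens implicitly at install time.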
> Docker was made because Linux sucks at running computer programs.
I'm not sure that's why Docker was made.
I'm pretty sure Linux is not-suck at running programs; it does run quite a lot of them. Might even be most of them? All those servers and phones and stuff.
Nah. Docker was created to solve the "works on my machine" problem. Because Linux made the wrong choice of having a global pool of shared libraries. So Docker hacked around this by containerizing a full copy of the desired environment.
What Docker has done is kinda right. Programs should include all of their dependencies! But it sucks that you have to fight really hard and go out of your way to work around Linux's original choices.
Now you may be thinking: wait a second, Linux was right! It's better to have a single copy of shared libraries so they can be securely updated just once instead of forcing every program to release a new update! Theoretically, sure. But since everyone uses Docker, you have to (slowly and painfully) rebuild the image, so that supposed benefit is moot. Womp womp.
Additional reading if you are so inclined: https://jangafx.com/insights/linux-binary-compatibility
I think the idea of shared libraries is also linked to a problem of the past: expensive storage (especially fast storage).
Nowadays SSDs with decent random IO are quite cheap to the point where even low-end hosting has them, and getting spinning disks is a choice reserved for large media files. On the consumer side, we are below a hundred dollars for one TB, so the storage savings are not very relevant.
But if you go back to when Linux was designed, storage was painfully expensive, and potentially saving even a few hundred megabytes was pretty good.
But I do agree that shared libraries are generally a bad idea and something that will most likely create problems at some point. Self-contained software makes a lot more sense generally speaking. And I definitely think that Docker is a dumb "solution" for software distribution but the problem really starts with devs using way too many dependencies.
I'm firmly in the SO camp.
One thing I see is that folks who are making software for Linux target a distribution rather than generic Linux. It's because of the SO "problem".
It's a benefit of the tight coupling of Windows or Mac (which I don't use).
Fully agree that security updates in the Docker world have the same problem as static builds.
Disclosure: I ship some stuff on Linux; for these problems we do static & docker. The demand side also seems to favour docker. I also prefer the docker method, for compatibility reasons.
Strong agree on the “target distro” or even “target specific env” versus Linux.
I think I disagree on Windows being tightly coupled. Windows simply has a barren global library path. Programs merely include their extra DLLs adjacent to the exe. Very simple and reliable.
Linux has added complexity in that glibc sucks and is horribly architected. GCC/Clang are bad and depend on the local environment waaaay too much. Linux ecosystem is very much not built to support sharing cross compiled binaries. It’s so painful. :(
Docker is overkill if all you really need is app packaging.
Docker containers may not be portable anyway when the CUDA version used in the container has to match the kernel driver and GPU firmware, etc.
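A hedged sketch of the minimum compatibility check worth doing before assuming the image is portable (the image name is a placeholder; assumes the NVIDIA container toolkit is installed):

```bash
# Host: which driver version is actually running?
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Container: which CUDA toolkit was baked into the image?
docker run --rm --gpus all my-cuda-image nvcc --version
```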
Some people, when confronted with a problem, think "I know, I'll use Nix." Now they have two problems.
Seems like anti-intellectualism is spreading at HN, too.
Oh, please.
The only anti-intellectualism is not accepting that every technology has tradeoffs.
Yup, and there's a high correlation between people rewriting everything in Rust and converting everything else to Nix. It's like a complexity fetish.
It's removing complexity elsewhere, usually much more. Once you have invested in a relatively fixed bit of complexity, your other tasks become much easier to complete.
Once you have invested in understanding the Clifford algebra, your whole classical electrodynamics turns from 30 equations into two.
Once you have invested in writing a Fortran compiler, you can write numerical computations much easier and shorter than in assembly.
Once you have invested in learning Nix, your whole elaborate build infra, plus chef / puppet / saltstack suddenly shrinks to a few declarations.
Etc.
> Once you have invested in learning Nix, your whole elaborate build infra, plus chef / puppet / saltstack suddenly shrinks to a few declarations.
Your analogy breaks down with Nix, since learning and using it is a hostile experience, unlike (I assume) your other examples.
I have been using NixOS for about 5 years now on several machines, and I still don't know what I'm doing. Troubleshooting errors and implementing features is like climbing a mountain in the dark.
The language syntax is alien. Most functionality is unintuitive. The errors are cryptic. The documentation ranges from poor to nonexistent. It tries to replace every package manager in existence. The community is unwelcoming.
Guix addresses some of these issues, and at least it uses a sane language, but the momentum is, unfortunately, largely with Nix.
Nix pioneered many important concepts that most operating systems and package managers should have. But the implementation and execution of those concepts leaves a lot to be desired.
So, sure, if you and your team have patience to deal with all of its shortcomings, I can see how it can be useful. But personally, I would never propose using it in a professional setting, and would rather use established and more "complex" tooling most engineers are familiar with.
"your entire static website is running on GitHub pages? Sounds like legacy tech debt. I need to replace it with kubernetes pronto"
The thing with some engineers is that if there's no user problem to solve, they'll happily solve some hypothetical problem.
Having said that, my weekend project was "upgrading" my RSS reader to run HA on Kubernetes.