The article assumes there are people who want clusters. But a single Linux VM in the cloud can scale pretty far. Separate VMs for different apps work well for isolation. Why do I need a cluster?
Configuring one box is enough of a pain. I guess AI fixes that though. I don't need to learn box wrangling if the boxes wrangle themselves.
> Why do I need a cluster?
Uptime, self-healing, reproducibility, separating the system from the app. There are probably a half dozen more.
K8s comes with a resource consumption tax, certainly, but for anything beyond the trivial it's usually justified.
> Separate VM's for different apps works well for isolation
Sounds inefficient, along with a lot more work doing the plumbing than simply writing 100 lines of yaml.
Who wants to deal with YAML? Sometimes the easiest way to set up a VM is by talking to your phone:
https://commaok.xyz/ai/just-in-time-software/
I mean, I don't do that, but I'll type a prompt.
you won't have to deal with yaml for these clusters
let me draw this out the way i've been playing with:
a classic vm exists, and supports kvm — this means you can run stuff like firecracker in there
an ssh server runs on this vm, and when you connect to it, you're dropped into a repl/tui where you can list existing microvms, create new ones, or destroy existing ones, and, of particular use, you can attach to one.
as an added nicety, if you connect with `ssh user+dev@example.com`, your connection skips the management interface and you are dropped into the `dev` machine — if it didn't yet exist, you wait 3s, and now it does
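here's a minimal sketch of how that `user+dev` routing could look in go, assuming the `github.com/gliderlabs/ssh` wrapper — `ensureVM` and `attachVM` are hypothetical stubs standing in for the actual firecracker plumbing:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/gliderlabs/ssh"
)

// hypothetical stubs for the actual firecracker plumbing
func runManagementTUI(s ssh.Session, account string) { fmt.Fprintln(s, "tui for", account) }
func ensureVM(account, vm string)                    { /* boot the microvm if absent (~3s) */ }
func attachVM(s ssh.Session, account, vm string)     { fmt.Fprintln(s, "attached to", vm) }

func main() {
	ssh.Handle(func(s ssh.Session) {
		// "user+dev" -> account "user", target microvm "dev"
		account, vm, hasTarget := strings.Cut(s.User(), "+")
		if !hasTarget {
			runManagementTUI(s, account) // list / create / destroy / attach
			return
		}
		ensureVM(account, vm)
		attachVM(s, account, vm)
	})
	log.Fatal(ssh.ListenAndServe(":2222", nil))
}
```

the nice part is the whole dispatcher is just a username parse; everything interesting lives behind those two stubs.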
vms can talk to each other internally, can connect out, and persist if the server needs a restart
what i don't have yet: proper multi-tenancy (it treats each ssh key as an account, which is fine since it's just me), incoming connections, an internal supervisor to keep services running inside each microvm, isolation inside firecracker, snapshotting or backups, and the whole shopping list that would make it an actual mvp
Sounds like a nice setup. The way exe.dev does things seems somewhat similar.
yep, a lot of this is based on what exe.dev already does, some of it takes more inspiration from sprites.dev, and others are wishlist items
> Why do I need a cluster?
I run a single-node K8s cluster on a dedicated server because it's way cleaner to manage than the previous mess and mix of docker compose + traefik routing + random stuff installed as package on the host.
I can create "vhosts" for practically anything in a declarative manner, and if the cluster blows up, I have 5 small scripts to bootstrap it and all I need is `kubectl apply -k .`.
I briefly played with k3s before realising that with a single machine I was maintaining a lot of complexity for limited benefits. Then I switched to NixOS: everything is declared in configuration, and the setup is much leaner and simpler.
I think k8s starts making sense when you have to manage more than 10-15 machines. Better yet, 50-100 machines. Especially if these 200 machines actually run 3-4 types of containers total.
Usually that's rather unlike a sane dev setup, though. Even if your prod setup uses hundreds of microservices (you're Google or Uber or something like that), you don't want to run all of them in your personal dev environment; you reuse the 90%-99% of stable microservices running in the QA / integration / whatnot environment and only run a handful locally.
Never understood the appeal of Kubernetes to developers, outside of massive deployments. Always felt like a poor man's Linux for those that insist on using an Apple or Windows desktop.
I am not sure I understand this argument. Kubernetes typically runs on Linux. I use an Apple laptop, work mostly with headless Linux VMs and Kubernetes. What is a “poor man’s Linux”?
Does your Apple laptop run Linux or macOS? Do you run Kubernetes locally or only when the network permits? What was the reason for targeting Linux rather than macOS? And what, in this context, is the value add of using Kubernetes for your development?
Yeah I’ve been doing this with tailscale and a single vps and it’s been wonderful. Unless you’re planning to have millions of users I don’t think there’s any reason to have a cluster.
Maybe they’re assuming some massive amount of compute will be necessary for future tasks? Self hosted LLMs? I’m currently finding it difficult to come up with more uses for my vps beyond hosting trillium and some personal applications I’ve made
Isn't there a meaningful sense in which "separate VMs for different apps" constitutes a cluster?
The "cooperative task" they're engaged in is just, broadly, meeting your needs, whatever they are.
The isolation is a desirable property, and I agree this is much preferable to a highly inter-coupled bunch of machines, and also that this stretches the typical sense in which we refer to a "compute cluster", but I don't think it's an entirely invalid framing of the term.
> Isn't there a meaningful sense in which "separate VMs for different apps" constitutes a cluster?
Not really. In my experience clustering implies multiple compute elements serving the same function with a coordination mechanism to provide redundancy and/or enhanced capacity.
JBOD vs. RAID.
if you run firecracker inside the rented cloud vm, and you let a few of them run, and perhaps interact with each other, you have essentially created a cluster of microvms that's hosted on a single machine
as argued by OP, you can see this happening with exe.dev, and less explicitly with sprites.dev
MPI is kind of fun to write.
> Why do I need a cluster?
Supposedly because a box with dual AMD EPYC 9965, 12TB of RAM, 10 x Nvidia H200 and 1PB of storage might not be enough to run the latest version of Solitaire or Minesweeper and you need more oomph.
Or maybe you want to run stuffz on 1000 x Raspberry Pi just for fun.
Wouldn’t it be cheaper / less complex to scale vertically (eg a large workstation or medium size bare metal server) instead of using clusters? My understanding is that clusters are primarily useful when you want to share a resource from a pool across unpredictable usage, which becomes a moot point once the cluster is personal.
Scale isn’t the only reason. Sometimes you want resource isolation and self-healing, something that is useful if you want a personal swarm of AI agents.
I get how running in a container or vm would help with that, but why would you want to cluster multiple of them? Are you isolating the agents from one another?
Clusters are not happening for people at large, considering individual systems can still be very powerful and very expensive. This cost won't really come down until we have been stably at 7 angstroms for a decade. That probably means 2045 at the very earliest. Until then, personal clusters seem extraneous.
With regard to AI, hopefully we can run it efficiently on an ASIC.
No idea about ClusterOS, but I would recommend IncusOS if you're looking for a nice clustering solution. Incus has become indispensable in my homelab over the past few months. It's what I put on my bare metal machines and then spin up Talos Linux VMs for day job practice.
I really liked IncusOS but it still felt quite primitive compared to Proxmox. I also didn’t really like the way it bundles VMs and containers into an ‘instance’ concept, it made the UI and management via Terraform confusing. Had a lot of problems with the TF provider too.
How does the IncusOS API compare to Talos? When I first looked at it it seemed very minimal and I didn't see a lot of options for more complex installs (eg network bonding, disk partitioning).
I'm actually confused about what ClusterdOS is and does besides glue a bunch of projects together in an opinionated way.
It sits on top of Kubernetes and seems very hand wavy about how you create and manage those clusters.
As far as I can tell, and from some quick research into the guy's previous experience, that's all it is. I think the implication is that LLMs will be architecting and deploying the cluster setups at some point? Which sounds horrific, so I'm assuming I'm interpreting it wrong.
The article itself reminds me of the enthusiasm I felt for plan9 when I first heard about it back in uni. I also thought everyone should have their own compute grids and that clustered computing was the future; of course now I realize there's a lot of reasons why that doesn't actually work. Considering this appears to be a start-up ad, I hope the author knows something I don't.
Claude Code + Ansible + whatever stuff needs to be managed gives some visibility and control and in my experience is reliable enough to be useful.
I'm assuming you're at least overseeing the creation/updates of the Ansible playbooks and have some familiarity with what is being managed outside of that. While I personally would not do that[0], I can see the reasoning behind it.
ClusterdOS appears to be a kubernetes-in-a-box multi-node setup whose goal is to work so well that the user doesn't know or care what it's doing. I wouldn't trust an LLM with managing one machine by itself, let alone a whole cluster of them running the incredibly complex mess that is Kubernetes (and that's not even counting the 8 other layers of software this adds), so this feels like an order of magnitude worse.
[0] Using LLMs for sysadmin research or boilerplate writing is one thing, but after a certain amount of use you're really just paying $X a month for Anthropic to manage your systems for you. I'd rather just pay a real person to do it at that point. I'd also rather people get over their pathological fear of learning how to run a server but I've given up on that.
I've been using various UNIXes and clones since the 90s so I do generally know what's going on but I also have no desire to fill my brain with the syntax to the new new new commands to configure an Ethernet interface on Linux etc, or the work necessary to understand fully why VA-API on a certain chip has specific quirks that break freerdp, nor the toil of backporting and patching the necessary libraries, or the specific dance required to set up a machine to TFTP new firmware onto one of the switches, or.... You get the idea.
I'm also not a fan of all the complexity of Kubernetes; one directory with simple-to-read files makes it a lot more transparent what's where and how it's set up, and the commit history + changelogs make it relatively clear what's changing and why. No distributed database or fancy bootstrapping, just a ubiquitous config format and a tool to apply it. Changes at the granularity of "a new host is available at A.B.C.D, configure it as a dev server" or "add a new Debian system container named 'blah' to X, bridge it to the research network only, limit to 16 hyperthreads / 64 GB RAM, set up for development on git://<whatever>". It works OK for now.
The next major change will be when models that run locally are capable enough to drive the config changes themselves.
I’m not sure quite what this is trying to say. My laptop is already a personal cluster — it has 16 cores, lots of storage, a fast network, I run VMs on it. It’s been the case for a long time that you can run bursty jobs in the cloud if you need more power for a brief period than whatever is currently locally affordable. That’s kind of what the cloud is for, really. So what’s new?
It’s pretty fun to throw a thousand cores at a problem, but I guess it won’t be that long before you can get that in two-socket AMD workstation or whatever.
The best part of this article is in the footnotes:
> see CEO of Tailscale apenwarr's vibe-researched thread
“Vibe-research” is now a core part of my vocabulary.
Clusters are almost never the right answer for most problems: https://yourdatafitsinram.net/
Most data problems don't need to fit in RAM.
You're drawing an incorrect conclusion from that site. Aside from the fact that "fitting in RAM" is not the only criterion for needing a cluster, the fact that it's possible to fit data into RAM on a single machine doesn't mean that's the most cost-effective, practical, or sensible solution.
A big advantage of clusters, and horizontal scaling in general, is the ability to easily dynamically scale to meet demand.
If you're running a system on a single machine that has N GB of memory and you need to scale to N+1, what do you do? Provision a new machine and migrate everything over?
No-one operates online real-time systems like this. Clusters make it much easier and less expensive to handle this.
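To make the dynamic-scaling point concrete, this is roughly what it looks like in Kubernetes terms — a hedged sketch with hypothetical resource names:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web            # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20      # capacity grows and shrinks with load,
  metrics:             # no machine re-provisioning involved
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```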
On top of that, it's probably true that in some pure numerical problem-count sense, "most problems" don't need a cluster, but that's misleading. It's like saying "most businesses are mom-and-pop shops." Perhaps true, but it ignores hundreds of thousands of larger businesses, or even small business that have big data needs.
There are plenty of problems that involve large amounts of data, and that's increasingly true with ML applications.
I'm at a company of ~100 people which you've probably never heard of (classified as a "small" company in government stats, so not included in the hundreds of thousands figure I mentioned above.) We have 1.9 PB of data for our main environment. When we run processes that deal with it all, the clusters scale to thousands of vCPUs and tens of terabytes of RAM.
Several processes that run daily scale to 500+ vCPUs and many TB of RAM. For the latter, the data itself could probably fit in RAM on a humongous machine, but the CPUs wouldn't fit on a single machine. And we'd have to size the machines carefully every time we start them up. Clusters can scale up dynamically according to the demands of the jobs they're executing.
Not all clusters are elastic. Cloud infrastructure can be, but HPC setups before the cloud were not.
Even in a physical hardware, on-premise scenario, it's still easier to scale horizontally than vertically in almost all cases, for all the reasons I mentioned. That's a big reason why Kubernetes was adopted at an unprecedented pace at medium to large organizations - because it helps manage that approach.
They could have chosen Mesos instead. Kubernetes had other characteristics that allowed it to be adopted far and wide besides the ability to scale horizontally.
I said a big reason, not the only reason.
Besides, Mesos wasn't a good alternative for most companies, so saying "they could have chosen it instead" is a bit theoretical. Mesos was ambitious, but that made it less suitable for a plug & play system that fit easily into existing corporate systems, which had already adopted containers heavily.[1] Another reason for Kubernetes' popularity is it didn't try to be a big leap forward the way Mesos did.
[1] The Marathon container support for Mesos was released about a year after Kubernetes, but if you were going to set up a system for distributed orchestration of containers, it didn't make much sense to bring Mesos along for the ride. There's a reason Mesos is in Apache heaven now (the Attic.)
that's... kind of not true. they weren't elastic in the sense that you never had to think about how big they were. but you had say 64k nodes, and people would launch jobs with 1000 of them, or 10000, or if they could clear the decks, all of them. or if they were just debugging, maybe 5 of them.
so I guess idk what you mean by 'elastic' here.
The site's link for GitHub has the GitLab icon and URL.
Does anyone know what font is used in the article? I like it a lot.
Imagine a Beowulf cluster of these!
I think people are putting together pi clusters for their homelab these days.
I don't see how an operating system can work for a cluster.
You can have more than one CPU and more than one storage device connected to one mainboard, and that works because the interconnect fabric is very fast.
We don't have the possibility to connect different computers at the same kind of speed that would let them work together seamlessly.
One could argue that multiple cores are already not seamless especially if you have NUMA (now available in high-end desktops by the way! and every multi-socket system that's ever existed) and the distinction between RAM and disk is very not seamless and so is any other number of things you'd hope the OS would magically handwave away for you but it doesn't.
10Gbps is now very cheap and 100Gbps is viable at hobby scale. That's Ethernet. I don't know anything about CXL and so on.
Exactly how I'm thinking about it. NUMA, the RAM/disk hierarchy, CXL. Operators have always abstracted over nonuniform substrates with very different latency tiers. The fabric inside a modern server is already a small network. But the argument for an OS at the cluster level isn't that the interconnect becomes seamless, but that the substrate becomes standardized, regardless of underlying hardware.
Check out Plan 9 and Mosix. They weren't super fast but they worked.
we built machines with all kinds of approaches to this. ones with giant shared memories and memory networks. the Tera MTA famously had uniform memory access, since all of the memories were on the other side of a network from the CPU, and hardware-managed threads tried to hide that latency.
we built machines with RDMA that allowed fast one-sided transfers between memories at a decent fraction of the memory bandwidth. and operating systems that ran services to present a unified operating system interface on top of that.
there is a whole history of distributed operating systems if you're interested
I have an irrational soft spot for Apache Mesos. I loved the separation of the resource management from the scheduling. Note to self: do not rabbit hole on this. Hm. Maybe mesos is the manager for my agent sandboxes. No! Bad lowbloodsugar!
How is resource management distinct from scheduling in Mesos?
Buddy, 90% of people can't even open a Word document without immense stress.