> A CPU with 4 cores is going to have the capacity of executing 4 seconds of CPU-time per second. It does not matter how much “background idle threading” you do or don’t. The CPU doesn’t care. You always have 4 seconds of CPU-time per second. That’s an important concept to understand.
> If you write a program in the design of Node.js — isolating a portion of the problem, pinning it to 1 thread on one CPU core, letting it access an isolated portion of RAM with no data sharing, then you have a design that is making as optimal use of CPU-time as possible
This is... not true as written? You get one second of CPU time per second, not four. Now, it may be quite hard to reach your full four seconds of CPU time per second, usually because of RAM bandwidth issues despite all the caching, and a hyperthreading fake "core" absolutely does not count the same as a separate die core, but the difference is real.
Author does have a point that slicing the work too small has significant overheads. But they've overstated it.
And this is before we get into the real source of parallel FLOPS, the GPU.
(edit: note that there may also be thermal issues and CPU frequency scaling going on; it is usually impossible to run all cores of a modern CPU at their max rated frequency for more than a very short time! But if you've bought a 64-core Ryzen and are only using one core, there's a huge gap there which you're not using)
> Author does have a point that slicing the work too small has significant overheads. But they've overstated it.
Exactly.
I was messing around with adding multi-threading to a 3D thing, and it slowed things down for the smaller cases until the work was big enough to overcome the overhead, at which point it sped things up. It was using OpenMP and only a couple of shared loop variables, so probably not as drastic as whatever Node does, but it did slow the common case down enough to not be worth the effort.
The author of TFA needs to go run any renderer in single and multi-thread mode then report back to the class.
> The author of TFA needs to go run any renderer in single and multi-thread mode then report back to the class.
Indeed. The whole of modern graphics API architecture hinges on the idea that each of your million or so pixels is a meaningful unit of work that can be done in parallel.
I still think that the argument is flawed beyond trivially parallel problems, but my understanding is that the author is arguing for shared-nothing, not for single threaded/single-process solutions.
> The irony is that, since you are splitting the problem in a way that requires synchronization between cores, you are actually introducing more work to be executed in the same CPU-time budget. So you are spending more time on overhead due to synchronization, which does the opposite of what you probably hoped for — it makes your code even slower, not faster.
That is certainly not universally true for every scenario, and if you need to sync state between CPU cores very often then your tasks simply don't lend themselves to parallelization. That doesn't mean that multi-threading is inherently the wrong design choice. Of course it will always be a trade-off between performance gains and the code complexity of your job control.
But people new to thinking about concurrency don’t know this. They write “parallel” code that goes pretty fast, and that initial reward convinces them to continue. Then the bugs are found, and their parallel code gets slower, but that’s just bug fixes. Never mind the performance loss.
I'm perfectly happy to accept that multi-threading isn't the best solution to certain classes of problems, and the author makes compelling arguments to this end, but they haven't said what classes of problems shouldn't be solved with multi-threading.
That's the fundamental weakness of all these 'best-practice' blog posts: authors are much more willing to pluck a tool from their toolbox and tell us to use that one than they are to give advice on how to pick the right tool for the right job.
> If you write a program in the design of Node.js — isolating a portion of the problem, pinning it to 1 thread on one CPU core, letting it access an isolated portion of RAM with no data sharing, then you have a design that is making as optimal use of CPU-time as possible. It is how you optimize for NUMA systems and CPU cache locality. Even a SMP system is going to perform better if treated as NUMA.
Well that's... just wrong.
It's not wrong, it's just over-simplified.
It is true that you can only squeeze 100% of the maximum possible useful compute out of a NUMA system with methods like the article author was suggesting. The less coordination there is between cores, the less cross-core or cross-socket communication is needed, all of which is overhead.
Caveat: If a bunch of independent processes are processing independent data, they'll increase cache thrashing at L2 and higher levels. Synchronised threads running the same code more-or-less in lockstep over the same areas of the data can benefit from sharing that independent processes can't. In some scenarios, this can be a huge speedup -- just ask a GPU programmer!
Where the process-per-core argument definitely stops being a good approach is when you start to consider latency.
Literally just this week, I needed to help someone working on a Node.js app that has to pre-cache a bunch of very expensive computations (map tiles over data changing on an interval).
Because this is CPU-heavy and Node.js is single-threaded, it kills the user experience while it is running. Interactive responses get interleaved with batch actions, and users complain.
This is not a problem with ASP.NET where this kind of work can simply run in a background thread and populate the cache without interfering with user queries!
For similar reasons, Redis replacements that use multi-threading have far lower tail latencies: https://microsoft.github.io/garnet/
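For concreteness, here is a minimal sketch (not the poster's actual code) of how that pre-caching work could be pushed off Node's request-serving event loop with worker_threads. renderTile and the tile ids are stand-ins, and re-running the same file as its own worker assumes a CommonJS build where __filename points at this module:

```ts
// Sketch: move CPU-heavy pre-caching into a worker thread so the main
// event loop keeps serving interactive requests.
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";

// Stand-in for the real expensive, CPU-bound tile computation.
function renderTile(id: number): Buffer {
  let acc = 0;
  for (let i = 0; i < 5_000_000; i++) acc = (acc + id * i) % 9973;
  return Buffer.from([acc & 0xff]);
}

if (isMainThread) {
  const tileCache = new Map<number, Buffer>();
  const worker = new Worker(__filename, { workerData: { tileIds: [1, 2, 3, 4] } });
  worker.on("message", ({ id, tile }) => tileCache.set(id, tile)); // fill cache as results arrive
  worker.on("error", (err) => console.error("pre-cache worker failed:", err));
  // ...the main thread stays free to handle HTTP requests here...
} else {
  for (const id of workerData.tileIds as number[]) {
    parentPort!.postMessage({ id, tile: renderTile(id) }); // one tile at a time
  }
}
```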
Someone familiar with this please confirm: if you have a Node.js app, which is truly single-threaded, do people run multiple copies per physical CPU, or do they just max out one core and leave the rest idle? Or lease the cores separately from their hosting provider by running a bunch of one-core container instances or something?
Node supports two ways of getting additional event-loop threads (this being the “single-threaded” part that people often talk about with Node without understanding much about its internals, as the Node process itself spawns many threads in the background).
The first mode is child process: the main process forks an entirely separate instance of Node with its own event loop, which you communicate with over IPC or some network socket.
The second mode (introduced fairly recently) is the ability to spin off worker threads which have their own event loop but share the worker thread pool of the main process. I think there is a way to share memory between these threads via some special type of buffer, but I have never used them.
The first mode maps directly to the idea of micro-services, just running on the same machine. This is why, AFAIK, it is not really used in modern cloud-based apps, which use single-core micro-service instances instead. That approach has a higher latency cost but allows cheaper instances and much simpler services - it very much depends on the use case whether that is the correct choice or not.
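The "special type of buffer" mentioned above is SharedArrayBuffer. A minimal sketch of sharing memory between the main thread and a worker thread (again assuming a CommonJS build where __filename resolves to this module):

```ts
// Sketch: main thread and worker thread see the same bytes through a
// SharedArrayBuffer; Atomics provides race-free reads and writes.
import { Worker, isMainThread, workerData } from "node:worker_threads";

if (isMainThread) {
  const shared = new SharedArrayBuffer(4);          // one Int32 slot
  const counter = new Int32Array(shared);
  const worker = new Worker(__filename, { workerData: shared });
  worker.on("exit", () => {
    console.log("final value:", Atomics.load(counter, 0)); // expect 1000
  });
} else {
  const counter = new Int32Array(workerData as SharedArrayBuffer);
  for (let i = 0; i < 1000; i++) Atomics.add(counter, 0, 1); // atomic increments
}
```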
At my company we use node's cluster module, configured to effectively run a child process for each physical CPU on our Azure App Service.
We have a load balancer in front of that to scale app service instances.
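A minimal sketch of that cluster setup (one forked worker per core, with the primary distributing connections); the port and the respawn policy are illustrative:

```ts
// Sketch: process-per-core with Node's cluster module. Each worker runs the
// whole single-threaded app; the primary hands incoming connections to them.
import cluster from "node:cluster";
import http from "node:http";
import os from "node:os";

if (cluster.isPrimary) {
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
  cluster.on("exit", (worker) => {
    console.log(`worker ${worker.process.pid} died, respawning`);
    cluster.fork();
  });
} else {
  http
    .createServer((_req, res) => res.end(`handled by pid ${process.pid}\n`))
    .listen(3000); // workers share the port; the primary balances connections
}
```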
I mentally translate “modern web development” to: proxies for the proxies and layers of load balancers upon load balancers to make the envoys work with the ingress, all through an API management layer for good measure… and then a CDN.
“Why is our app so slow?”
“It’s a mystery. Just scale it out some more!”
It’s a common architecture antipattern to try to run non-interactive tasks on the same resources as interactive ones.
Even if you avoid the initial problems by punching the CPU priority through the floor, someone will eventually introduce a bug/feature that increases IOPS or memory usage drastically, or finds some other way to accomplish priority inversion. You will think you can deploy this code any time you want and discover that you can’t.
And having survived the first and second crises, you will arrive at the final one: a low-priority process is not a no-priority process. People will expect it to be completed “on time”. And as the other processes saturate the available resources via either code bloat or right sizing, now your background task isn’t completing the way people have come to expect it to complete. It stops being able to be “free” by parasitism and has to have its own hardware.
> It's not wrong, it's just over-simplified.
I would say if you simplify this much it becomes just plain wrong.
And the reason given is:
"it brings complexity very few developers understand"
Someone tell this guy about GPUs ...
Or games. Or web servers. I think the author might have forgotten to suffix the title with “for the work I do”.
Web servers are notoriously easy to implement with message passing / fork. Threading is a particularly bad model for that case.
Games on the other hand? Oh yeah, you definitely need threads for that.
It's important to note that "multithreading" specifically means the shared-memory model of parallel processing. Games are more of an exception than the rule when it comes to being well suited to shared memory.
Isn’t Nginx multithreaded rather than multiprocess? I’m not an expert in web stuff, but it’s always felt intuitive that worker threads map nicely onto typical web server workloads.
Also, any UI app should basically be multithreaded to prevent interface hiccups.
No, nginx uses a fixed number of worker processes and communicates primarily through explicit shared memory and message passing.
Threading works OK, but there's rarely much shared state, so the fork-and-pipe / shm model wins: it's cleaner and more secure.
As for UIs: maybe! I'm sure it depends, but this isn't an area I know a ton about
Interesting perspective, I’ve never been thinking about this problem in terms of whether workers need shared memory space. Thank you!
yeah, definitely. I do GPU coding at work and that's hell. Optimizing the memory saturation and cpu saturation is maddening and always comes with surprises.
(but when it works, it's so gratifying! the GPU go very very fast)
Not sure why this is on the HN front page. Post is riddled with errors, although it mixes in a few in-some-cases truths as well.
Oh, and [2023].
This guy's patronizing rants have made me avoid his software (µWebSockets) even though it seems good. Stop trying to be Torvalds. Even the docs used to be riddled with it:
> In the 1970s, programming was an elite's task. Today programming is done by uneducated "farmers" and as a result, the care for smart algorithms, memory usage, CPU-time usage and the like has dwindled in comparison.
Setting the tone aside for a moment, I can say I see lots of people using things they don't understand and making hard problems for themselves. To be fair, I know that I don't understand those things properly, and yet I see people rushing in to use them who know even less.
With threads it was a messaging service that was supposed to offer a persistent queue that could survive restarts. Multithreading bugs possibly doubled or trebled the length of the project. I wasn't the creator of the code - it was someone without a degree. In the end I had to solve one bug in it that took 3 months to work out, and it was just a double-free in some odd circumstance. Nobody wanted to touch it and muggins (i.e. me) was the last person without an excuse!
Asynchronous Python and Python threading are the recent ones I've experienced: JavaScript programmers who decided that Python was trivial to learn and tried to speed everything up with threads (which makes everything worse until 3.13 with a special compilation option that we could not use), and who then made life even worse to no purpose at all by using async without knowing that the ASGI system underneath didn't support it properly. Uvicorn does, but Uvicorn wasn't usable in that context.
Apart from creating wonderful opportunities for bugs, they didn't even know how to write async unit tests, so the tests always passed no matter what you did to them.
When trying to help with these issues I found the attitude to be extremely resistant. There was no way they were going to listen to me - the annoying whippersnapper in one case or the old fart programmer in the other. They just knew better.
Multi-Threading is the worst solution, except for all the other solutions.
Avoiding multi-threading doesn't remove concurrency issues. It just moves them to a different point in the application execution. A point where you don't have debuggers and need to create overly fault-tolerant behavior for everything. This is bad for performance but worse for debugging. With a regular synchronous threaded application I have a clean, obvious stack trace to a failure in most cases. Asynchronous or process based code gives me almost nothing for production failures or even regular debugging.
Using something you don't actually need is wasteful. When you don't need threads, don't use them. Trivially parallelizable problems should be trivially parallelized.
Sometimes you want to do something complex in a short time though. If latency matters, sometimes you don't have a choice. Running 10 instances of the same game at 10fps each is not the answer.
Bit dizzy from the praise of Node.js and complaints about cache invalidation in one rant.
> Again, say what you want about Node.js, but it does have this thing right.
Async/promise/deferred code is just re-implementing separate control callback chains that are not that different from threads when talking about IO. You'll still need mutexes, semaphores and such.
That's why there are things like async-mutex. And it's not just a JavaScript problem; Python's Twisted also has these: https://docs.twistedmatrix.com/en/stable/api/twisted.interne... DeferredLock and DeferredSemaphore.
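A small sketch of the kind of situation the parent means, using the async-mutex package: single-threaded async code can still interleave across await points, so a logical critical section needs a lock.

```ts
// Sketch: two concurrent withdrawals interleave across the await; without the
// mutex both could read the old balance before either writes it back.
import { Mutex } from "async-mutex";

const balanceLock = new Mutex();
let balance = 100;

async function withdraw(amount: number): Promise<boolean> {
  return balanceLock.runExclusive(async () => {
    if (balance < amount) return false;
    await new Promise((resolve) => setTimeout(resolve, 10)); // simulated async I/O
    balance -= amount;
    return true;
  });
}

Promise.all([withdraw(80), withdraw(80)]).then((results) => {
  console.log(results, "remaining:", balance); // [ true, false ] remaining: 20
});
```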
> The best design is the one where complexity is kept minimal, and where locality is kept maximum. That is where you get to write code that is easy to understand without having these bottomless holes of mindbogglingly complex CPU-dependent memory barrier behaviors. These designs are the easiest to deploy and write. You just make your load balancer cut the problem in isolated sections and spawn as many threads or processes of your entire single threaded program as needed
Wholeheartedly agree. That's exactly how Elixir and Erlang processes work, they are small, lightweight and have isolated heaps.
If the author wrote that sincerely, they should promptly drop Node.js on the floor and start learning Elixir.
> Say what you want about Node.js. It sucks, a lot.
I don't think so. V8 is a marvel of engineering. JS is the problem (quirky-to-downright-ugly language that is also extremely ubiquitous), not Node.
> But it was made with one very accurate observation: multithreading sucks even more.
Was it made with that observation? (In that case I would like a source to corroborate it.) Or was it simply that when all you have is a hammer (single-thread execution), everything looks like a nail (you're going to fix problems within that single thread)?
The evented/async/non-blocking style of programming that you need when serving a lot of requests from one thread already existed before Node. It was just not that popular. When you choose to employ this style of programming, all your I/O-heavy libraries need to be built for it, and they usually were not.
Since Node had no other options, all their IO libs were evented/async/non-blocking from the get go. I don't think this was a design choice, but more a design requirement/necessity.
It is 2025, multithread scalability is a well understood, if not easy, problem.
The reality is that hardware-provided cache coherence is an extremely powerful paradigm. Building your application on top of message passing not only gives away some performance, but it means that if you have any sort of cross thread logical shared state that needs to be kept in sync, you have to implement cache coherence yourself, which is an extremely hard problem.
With my apologies to Greenspun, any sufficiently complicated distributed system contains an ad-hoc, informally-specified, bug-ridden, slow implementation of MESI.
But of course, if you have a trivially parallel problem, rejoice! You do not need much communication and shared memory is not as useful. But not all, or even most, problems are trivially parallel.
"It is 2025, multithread scalability is a well understood, if not easy, problem."
In the 1990s it became "well known" that threading is virtually impossible for mere mortals. But this is a classic case of misdiagnosis. The problem wasn't threading. The problem was a lock-based threading model, where threading is achieved by identifying "critical sections" and trying to craft a system of locks that lets many threads run around the entire program's memory space and operate simultaneously.
This becomes exponentially complex and essentially infeasible fairly quickly. Even the programs of the time that "work" contain numerous bombs in their state space, they've just been ground out by effort.
But that's not the only way to write threaded code. You can go full immutable like Haskell. You can go full actor like Erlang, where absolutely every variable is tied to an actor. You can write lock-based code in a way that you never have to take multiple simultaneous locks (which is where the real murder begins) by using other techniques like actors to avoid that. There's a variety of other safe techniques.
I like to say that these take writing multithreaded code from exponential to polynomial, and a rather small polynomial at that. No, it isn't free, but it doesn't have to be insane, doesn't take a wizard, and is something that can be taught and learned with only reasonable level of difficulty.
Indeed, when done correctly, it can be easier to understand than Node-style concurrency, which in the limit can start getting crazy with the requisite scheduling you may need to do. Sending a message to another actor is not that difficult to wrap your head around.
So the author is arguably correct, if you approach concurrency like it's 1999, but concurrency has moved on since then. Done properly, with time-tested techniques and safe practices, I find threaded concurrency much easier to deal with than async code, and generally higher performance too.
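A toy illustration of the actor discipline the parent describes: the actor's state is private, and the only way to affect it is to send a message to its mailbox, which is drained one message at a time. (In single-threaded JavaScript this is trivially safe already; the point is that the same shape scales to threads or processes, which is what Erlang-style runtimes provide, along with scheduling, isolation, and supervision.)

```ts
// Toy actor: all state is private to the actor, and the only way to touch it
// is to send a message to its mailbox, processed one message at a time.
type CounterMsg =
  | { kind: "add"; amount: number }
  | { kind: "get"; reply: (value: number) => void };

class CounterActor {
  private count = 0;                 // private state, never shared
  private mailbox: CounterMsg[] = [];
  private draining = false;

  send(msg: CounterMsg): void {
    this.mailbox.push(msg);
    if (!this.draining) this.drain();
  }

  private drain(): void {
    this.draining = true;
    while (this.mailbox.length > 0) {
      const msg = this.mailbox.shift()!;
      if (msg.kind === "add") this.count += msg.amount;
      else msg.reply(this.count);
    }
    this.draining = false;
  }
}

const counter = new CounterActor();
counter.send({ kind: "add", amount: 2 });
counter.send({ kind: "get", reply: (v) => console.log("count =", v) }); // count = 2
```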
I see it the other way. I’ll admit that I do a lot of “embarrassingly parallel” problems where the answer is “Executor and chill” in Java. I have dealt with quite a few Scala systems that (1) didn’t get the same answer every time and (2) got a 250% speed-up with 8 cores and such, and the common problems were “error handling as monad theater”, “we are careful about initialization but couldn't care less about teardown (monads again!)” [1], and actors.
The choice is between a few days of messing around with actors (and it still doesn’t work) and 20 minutes rewriting with Executors and being done. The trick with threads is having a good set of primitives to work with, and Java gives you that. In some areas of software the idea of composing a minimal set of operations really gets you somewhere; when it comes to threads it gets you to the painhouse.
I went through a phase of having a huge amount of fun writing little servers/clients with async Python but switched to sync when the demands on CPU increased. The idea that “parallelism” and “concurrency” aren’t closely related is a bad idea, like the alleged clean split between “authentication” and “authorization”. Java is great because it gives you 100% adequate tools that handle parallelism and concurrency with the same paradigm.
[1] You could do error handling and teardown with monads, but, drunk on the alleged superiority of a new programming paradigm, many people don’t; so you meet the coders who travel from job to job like itinerant martial artists looking for functional programming enlightenment. TAOCP (Turing) stands the test of time whereas SICP (lambda calculus) is a fad.
I'm a huge fan of the actor model and message passing, so you do not have to sell it to me; I also strongly dislike the current async fad.
But message passing is not a panacea. Sometimes shared mutable state is the solution that is simplest to implement and reason about. If you think about it, what are databases if not shared mutable state? And they have been widely successful. The key is, of course, proper concurrency control abstractions.
"There's a variety of other safe techniques."
> multithread scalability is a well understood, if not easy, problem
As someone fairly well versed in MESI and cache optimization: it really isn't. It's a minority of people that understand it (and really, that need to).
> Building your application on top of message passing not only gives away some performance
This really isn't universally true either. If you're optimizing for throughput, pipelining with pinned threads + message passing is usually the way to go if the data model allows for it.
To be clear, I'm not claiming any universality. Quite the contrary, I'm saying that there is no silver bullet.
For modern large-scale production systems I agree except in cases where performance is critical. To choose a wild example - a microservice serving geospatial queries over millions of objects that can exist anywhere in the world has plenty of parallelism at the level of each query, but handling multiple queries can be done by scaling horizontally with multiple instances of the service.
Instead of "just throw it on a thread and forget about it" - in a production environment, use the job queue. You gain isolation and observability - you can see the job parameters and know nothing else came across, except data from the DB etc.
Younger me tried saturating CPU cores with event loops for a few years. Then I grew up and started using long-lived threads and queues to consolidate everything on the main thread, and it works a whole lot better for me. The async/await shit show is gone, the synchronization primitives you still somehow need for some reason with an event loop are mostly gone, and all interfaces use the same sync stack; it is paradise. It almost feels like you have to play around with async to really appreciate the simple and solid thread-safe queue approach.
Or you know, you could just use .NET?
I don't think it's useful to think like this when writing software. Sure, you must do more work when going multi-threaded. But that doesn't mean you are slower in wall clock time. And wall clock time is the cool kid.
I mean, if you go this route then you may as well say zero-copy doesn't exist. Every time you move things between registers, things are copied. I guess OP also disables all their cores and runs their OS on a single core. It's more efficient after all. I take the other view. The more effective CPU time you can use the better, for a ton of non-UI use cases. So I would say it is more efficient to use 2x the CPU time to reduce wall clock time by 10%, for example. In fact, any CPU time that is unused is inefficient in a way. It's just sitting there unused, forever lost to time.
Every time I try to increase the performance of my software by using multiple cores, I need a lot of cores to compensate for the loss of per-core efficiency. Like, it might run 2-3 times as fast on 8 cores.
I'm sure I've been doing it wrong. I just had better luck optimizing the performance per core rather than trying to spread the load over multiple cores.
Or your task needs the overhead to sync and read/write data. Only you can tell really with access to code/data, but 3x speed on 8x cores may well be the theoretical maximum you can do for this specific thing.
Synchronization overhead is more than people think, and it can be difficult to tell when you're RAM/cache-bandwidth limited. But it makes a difference if you can make the "unit of work" large enough.
Come on. This is a "mongo web scale" type of article.
CPU-bound applications MUST use multithreading to be able to utilize multiple cores. In many cases the framework knows how to give the developer an API that masks the need to deal with setting up a worker thread pool, as with web application frameworks - but eventually you need one.
Learn how to be an engineer, and use the right solution for the problem.
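As a sketch of the worker-pool idea above (not any particular framework's API): CPU-bound jobs are dispatched to a fixed set of worker threads, one per core, and replies are matched back by a job id. Real pools (e.g. the piscina package) handle queueing, backpressure, and errors properly; this also assumes a CommonJS build so __filename points at this module when it is re-run as a worker.

```ts
// Sketch: a naive worker-thread pool for CPU-bound jobs.
import { Worker, isMainThread, parentPort } from "node:worker_threads";
import os from "node:os";

if (isMainThread) {
  const workers = Array.from({ length: os.cpus().length }, () => new Worker(__filename));
  let nextWorker = 0;
  let nextId = 0;

  const runJob = (n: number): Promise<number> =>
    new Promise((resolve) => {
      const id = nextId++;
      const w = workers[nextWorker++ % workers.length]; // naive round-robin dispatch
      const onMessage = (msg: { id: number; result: number }) => {
        if (msg.id !== id) return;                      // reply for a different job
        w.off("message", onMessage);
        resolve(msg.result);
      };
      w.on("message", onMessage);
      w.postMessage({ id, n });
    });

  Promise.all([runJob(35), runJob(36), runJob(37)]).then((results) => {
    console.log(results);                               // three results, computed in parallel
    workers.forEach((w) => w.terminate());
  });
} else {
  // CPU-bound stand-in: naive Fibonacci.
  const fib = (n: number): number => (n < 2 ? n : fib(n - 1) + fib(n - 2));
  parentPort!.on("message", ({ id, n }: { id: number; n: number }) => {
    parentPort!.postMessage({ id, result: fib(n) });
  });
}
```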
What is his definition of "multi-threading"? Did he specify it? That frames the entire discussion. I took a quick search but saw no mention of it, though I might have missed where he discusses it.
shared-memory multiprocessing.
Since many comments here are critical of this article, are there any better sources on this topic? If yes, please share here. Thank you.
but why is my CPU at 1600% load all the time?
Mostly nonsense with a clickbait title.
(2023)
Dude got eviscerated in his own comment section. I don't think more really needs to be said.