> Look at a modern CPU die. See those frame pointer registers? That stack management hardware? That’s real estate. Silicon. Transistor budget.
You don't. What you see is caches, lots of caches. Huge vector register files. Massive TLBs.
Get real - learn something about modern chips and stop fighting 1980s battles.
You forgot about branch predictors.
This. They're shouting at the sky from whatever squirrels are running around their brains without understanding anything about real CPUs. It's somewhere between a shitpost and AI slop. They offered no better alternative except to bitch about how "imperfect" things are that work. I'd like to see their FOSHW RTL design, test bench, and formal verification for a green field multicore, pipelined, superscalar, SIMD RISC design competitive to RISC-V RV64GC or Arm 64.
Maybe Mill' CPU [1] or GreenArray's CPU [2] ?
The sad thing is that nobody is going to buy these chips except for niche products. You won't see them run a desktop or a smartphone, because if a CPU doesn't have a C compiler or a JavaScript JIT "out of the box", so to speak, it has no hope of entering mass-market products.
Even Java had no chance. Some older ARM CPUs had support for "Java bytecode execution" [3], but it never saw adoption.
Your comment makes me think that there is some truth in TFA's title: a trait of many mental illnesses is that the person is not aware they have it - Stockholm syndrome is probably not a mental illness, though. See what you are doing: gatekeeping alternatives that don't play by the rules of the dominant architectures ("pipelined, superscalar SIMD RISC design").
[1] https://millcomputing.com/
[2] Website currently has a certificate problem. IIRC, the GA144 is a 144-core asynchronous CPU. The cores' ISA is based on the primitives of the Forth language: next to no registers; instructions manipulate data on a data stack instead, separate from the call/return stack.
[3] https://en.wikipedia.org/wiki/Jazelle
Jazelle shambled a lot further into the future than any of us want to admit. I think there are still a few ARM chips in production with Java bytecode execution tacked on. Most of them only dropped it a few years ago.
I mean, you can't even see a single frame pointer register. The SRAMs dominate everything else.
Even the functional units are relatively cheap.
Weird article.
Hasn't most code compiled in the last few decades been using the x86 frame pointer register (ebp) as a regular register? And C also worked just fine on CPUs that didn't have a dedicated frame pointer.
AFAIK the concepts of the stack and 'subroutine call instructions' existed before C because those concepts are also useful when writing assembly code (subroutines which return to the caller location are about the only way to isolate and share reusable pieces of code, and a stack is useful for deep call hierarchies - for instance some very old CPUs with a dedicated on-chip call-stack or 'return register' only allowed a limited call-depth or even only a single call-depth).
Also it's not like radical approaches in CPU designs are not tried all the time, they just usually fail because hardware and software are not developed in a vacuum, they heavily depend on each other. How much C played a role in that symbiosis can be argued about, but the internal design of modern CPUs doesn't have much in common with their ISA (which is more or less just a thin compatibility wrapper around the actual CPU and really doesn't take up much CPU space).
It's because the historical perspective in this article is really lacking, despite that being the premise of the article.
Not only does the author seem to believe that C was the first popular high-level language, but the claim that hardware provided stack-based CALL and RETURN instructions is also not universally true. Many systems had no call/return stack, or supported only a single level of subroutine calls (essentially useless for a high-level language, but maybe useful for some hand-written machine code).
FORTRAN compilers often worked by dedicating certain statically allocated RAM locations near a function's machine code for the arguments, return value, and return address for that function. So a function call involved writing to those locations and then performing a branch instruction (not a CALL). This worked absolutely fine if you didn't need recursion. The real benefit of a stack is to support recursion.
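To make that concrete, here is a rough C transliteration of that scheme (a sketch with made-up names, not how any particular FORTRAN compiler actually laid things out): every argument, the result, and all locals of a routine live in fixed static storage, so a "call" is just a few stores, a branch, and a load - fine as long as the routine is never re-entered.

    #include <stdio.h>

    /* statically allocated "activation record" for dotprod */
    static double dotprod_a[3], dotprod_b[3];
    static double dotprod_result;

    static void dotprod(void) {          /* needs no parameters and no stack frame */
        dotprod_result = 0.0;
        for (int i = 0; i < 3; i++)
            dotprod_result += dotprod_a[i] * dotprod_b[i];
    }

    int main(void) {
        for (int i = 0; i < 3; i++) {    /* the "call": fill the argument slots... */
            dotprod_a[i] = i;
            dotprod_b[i] = 2.0 * i;
        }
        dotprod();                       /* ...branch there... */
        printf("%f\n", dotprod_result);  /* ...and read the result slot: 10.0 */
        return 0;
    }

The obvious casualty, as noted above, is recursion: a second activation would overwrite the first one's slots.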
I recall starting to program in BASIC on CP/M and ZX Spectrum machines, and they didn't have procedures, only GOTO. Just like in assembler, you can use all the JMPs you want and skip structured programming and procedures, but ... it will all become an unmaintainable mess in short order.
Very likely in a number of alternate futures (if not all of them), given the original set of CPU instructions, people would gravitate naturally to C and not some GOTO spaghetti or message passing or object oriented whatever.
I remember jumping out of GOSUBs on the Apple ][ and eventually running into an out-of-memory error as the stack on the 256-byte page $01 overflowed. As math was also done on the stack, math functions broke. Simplifying expressions only delayed the inevitable doom. I had to abandon the project, and only later understood that this was my first encounter with a memory leak.
The BASICs of the time often had GOSUB, which remembered the return address. Also, while the stack wasn't prominent in the high-level language, they very much used one at the assembly level. For example, on the C64 (and probably other 6502/6510-based systems) it always started at $100 ($ being the old convention for writing hex), right after the zero page.
GOSUB was definitely there.
Yeah, I think it was, but I was just starting out in programming and GOSUB didn't make much sense to me, as it's not a proper structured-programming procedure with input and output params but more like a hack with data passed through global variables. And by the time I learned structured programming I had already moved to Pascal (HiSoft Pascal: https://www.cpcwiki.eu/index.php/Hisoft_Pascal ) and there was no turning back to BASIC.
The first thing HiSoft Pascal did when you ran it was ask how much memory you wanted to reserve for the stack. (Or was that only HiSoft C? I had both.)
C can work just fine in systems without a stack. A very modern example: eBPF.
Funny thing is, dedicated call stacks made a return with shadow stacks to mitigate ROP attacks, so now we do this double accounting to check whether we are about to return to where we should.
"AFAIK the concepts of the stack and 'subroutine call instructions' existed before C because those concepts are also useful when writing assembly code [..]"
The stack principle was patented in 1957 by Friedrich L. Bauer. It was well known before C but in 1957 it was considered noteworthy enough to justify a patent.
You raise fair points about the history and flexibility of hardware design. Stacks and subroutines definitely predate C, and C has run well on many architectures without dedicated frame pointers. Those are valid technical observations.
But the article is about conceptual lock-in: how decades of mutual optimization between C-style abstractions and mainstream CPUs have made certain design assumptions feel "natural" or "inevitable." The author’s "Stockholm Syndrome" metaphor is provocative, sure, but it points to that inertia, not to a claim that hardware can’t do anything else.
So I'd say your comment addresses implementation details that are historically accurate, while the article is pointing to the sociotechnical feedback loop: how co-evolution between C and hardware subtly shapes what we think of as efficient or possible.
In my opinion, your point about radical CPU designs failing actually reinforces the article's argument more than it refutes it. The very reason such designs tend to fail is that the entire ecosystem of compilers, operating systems, and developer habits has been optimized around C-like models of computation. In other words, the co-evolution you describe is precisely the "Stockholm Syndrome" the author means. Hardware and software have adapted to each other so thoroughly that alternatives struggle to survive, not because they are inherently worse, but because they don’t fit the entrenched assumptions of the current hardware–software compact.
Yes! Finally a comment addressing the article. I think it would be nice to explore what the hardware would end up looking like if people had optimized it to mainly run Lisp, FORTRAN, Go, etc.
Would it have been better, faster, cheaper, more flexible, etc?
Just look at the Intel iAPX 432 [1] as an example of such an alternative CPU design: it was supposed to be the actual 8080 successor, and the 8086 was just meant to be a temporary throwaway solution until the 432 was ready. The rest is history, as they say ;)
Meanwhile, the C programming model turned out to be a pretty good fit for very different hardware architectures. All 3D API shader languages have a C heritage, for instance, despite GPUs being radically different from traditional CPUs. In the end only three things matter in hardware design: throughput, throughput and throughput ;)
[1] https://en.wikipedia.org/wiki/Intel_iAPX_432
Get yourself an FPGA dev board and get cracking. You can get useful ones for like $50 or less.
I've made a couple of simple soft CPUs for self-designed instruction sets, along with some compilers. It's really fun to think about which instructions to include and how they interact with the compiler.
The prevailing opinion seems to have shifted towards using the frame pointer for its special purpose, in order to improve debuggability/exception handling?
But the article is really quite misinformed. As you say, many mainframe/mini and some early microprocessor architectures don't have the concept of a stack pointer register at all, and neither do "pure" RISCs.
I'd argue there is a real "C Stockholm Syndrome" though, particularly with the idea of needing to use a single "calling convention". x86 had - and still has - a return instruction that can also remove arguments from the stack. But C couldn't use it, because historically (before ANSI C), every single function could take a variable number of arguments, and even nowadays function prototypes are optional and come from header files that are simply textually #included into the code.
So every function call using this convention had to be followed by "add sp,n" to remove the pushed arguments, instead of performing the same operation as part of the return instruction itself. That's 3 extra bytes for every call that simply wouldn't have to be there if the CPU architecture's features were used properly.
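A minimal sketch of the contrast, assuming a 32-bit x86 target where GCC/Clang honour these attributes (function names are made up):

    /* caller cleans up: each call site is followed by "add esp, 8" */
    int __attribute__((cdecl))   sum_c(int a, int b);

    /* callee cleans up: the function ends with "ret 8", nothing extra at the call site */
    int __attribute__((stdcall)) sum_s(int a, int b);

    int use_both(void) {
        /* the cdecl call carries the extra stack adjustment after it;
           the stdcall one folds that work into the callee's return */
        return sum_c(1, 2) + sum_s(3, 4);
    }

That "ret 8" is the argument-popping return instruction in question.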
And because the operating system and critical libraries "must" be written in C (that's just a fundamental physical law or something, you see), and we have to interface with them everywhere in our programs, and it's too complicated to keep track of which of the functions we call use this brain-damaged convention, the entire (non-DOS/Windows) ecosystem decided to standardize on it!
Probably as a result, Intel and AMD even neglected optimizing this instruction. So now it's a legacy feature that you shouldn't use anymore if you want your code to run fast. Even though a normal RET isn't "RISC-like" either, and you can bet it's handled as efficiently as possible.
Obviously x86-64 has more registers now, so most of the time we can get away without pushing anything on the stack. This time it's actually Windows (and UEFI) which has the more braindead calling convention, with every function being required to reserve space on its stack for its callees to spill the arguments. Because they might be using varargs and need to access them in memory instead of registers.
And then there's the stack pointer alignment nonsense, which is also there in the SysV ABI. See, C compilers like to emit hundreds of vector instructions instead of a single "rep movsb", since it's a bit faster. Because "everything" is written in C, this removed any incentive to improve this crusty legacy instruction, and even when it finally was improved, the vector instructions were still ahead by a few percent.
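For reference, "a single rep movsb" as a copy routine looks roughly like this with GCC/Clang inline asm on x86-64 (a sketch; whether it beats a vectorized memcpy depends on the microarchitecture and on ERMS/FSRM support):

    #include <stddef.h>

    static void copy_rep_movsb(void *dst, const void *src, size_t n) {
        /* rep movsb copies RCX bytes from [RSI] to [RDI] */
        __asm__ volatile ("rep movsb"
                          : "+D"(dst), "+S"(src), "+c"(n)
                          :
                          : "memory");
    }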
To use the fastest vector instructions, everything needs to be aligned to 16 bytes instead of 8. That can be done with a single "and rsp,-16" that you could place in the prologue of any function using these instructions. But because "everything" uses these instructions, why not make it a required part of the calling convention?
So now both SysV and Windows/UEFI mandate that before every call, the stack has to be aligned, so that the call instruction misaligns it, so that the function prologue knows that pushing an odd number of registers (like the frame pointer) will align it again. All to save that single "and rsp,-16" in certain cases.
Given this is rewinding to the 1970s, I expected a mention of CSP [0], or Transputers [1], or systolic arrays [2], or Connection Machines [3], or... The history wasn't quite as one-dimensional as this makes it seem.
[0] https://en.wikipedia.org/wiki/Communicating_sequential_proce...
[1] https://en.wikipedia.org/wiki/Transputer
[2] https://en.wikipedia.org/wiki/Systolic_array
[3] https://en.wikipedia.org/wiki/Connection_Machine
XMOS is still very much alive. But CSP comes with its own set of challenges, which ironically have forced one back to using shared memory.
As are systolic arrays; see Google's TPUs.
This is a poorly written nonsense article.
"In the beginning, CPUs gave you the basics: registers, memory access, CALL and RETURN instructions."
Well, CALL and RETURN need a stack: RETURN would need an address to return to. So there you go.
A concept of subroutine was definitely not introduced by C. It was an essential part of older languages like Algol and Fortran, and is inherently a good way to organize computation. E.g the idea is that you can implement matrix multiplication subroutine just once and then call it every time you need to multiply matrices. That was absolutely a staple of programming back in the day.
Synchronous calls offer a simple memory management convention: caller takes care of data structures passed to callee. If caller's state is not maintained then you need to take care of allocated data in some other way, e.g. introduce GC. So synchronous calls are the simpler, less opinionated option.
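In C that convention is about as plain as it gets (trivial sketch, made-up names): the callee only borrows storage whose lifetime is tied to the caller's frame, so nothing has to be tracked after the call returns.

    #include <stdio.h>

    static void fill_greeting(char *out, size_t cap) {
        snprintf(out, cap, "hello");      /* callee only borrows the buffer */
    }

    int main(void) {
        char buf[32];                     /* caller owns the storage */
        fill_greeting(buf, sizeof buf);
        printf("%s\n", buf);              /* still valid: the call was synchronous */
        return 0;                         /* buf dies with the caller's frame */
    }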
> Well, CALL and RETURN need a stack: RETURN would need an address to return to. So there you go.
I agree with the sentiment but you're making an impermissible contraction here. Look at how ARM calls work—the instructions are "bl <addr>" (branch and link) and "bx lr" (branch to register, lr). "bl" just stores the return address in the link register. The stack is something the machine code does on top of that. Of course the CPUs are optimised for standard ways of doing that, but it's not inherent to the system.
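Concretely, a leaf function compiled for 32-bit ARM typically never touches memory at all (sketch; exact output varies by compiler and options):

    int add3(int a, int b, int c) {
        return a + b + c;
    }
    /* roughly compiles to:
           add r0, r0, r1
           add r0, r0, r2
           bx  lr            @ the return address arrived in lr, not on any stack
       Only a non-leaf function has to spill lr somewhere, and that spill is where
       the conventional stack enters the picture. */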
We're talking about historic CPUs.
ARMv1 in 1985 had branch and link [1].
[1] https://en.wikichip.org/wiki/arm/armv1
>Pong had no software. Zero. It was built entirely from hardware logic chips - flip-flops, counters, comparators. Massively parallel.
And that is what FPGAs are for.
This strikes me as the author lacking hardware knowledge but still trying to write a post about hardware.
It’s interesting that not a single hardware concept is actually explored - nothing like a real alternative architecture or how one would program it. Just a lot of complaints about standard (sounds like x86 only at that) techniques without much depth.
You see innovation in this space a lot in research. For example, Dalorex [1] or Azul [2] for operations on sparse matrices. Currently a more general version of Azul is being developed at MIT with support for arbitrary matrix/graph algorithms with reconfigurable logic.
[1] https://ieeexplore.ieee.org/document/10071089
[2] https://ieeexplore.ieee.org/document/10764548
> …and has more computing power than the machines that sent humans to the moon.
We’ve all read this comparison many times. Is there any reason to think of sending people to the moon as a difficult computational problem though? Conversely, just because relatively little computation was needed to send someone to the moon does that necessarily mean that more computational power is a necessary path to doing something we’d all agree is more impressive, e.g. sending someone to Mars?
If you disallow recursion, or put an upper bound on recursion depth, you can statically allocate all "stack based" objects at compile time. Modula I did this, allowing multithreading on machines which did not have enough memory to allow stack growth. Statically analyzing worst case stack depth is still a thing in real time control.
Not clear that this would speed things up much today.
It's too constraining to statically analyze the whole program to determine the required stack size. You can't have calls via function pointers (even virtual calls), can't call third-party code without sources available, can't call any code from a shared library (since analyzing it is impossible). This may work only in very constrained environments, like the embedded world.
Mostly. Analyzing a shared library is possible. All you need is call graph and stack depth. If function pointers are type constrained, the set of callable functions is finite.
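A toy sketch of that analysis (invented function names and frame sizes): given each function's own frame size and the call graph, the worst case is just the heaviest root-to-leaf path, assuming no recursion so the graph is a DAG.

    #include <stdio.h>

    #define NFUNCS 4
    enum { MAIN, PARSE, EMIT, LOG };                       /* hypothetical functions */
    static const int frame[NFUNCS] = { 64, 128, 96, 32 };  /* own frame size, bytes  */
    static const int calls[NFUNCS][NFUNCS] = {             /* calls[a][b]: a calls b */
        [MAIN]  = { [PARSE] = 1, [EMIT] = 1 },
        [PARSE] = { [LOG] = 1 },
        [EMIT]  = { [LOG] = 1 },
    };

    /* depth-first walk: own frame plus the deepest callee subtree */
    static int worst(int f) {
        int deepest = 0;
        for (int g = 0; g < NFUNCS; g++)
            if (calls[f][g]) {
                int d = worst(g);
                if (d > deepest) deepest = d;
            }
        return frame[f] + deepest;
    }

    int main(void) {
        printf("worst-case stack from main: %d bytes\n", worst(MAIN));  /* 64+128+32 */
        return 0;
    }

Indirect calls then just mean widening calls[][] to include every function whose type matches the pointer.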
I think you would also need to ban threading and reentrancy (from signals)? Also, if you had function pointers, you would have to pessimize to storing each such function's stack memory separately I think (since the compiler wouldn't be able to figure out the call graph, in general)
> Not clear that this would speed things up much today.
It would likely slow things down, actually.
One advantage that a stack has is that the stack memory is "hot"--it is very likely in cache due to the fact that it is all right next to one another and used over and over and over. Statically allocated memory, by contrast, has no such guarantee and every function call is likely to be "cold" and need to pull in the cache line with its static variables.
Pretty clueless. When hardware with a new paradigm is invented that provides value, new software has been created for it. GPUs are a great example; TPUs/systolic arrays and the Cerebras device are others; all of these are pretty successful. CPU architecture is what it is because it's the best approach anyone has come up with for its domain, and C is built around that, not the other way around. If you want to claim CPUs can be done better, you're going to have to give something much more concrete than vague gesturing at how things could be different. How, specifically, do you intend to build an abstraction around Pong's approach? What does the programming model look like? Seems to me that this is a pretty difficult way to build anything, and Pong only managed it out of its simplicity.
The article doesn't make much sense; it rambles on about stack frames and calling conventions etc., when these don't matter very much on modern hardware and are hardly standardized (Windows and Linux don't use the same calling convention).
Modern hardware is about keeping the compute units as busy as possible with the fastest possible access to memory, pretty much the opposite of the proposed solution: message passing.
One of the big things this article fails to mention is that the TDP/heat budget is way more of a constraint than the number of transistors - at small feature sizes, silicon is (relatively) cheap, power isn't.
There's no way you can use 100% of your CPU at once - it would instantly overheat. So it suddenly makes even more sense to have optimised hardware units for all sorts of tasks (h264 encoding, crypto, etc.) if they can do the job more efficiently than general-purpose logic.
Check out Transputers, which were programmed via Occam. They do most of the stuff that the article desires, though their hardware is restricted to a matrix orientation.
Another option is Erlang. At the top level it is organized around micro-services instead of functions.
None of them are systems languages. The old hardware had weird data and memory formats. With C, a lot of assembler could be avoided when programming this hardware. It came as a default with Unix and some other operating systems. Fortran and Pascal were kind of similar.
The most-used default languages on most systems were interpreted, so you got LISP and BASIC. There is no fast hardware for that. To get things fast, one needed to program in assembler, unless there was a C compiler available.
The article (and a lot of comments here) confuses C with the psABI (the platform-specific ABI), which is what really defines the conventions — more than just calling conventions — for executable code.
Due to history, yes, most psABIs are "do what C does". But the real problem isn't that C is rigidly frozen, it's that psABIs are designed around these "classical" programming models. E.g. none of the Linux psABIs even has a concept of "message passing", let alone more creative deviations from 'good ole' imperative code.
You could say the same about JavaScript: JS is "fast" today, even though the language itself is hilariously inefficient - but browser vendors invested an ungodly amount of work into optimizing their engines, solving all kinds of crazy problems that wouldn't have been there in the first place if the language had been designed differently, until execution is now good enough that you can even use it for high-throughput server-side tasks.
Would some message passing hardware actually be "better" in terms of performance, efficiency, and ease of construction? I thought moving data between hardware subsystems is generally pretty expensive (e.g. atomic memory instructions, there's a significant performance penalty).
Disclaimer that I'm not a hardware engineer though.
Well, you can check out the iAPX432 to see an architecture with message passing hardware, and see how well it worked out in practice.
Might be relevant: https://yosefk.com/blog/the-high-level-cpu-challenge.html
Complete fantasy rubbish from typical software thinking. Skipped over all the reasons we no longer wire a Z80 directly to static RAM, and what that precipitates.
This is one of those "what if we're doing everything wrong and there's a better alternative?" articles that is entirely useless because it doesn't propose a single alternative.
The article is no more about transistors or literal frame-pointer hardware than a discussion about the QWERTY keyboard layout is about the metallurgy and mechanics of typewriter arms.
It's about how early design choices, once reinforced by tooling and habit, shape the whole ecosystem's assumptions about what’s "normal" or "efficient."
The only logical refutation of the article would be to demonstrate that any other computational paradigm, such as dataflow, message-passing, continuation-based, logic, actor, whatever, can execute on commodity CPUs with the same efficiency as imperative C-style code.
Saying "modern CPUs don't have stack hardware" is a bit like saying "modern keyboards don't jam, so QWERTY isn't a problem."
True, but beside the point. The argument isn't that QWERTY (or C) is technically flawed, but that decades of co-evolution have made their conventions invisible, and that invisibility limits how we imagine alternatives.
The author's Stockholm Syndrome metaphor isn't claiming we can’t build other kinds of CPUs. Of course we can. It's pointing out how our collective sense of what computing should look like has been quietly standardized, much like how QWERTY standardized how we type.
Saying that "modern CPUs are mostly caches and vector units" is like saying modern keyboards don't have typebars that jam. Technically true, but it misses that we're still typing on layouts designed for those constraints.
Dismissing the critique as fighting 1980s battles is like saying nobody uses typewriters anymore, so QWERTY doesn’t matter.
Pointing out that C works fine on architectures without a frame pointer is like noting that Dvorak and Colemak exist. Yes but it ignores how systemic inertia keeps alternatives niche.
The argument that radical CPU designs fail because hardware and software co-evolve fits the analogy: people have tried new keyboard layouts too, but they rarely succeed because everything from muscle memory to software assumes QWERTY.
The claim that CPU internals are now nothing like their ISA is just like saying keyboards use digital scanning instead of levers. True, but irrelevant to the surface conventions that still shape how we interact with them.
This dismissive pile-on validates the article's main metaphor of Stockholm Syndrome surprisingly directly!
The article is railing against how things are without offering even a glimpse of what could be improved in an alternate design (other than some nebulous talk about message passing - which I assume he means to be something akin to FPGAs?).
The alternatives are niche because they're not compelling replacements. A Colemak keyboard isn't going to improve my productivity enough to matter.
A DSP improves performance enough to matter. A hardware video decoding circuit improves performance enough to matter. A GPU improves performance enough to matter. Thus, they exist and are mainstream.
When we've found better abstractions that actually make a compelling difference, we've implemented them. Modern programming languages like Kotlin have advanced enough abstractions that they could actually be applied in exotically architectured CPUs. And yet such things are not used.
Big players with their own silicon like Apple and Google aren't sticking to the current general CPU architecture out of stubbornness. They look at the bottom line.
And the bottom line is that modern CPUs are efficient enough at their tasks that no alternative has disrupted them yet.
The article argues that our very sense of what counts as "better" or "compelling" is shaped by the assumptions baked into C-style hardware. Saying no alternative has disrupted them because modern CPUs are efficient enough just restates that bias. It assumes the current definition of efficiency is neutral, when that's exactly what’s being questioned.
The examples of GPUs, DSPs, and hardware video decoders don’t really contradict the article's point. Those are domain-specific accelerators that still live comfortably inside the same sequential, imperative model the author critiques. They expand the ecosystem, but don't escape its paradigm.
Your Colemak analogy cuts the other way: alternatives remain niche because the surrounding ecosystem of software, conventions, and training makes switching costly whether or not they are actually "better." That is the path dependence the article calls out.
As to the article's not proposing an alternative, it reads like a diagnosis of conceptual lock-in, not a design proposal. Its point is to highlight how tightly our notion of "good design" is bound to one lineage of thought. It is explicitly labeled as part two, so it may be laying groundwork for later design discussion. In any case, I think calling attention to invisible constraints is valuable in itself.
> The article argues that our very sense of what counts as "better" or "compelling" is shaped by the assumptions baked into C-style hardware.
And this is where I disagree. History is rife with disruptive technologies that blew the existing systems out of the water.
When we do find compelling efficiencies in new designs, we adopt them, such as DSPs and GPUs - which are NOT sequential or imperative; they are functional and internally parallel, and offer massive real-world gains, hence their success in the marketplace.
We also experiment with new ways of computing, such as quantum computers.
There's no shortage of attempts to disrupt the CPU, but none caught on, because none were able to show compelling efficiencies over the status quo, same as for Colemak and Dvorak: they are technically more efficient, but there's not enough of a real-world difference to justify the switching cost.
And that's fine. I don't want to be disruptively changing things at a fundamental level just for a few percent improvement in real-world efficiencies. And neither are the big boys who are actively developing not only their own silicon, but also their own software to go with it.
The article itself reads a lot like a Post hoc ergo propter hoc, in that it only allows for the path of technological progress to exist within the bounds of the C programming language (while also misattributing a number of things to C in the process), but completely discounts the possibility that how CPUs are designed is in fact a very efficient way to do general purpose computing.
Market success does not necessarily prove conceptual neutrality; it just shows which designs fit best within the existing ecosystem of compilers, toolchains, and developer expectations. That is the lock-in the article describes.
Calling the argument post hoc ergo propter hoc also misses the mark. The author is not saying CPUs look this way because of C in a simple cause-effect sense, but that C-style abstractions and hardware co-evolved in a feedback loop that reinforced each other’s assumptions about efficiency.
And I do not think anyone is advocating "disruption for disruption's sake." The point is that our definition of what counts as a worthwhile improvement is already conditioned by that same co-evolution, which makes truly different paradigms seem uneconomical long before they're fully explored.
We'll have to agree to disagree, then. Technologies such as the GPU provided such massive improvements that you either had to get on-board or be left behind. It was the same with assembly line vehicle production, and then robot vehicle production. Some technological enhancements are so significant that disruption is inevitable, despite current "lock-ins".
And that's fine. We're never going to reach 100% efficiency in anything, ever (or 90% for that matter). We're always going to go with what works now, and what requires the least amount of retooling - UNLESS it's such a radical efficiency change that we simply must go along. THOSE are the innovations people actually care about. The 10-20% efficiency improvements, not so much (and rightly so).
To say nothing of virtualization!
And we solve the inefficiency with hypervisors!
I agree with the conclusion, but nothing about the arguments to arrive at that conclusion sound true at all.
Hardware does not have Stockholm syndrome. Making chips that are better at the things people use those chips for is the correct thing to do. The CPU architecture is relatively static because making hardware for software which does not exist is a terrible idea. Look at Itanium, a total failure.
The C paradigm the author bemoans was never the only one, but it was good, it worked well, and that is why it is the default to this day. It is also good enough to emulate message passing and parallelism; it is totally non-obvious that throwing out this paradigm would be in any way beneficial over the current status quo. There is no reason for the abstractions we want to also be the abstractions the hardware uses.