HBM or not, those latest server chips are crazy fast and efficient. You can probably condense 8 servers from just a few years ago into one latest-gen Epyc.
I run BareMetalSavings.com[0], a toy for ballpark-estimating bare-metal/cloud savings, and the things you can do with just a few servers today are pretty crazy.
[0]: https://www.BareMetalSavings.com
Core counts have increased dramatically. The latest AMD server CPUs have up to 192 cores. The Zen1 top model had only 32 cores and that was already a lot compared to Intel. However, the power consumption has also increased: the current top model has a TDP of 500W.
Does absolute power consumption matter, or would it not be better to focus on per-core power consumption? E.g., running six 32-core CPUs seems unlikely to be better than one 192-core CPU.
Yes, per-core power consumption, or better yet performance per watt, is usually more relevant than total power consumption. And one high-core-count CPU is usually better than the same number of cores spread across multiple CPUs. (That is, unless you are trying to maximize memory bandwidth per watt.)
What I wanted to get at is that the pure core count can be misleading if you care about power consumption. If you don't and just look at performance, the current CPU generations are monsters. But if you care about performance/Watt, the improvement isn't that large. The Zen1 CPU I was talking about had a TDP of 180 W. So you get 6x as many cores, but the power consumption increases by 2.7x.
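To make that back-of-the-envelope comparison concrete, here is the arithmetic on the TDP figures quoted above. Note this only counts cores per watt and ignores IPC and clock improvements, so it understates the real efficiency gain:

    # Cores-per-watt from the TDP figures quoted above.
    # TDP is only a rough proxy for real power draw, and this ignores
    # per-core performance gains, so treat the last ratio as a lower bound.
    old_cores, old_tdp = 32, 180    # Zen 1 Epyc top model
    new_cores, new_tdp = 192, 500   # current 192-core top model

    print(new_cores / old_cores)                          # 6.0x the cores
    print(new_tdp / old_tdp)                              # ~2.78x the TDP
    print((new_cores / new_tdp) / (old_cores / old_tdp))  # ~2.16x more cores per watt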
Makes sense, thanks for the good reply.
A graph showing this against cloud instance costs and AWS profits would be funny.
That could be an interesting site when it's done, but I couldn't see where you factor in the price of electricity for running bare metal in a 24/7 climate-controlled environment, which I would expect is the biggest expense by far.
The first FAQ question addresses exactly that: colocation costs are added to every bare metal item (even storage drives).
Note that this isn't intended to be used for accounting, but for estimating, and it's good at that. If anything, it's more favorable to the cloud (e.g., no egress costs).
If you're on the cloud right now and BMS shows you can save a lot of money, that's a good indicator to carefully research the subject.
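To give a feel for the ballpark math, here is a simplified sketch. This is not the site's actual model, and every number below is a placeholder; plug in real quotes from your own providers:

    # Hypothetical ballpark comparison -- placeholder prices only.
    cloud_monthly_per_vm = 250.0     # placeholder: one on-demand cloud VM
    num_vms = 24

    server_price = 12_000.0          # placeholder: one high-core bare-metal server
    servers_needed = 3
    amortization_months = 36
    colo_per_server_monthly = 150.0  # placeholder: rack space, power, cooling, uplink

    cloud_monthly = cloud_monthly_per_vm * num_vms
    bare_metal_monthly = servers_needed * (server_price / amortization_months
                                           + colo_per_server_monthly)

    print(f"cloud:      ${cloud_monthly:,.0f}/mo")
    print(f"bare metal: ${bare_metal_monthly:,.0f}/mo")
    print(f"ballpark saving: ${cloud_monthly - bare_metal_monthly:,.0f}/mo")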
So currently consumer-grade CPUs with DDR5 are limited to less than 100 GB/s. Meanwhile Apple is shipping computers with multiples of that.
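That "less than 100 GB/s" figure falls straight out of the channel arithmetic for a typical dual-channel DDR5-6000 desktop:

    channels = 2                 # dual-channel consumer platform
    mt_per_s = 6000e6            # DDR5-6000: 6000 MT/s
    bytes_per_transfer = 8       # 64-bit channel
    peak = channels * mt_per_s * bytes_per_transfer / 1e9
    print(peak)                  # 96.0 GB/s theoretical peak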
On the other hand, I bought an 8C/16T Zen 4 laptop with 64GB RAM and a 4TB SSD for less than $2000 total including tax. I’ll take that trade.
How are 70b LLMs running on that?
Qwen Coder 32B Instruct is the state of the art for local LLM coding and will run with a smallish context on a 64 GB laptop with partial GPU offload. Probably around 0.8 tok/sec.
With a quantized version you can run larger contexts and go a bit faster: 1.4 tok/sec with an 8-bit quant and offload to a 6 GB laptop GPU.
Speculative decoding has been added to many of the runtimes recently and can give a 20-30% boost, with a 1-billion-parameter model generating the speculative token stream.
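If it helps anyone reproduce that kind of setup, a minimal llama-cpp-python sketch of partial GPU offload looks something like the following. The GGUF filename and layer count are placeholders; how many layers fit depends on the quant and on the ~6 GB of VRAM, and speculative decoding would need a separate small draft model on top of this:

    from llama_cpp import Llama

    # Placeholder path; use whatever Qwen2.5-Coder-32B quant you downloaded.
    llm = Llama(
        model_path="qwen2.5-coder-32b-instruct-q4_k_m.gguf",
        n_ctx=4096,        # smallish context so the KV cache fits in RAM
        n_gpu_layers=12,   # partial offload; tune until the GPU's ~6 GB is full
    )

    out = llm("Write a function that merges two sorted lists.", max_tokens=256)
    print(out["choices"][0]["text"])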
The free version of ChatGPT is better than your 70B LLM, so what's the point?
Why do you need 64GB RAM?
Partly because I can, because unless you go absolutely wild with excess it’s the RAM equivalent of fuck-you money. (Note it’s unified though, so in some situations a desktop with 48GB main RAM and 16GB VRAM can be comparable, and from what I know about today’s desktops that could be a good machine but not a lavish one.) Partly because I need to do exploratory statistics on, say, ten- or twenty-gigabyte I/O traces, and being able to chuck the whole thing into Pandas and not agonize over cleaning up every temporary is just comfy.
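As an illustration of that "chuck the whole thing into Pandas" workflow (the column names and dtypes below are made up; a real trace format will differ):

    import pandas as pd

    # Hypothetical I/O-trace schema; explicit dtypes keep the in-memory
    # footprint close to the on-disk size instead of ballooning into objects.
    dtypes = {"ts_ns": "int64", "dev": "category", "op": "category",
              "lba": "int64", "size": "int32", "latency_us": "int32"}

    df = pd.read_csv("io_trace.csv", dtype=dtypes)   # 10-20 GB? Just load it.
    print(df.groupby("op", observed=True)["latency_us"].describe())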
I have 128 GB in my PC (largely because I can), and Android Studio, a few containers, and running emulators will take a sizable bite out of that. My 18 GB MacBook would be digging into swap and compressing memory to get there.
Memory can get eaten up pretty quickly between IDEs, containers, and other dev tools. I have had a combination of a fairly small C++ application, CLion, and a container use up more than 32GB when combined with my typical applications.
I just built a new PC with 64GB for just this reason. With my workloads, the 32GB in my work laptop is getting cramped. For an extra $150 I can double that and not worry about memory for the next several years.
Running 128 GiB of RAM on the box I am typing on. I could list a lot of things but if you really wanted a quick demonstration, compiling Chromium will eat 128 GiB of RAM happily.
Nobody ever regretted having extra memory on their computer.
Maybe because of electron apps
Several Electron apps and 1000+ Chrome tabs. (just guessing)
Strix Halo is rumored to be about twice as fast but unfortunately not near Apple's speed.
This is very unlikely, but it would be interesting if Apple included HBM memory interfaces in the Max series of Apple Silicon, to be used in the Mac Pro (and maybe the Studio, but the Pro needs some more differentiation, like HBM or a NUMA layout).
They'd have to redesign the on-die memory controller and tape out a new die, all of which is expensive. Apple is a consumer technology company, not a cutting-edge tech company making high-cost products for niche markets. There's just no way to make HBM work in the consumer space at the current price.
Well, they could put in a memory controller for both DDR5 and HBM on the die, so they would only have one die to tape out.
The Max variant is something they are using in their own datacenters. It's possible that they would use HBM solely for themselves, but it would be cheaper overall if they did the same thing for workstations.
HBM has a very wide, relatively slow interface. An HBM PHY is physically large and takes up a lot of beachfront, a massive waste of area (money) if you're not going to use it. It also (currently) requires you to use a silicon interposer, another huge extra expense in your design.
> An HBM PHY is physically large and takes up a lot of beachfront, a massive waste of area (money) if you're not going to use it.
The M3 Max dropped the area for the interposer to connect two chips, and there was no resulting Ultra chip.
But the M1 Max and M2 Max both did.
I have yet to see an x-ray of the M4 Max to see if they have built in support for combining two, have used area for HBM or anything exotic, but they have done it before.
Could you recognize HBM support in an x-ray?
As for the Ultra, they used to have 2.5 TB/s of interprocessor bandwidth years ago based on M1, so I hope they would step that up a notch.
I don’t put much stock in the idea of the 4 or 8 way hydra. I think HBM would be more useful, but I’m just a rando on the interwebs.
> It also (currently) requires you to use a silicon interposer, another huge extra expense in your design.
Guess what the Ultra chips use? That’s right, a silicon interposer. :)
OK, but that's an ultra expensive chip, pretty much by definition. The suggestion was to burden other products with that big expense, and that doesn't make sense to me.
I guess I was pointing out that they already do that with the UltraFusion interconnect that’s on the Max chip found in notebooks but never used there.
But the more I think about it, the more I bet they are creating a native Ultra chip that is not a combo of two Max chips.
I bet the Ultra will have the interconnect so you can put two together and get the often rumored Extreme chip.
They will have enough volume for their own datacenters that the Mac Studio and Mac Pro will simply be consumer beneficiaries.
It makes more sense in this framing to put HBM on these chips. And no DDR5.
In this case, the M4 Max has neither HBM nor the interconnect. I’d love to see someone de-lid and get an X-ray die shot.
The MacPro is not a consumer device. It is very much a high cost niche (professional) product.
It may be priced like one but the technology inside it isn't.
People are buying dual Epyc Zen 5 systems to get 24 channels of DDR5-6000 memory bandwidth for inferencing large LLMs on CPU. Clearly there is a demand for very fast memory.
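The appeal is easy to quantify: in the memory-bound decode phase, every generated token has to stream the full set of weights, so peak bandwidth divided by model size gives a rough upper bound on tokens per second (ignoring KV-cache traffic and NUMA effects):

    channels = 24                       # 2 sockets x 12 DDR5-6000 channels
    peak_gb_s = channels * 6000e6 * 8 / 1e9
    print(peak_gb_s)                    # 1152.0 GB/s theoretical peak

    model_gb = 140                      # e.g. a 70B model at ~2 bytes per weight
    print(peak_gb_s / model_gb)         # ~8 tok/s upper bound for decode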
Sure, implicit finite element analysis scales up to two cores per DDR4 channel. Core density just grew faster than bandwidth, which makes all those high-core-count CPUs a waste for these kinds of workloads.
I’m having trouble parsing the article even though I know full well what the MI300 is and what HBM is.
I’m not alone, right? This article seems to be complete AI nonsense, at various points confusing the GPU and CPU portions of the product and not at all giving clarity on which parts of the product have HBM.
I agree, the article lacks clarity, jumping between three different AMD models and an Intel one. I'd suggest its flaws hint at a human writer more than an AI.
Does it make sense to put HBM memory on mobile computing like laptops and smartphones?
They'd be very expensive. Is there really a consumer market for large amounts (tens of GB) of RAM with super high (800+ GB/s) bandwidth? I guess you'll say AI applications, but doing that amount of work on a mobile device seems mad.
Yeah, I feel similarly about the development of NPUs. I guess it might be useful if we find more, maybe non-AI, uses for high-bandwidth memory that are needed on the edge and not in centralized servers.
This website and its spend-128-hours-disabling-1024-separate-cookies-and-vendors is pure cancer, I wish HN would just ban all these disgusting data hoovering leeches.
By now I get that no one else cares and I should just stop coming here.
That entire site is 100% AI-generated click farming. The fact that the top comments here are not even talking about the content of the article, but instead more general "HBM is great", worries me.
For anyone that read the article: which product has HBM attached? The CPU or the GPU? What is the name of this product?
There’s literally nothing specific in here and the article is rambling AI nonsense. The whole site is a machine gun of such articles.
1000% agreed, especially the way people silently carry on with the topic without acknowledging that the actual link is cancer. I know, e.g., dang is a real person who works hard to keep HN decent, but wtf is this, are we just going to accept Daily Mail links too?
I weep for the internet we had as children.
These kinds of articles are like denial-of-service attacks on human attention. If I read just a few of those, I would be confused for the rest of the day.
Epyc 9V64H
It's far from an ideal solution, but I've started to just have JS disabled by default in uBlock Origin and then enable it manually on a per-site basis. Bit of a hassle, but many sites, including this one, render just fine without JS and are arguably better without it.
That describes almost all news sites even if most of them don't make it that obvious. Just open them in web.archive.org and avoid all of that.
Nope, you are not alone. Without uBlock Origin I wouldn't go much anywhere these days.
Agreed, I wouldn't mind banning paywall crap while we're at it.
Another article source that uses an initialism in its headline, "HBM"[0] in this case, and almost 30 times at that, yet doesn't spell out what it stands for even once. I will point this out every time I see it, and continue to refuse to read from places that don't follow this simple etiquette.
Be better.
[0] High Bandwidth Memory
"Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead."
https://news.ycombinator.com/newsguidelines.html
While I understand, I can't find something interesting to respond to in this case because I refused to read it.
I'd argue that I also provided value by solving the complaint I made by spelling out what it stood for, for those who might not know.
I hear you and agree there's benefit in that; it's just that the cost (what it does to the thread) is a lot larger than the benefit.
I don't feel like that rule works here? If you cut out part of the second sentence to get "Find something interesting to respond to", that's a good point, but the full context is "instead [of the most provocative thing in the article]" and that doesn't fit a complaint about acronyms.
To paraphrase McLuhan, you don't like that guideline? We got others:
"Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting."
The point, in any case, is to avoid off-topic indignation about tangential things, even annoying ones.
Yeah that one works.
They don't define HPC either, but I think the audience of this site knows these acronyms.
There were two reasons it was drilled into me in engineering school. The first is that it provides context and avoids doubt about the topic, particularly when there are so many overlapping initialisms these days, often in the same space.
The second is that you should never make assumptions about the audience of your writing and their understanding of the topic; provide any and all information that might be pertinent for a non-subject-matter specialist to understand, or at least to find the information they need to understand it.
> Be better
They're almost certainly not on this forum, and they're not reading your post. So who is that quip directed at?
> They're almost certainly not on this forum, and they're not reading your post
I don't know much about the site in the OP, but I work on the assumption that almost anyone could be reading comments on links to their site on this forum.
It's directed at them, you and even myself.
Presumably it's directed at anyone writing an article for public consumption.