This week, Google Cloud paid out their highest bug bounty yet ($150k) for a vulnerability that could have been prevented with ASI [0]. Good to see that Google is pushing forward with ASI despite the performance impact, because it would benefit the security of all hosting companies that use Linux/KVM, not just big tech's cloud providers.
[0] https://cyberscoop.com/cloud-security-l1tf-reloaded-public-c...
When enabling this new protection, could we potentially disable other mitigation techniques that become redundant, and thereby regain some performance?
Yes! The numbers in the posting don't account for this.
Before doing this, though, you need to be sure that ASI actually protects all the memory you care about. The version that currently exists protects all user memory, but if the kernel copies something into its own memory, that copy is no longer protected. So that needs to be addressed first (or some users might tolerate this risk).
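If it helps, here's a minimal sketch of checking what the kernel currently reports before turning anything off, assuming a Linux kernel that exposes the standard sysfs vulnerabilities directory:

    # List the speculative-execution vulnerabilities the kernel knows about and
    # the mitigation status it reports for each - a useful baseline before
    # deciding which mitigations might become redundant under ASI.
    from pathlib import Path

    VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

    for entry in sorted(VULN_DIR.iterdir()):
        # Each file holds a one-line status such as "Mitigation: PTI" or "Not affected".
        print(f"{entry.name:25s} {entry.read_text().strip()}")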
My understanding was that many of the fixes for speculative execution issues themselves led to performance degradation. Does anyone know the latest on that, and how this compares?
Are these performance hit numbers inclusive of turning off the other mitigations?
There's essentially only one way[0] to fix timing side channels.
The RISC-V ISA has an effort underway to standardize a timing fence[1][2], to take care of this once and for all.
0. https://tomchothia.gitlab.io/Papers/EuroSys19.pdf
1. https://lf-riscv.atlassian.net/wiki/spaces/TFXX/pages/538379...
2. https://sel4.org/Summit/2024/slides/hardware-support.pdf
I'm all for giving programmers a way to flush state, and maybe this is just a matter of taste, but I wouldn't characterize this as "taking care of the problem once and for all" unless there's a [magic?] way to recover from the performance trade-off that you'd see in "normal" operating systems (i.e. not seL4).
It doesn't change the fact that when you implement a RISC-V core, you're going to have to partition/tag/track resources for threads that you want to be separated. Or, if you're keeping around shared state, you're going to be doing things like "flush all caches and predictors on every context switch" (can't tell if that's more or less painful).
Anyway, that all still seems expensive and hard regardless of whether or not the ISA exposes it to you :(
The research they've done (multiple papers; they've published and presented more than I linked) proves that hardware help is necessary.
I.e. it's not about reducing the cost, but about being able to kill timing side channels at all.
These numbers are all vs. a completely unmitigated system. And this is an extra-expensive version of ASI that does more work than really needed on this hardware, to ensure we can measure the impact of the recent changes. (Details of this are in the posting.)
So I should probably post something more realistic and compare against the old mitigations. This will make ASI look a LOT better. But I'm being very careful to avoid looking like a salesman here. It's better that I risk making things look worse than they are than risk having people worry I'm hiding issues.
Not sure if you wrote this article, and I appreciate the engineering desire to undersell, but if this is faster than what people actually do in practice, the takeaway is different than if it's slower. So I think you're doing folks a disservice by not comparing against a realistic baseline in addition to an unmitigated one.
Furthermore, if the OS-level mitigations are in place, would the hardware ones be disabled?
That's still really massive. It would only make sense in very high-security environments.
Honestly, running system services in VMs, or an OS like Qubes, would be cheaper and just as good. The VM hit is much smaller: less than 1% in some cases on newer hardware.
It makes sense in any environment where you have two workloads from two parties sharing compute, e.g. public clouds.
The protection here is to ensure the VMs are isolated. Without it, there's the potential to leak data across guests via speculative execution.
VMs suffer from memory overhead. It would be cool if the guest kernel cooperated with the host on that.
There's KSM that should help: https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
It probably works best when running VMs with the same kernel and software versions.
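For what it's worth, here's a rough sketch of checking how much KSM is actually deduplicating on the host, assuming a kernel built with KSM support (which exposes the usual /sys/kernel/mm/ksm files):

    # Inspect KSM on the KVM host: is it scanning, and how much memory is
    # actually being merged across guests?
    from pathlib import Path

    KSM = Path("/sys/kernel/mm/ksm")

    running = KSM.joinpath("run").read_text().strip()         # "1" means the KSM scanner is active
    shared = int(KSM.joinpath("pages_shared").read_text())    # distinct merged pages in use
    sharing = int(KSM.joinpath("pages_sharing").read_text())  # additional mappings deduplicated onto them

    print(f"KSM running: {running}")
    print(f"{sharing} deduplicated mappings backed by {shared} shared pages")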
But that just seems to reintroduce the same problem again:
> However, while KSM can reduce memory usage, it also comes with some security risks, as it can expose VMs to side-channel attacks. ...
It will! For Linux hosts and Linux guests, if you use virtio and memory ballooning.
This was an issue for me a few years ago running Docker on macOS. macOS required you to allocate memory to Docker ahead of time, whereas Windows/Hyper-V was able to use memory ballooning in WSL2.
It's possible to address this to some extent with memory-ballooning drivers, etc.
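As a rough illustration of what driving that looks like from the host side, here's a sketch using QEMU's QMP control protocol; the socket path is hypothetical, and it assumes the guest was started with a virtio-balloon device and a QMP unix socket (-qmp unix:...,server,nowait):

    # Ask a QEMU/KVM guest to give memory back via its virtio-balloon device,
    # by talking QMP (QEMU's JSON control protocol) over a unix socket.
    import json
    import socket

    QMP_SOCK = "/var/run/qemu/guest1.qmp"  # hypothetical path for this example
    TARGET_BYTES = 2 * 1024**3             # ask the guest to shrink to 2 GiB

    def qmp(f, cmd):
        f.write(json.dumps(cmd) + "\n")
        f.flush()
        return json.loads(f.readline())    # NB: a real client would also filter async events

    with socket.socket(socket.AF_UNIX) as s:
        s.connect(QMP_SOCK)
        f = s.makefile("rw")
        f.readline()                                  # discard the QMP greeting banner
        qmp(f, {"execute": "qmp_capabilities"})       # leave capabilities-negotiation mode
        qmp(f, {"execute": "balloon", "arguments": {"value": TARGET_BYTES}})
        print(qmp(f, {"execute": "query-balloon"}))   # {"return": {"actual": <bytes>}}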
Look at it this way: any time a new side-channel attack comes out, the situation changes. Having this as a mitigation that can be turned on is helpful.
The next steps should make this much faster. Google's internal version generally gives us a sub-1% hit on everything we measure.
If the community is up for merging this (which is a genuine question - the complexity hit is significant), I expect it to become the default everywhere, and for most people it should be a performance win vs. the current default.
But, yes, it's not there right now, which is annoying. I'm hoping the community is willing to start merging this anyway, trusting that we can get it really fast later. But they might say "no, we need a full prototype that's super fast right now", which would be fair.
From reading the article, that seems to be exactly the feeling of the people involved as well. The question is whether they're on track towards, e.g., the 1% eventually.
Windows suffers from similar effects when Virtualization-Based Security is active.
At the same time, VBS is one of the biggest steps forward in terms of Windows kernel security. It's actually considered a proper security boundary.
Funny that they called it VBS.
That's not something I'd easily associate with a step forward in security.
Hypervisor overhead should be low: https://www.howtogeek.com/does-windows-11-vbs-slow-pc-games/
What kind of workloads have noticeably lower performance with VBS?
It was measured to have a performance impact of up to 10%, with even higher numbers for the nth percentile lows: https://www.tomshardware.com/news/windows-vbs-harms-performa...
Overhead should be minimal, but something is preventing it from working as well as it theoretically should. AFAIK Microsoft has been improving VBS, but I don't think it's completely fixed yet.
BF6 requiring VBS (or at least "VBS-capable" systems) will probably force games to find a way to deal with VBS as much as they can, but for older titles it's not always a bad idea to turn off VBS to get a less stuttery experience.
VBS requires Hyper-V to be enabled, and it "owns" the CPU virtualization hardware, so I can't use VMware Workstation, which is very annoying.
VMware Workstation [0] (and I thought VirtualBox, though I can't find any official docs [1]) should be able to use the Hyper-V hypervisor via WHP.
QEMU can also use WHP via --accel whpx.
[0] - https://techcommunity.microsoft.com/blog/virtualization/vmwa...
[1] - https://www.impostr-labs.com/use-hyper-v-and-virtualbox-toge...
It works indeed, but the performance drop is quite drastic.
As a network engineer, I mainly like VMware Workstation because of its awesome virtual network editor, which lets me easily build complex topologies, but it doesn't work when you use Hyper-V.
Same. Have to disable VBS for VirtualBox, and it gets more and more obscure with each update because some features like Windows Hello force it back on.
BF6 requires this? Is there any official article/link about this? Thank you!
The closest so far (I don't know the specifics of VBS vs. Secure Boot):
https://news.ycombinator.com/item?id=44805565 Secure Boot is a requirement to play Battlefield 6 on PC
> It's the Javelin Anti cheat system which forces the use of secure boot
Thanks! I've found this: https://www.reddit.com/r/Battlefield/comments/1mebjom/tpm_20...
We're working on HPC / graphics / computer-vision software and noticed a particularly nasty issue with VBS enabled just last week. Although, it has to be mentioned that this was on Win10 Pro.
This most likely comes from IOMMU - disable it.
That'd break a lot of GPU setups.
Only if you want to virtualize it or run VMs; for VBS, turning off the IOMMU simply disables hardware PCIe memory-space isolation. (With the IOMMU on, each PCIe device gets an isolated memory buffer.)
Anything that runs on an ISA that has certain features has these effects, IIRC.
Sometimes a part of me starts wondering whether this regularly occurring slowing of chips through exploit mitigations is deliberate.
All of big tech wins: CPUs get slower, and we need more vCPUs and more memory to serve our JavaScript slop to end customers. The hardware companies sell more hardware, the cloud providers sell more cloud.
I think it's more pragmatic. We can eliminate hyperthreading to solve this, or increase memory safety at the cost of performance. One is a 50% hit in terms of vCPUs; the other is now sub-50%.
They also need some phony justifications, though.
Can't just turn off hyperthreading.
These types of mitigations have the biggest benefit when resources are shared. Do you really think cloud vendors want to lose performance to CPU or other mitigations when they could literally sell those resources to customers instead?
They don't lose anything, since they sell the same instance, which just performs worse with the mitigations on. Customers end up paying because they need more instances.
Every CPU that isn’t pegged at 100% all the time is leaving money on the table. Some physical CPU capacity is reserved, some virtual CPU capacity is reserved, the rest goes to ultra-high-margin elastic compute that isn’t sold to you as a physical or virtual CPU. They sell it to you as “serverless,” it prints cash, and it absolutely depends on juicing every % of performance out of the chips.
edit: “burstable” CPUs are a fourth category relying on overselling the same virtual CPU while intelligently distributing workloads to keep them at 100%.
I imagine they're unable to squeeze as many instances onto their giant computers, though.
There are 3-4 year old servers with slower/fewer cores still operating fine, and newer servers operating as well. The generational improvements seem to outweigh a lot of the mitigations in question, not to mention the higher levels of parallel work.
Sometimes it's fun to engage in a little conspiratorial thinking. My 2 cents... That TPM 2.0 requirement on Windows 11 is about to create a whole ton of e-waste in October (Windows 10 EOL).
I'm not so sure. Many people still ran Windows XP/7 long after the EOL date. Unless Chrome, Steam, etc. drop support for Windows 10, I don't think many people will care.
The home PC market is insignificant. The real volume is in corporate and government systems that will never run EOL Windows.
Side Note: Folks, don't run EOL operating systems at home. Upgrade to Linux or BSD, and your hardware can live on safely.
There are many, many Windows XP systems still running today in many corporate and probably government environments too. Even more Win 7 ones. There will be special contracts, workarounds, waivers, etc., all to avoid changing OS.
> Folks, don't run EOL operating systems at home.
Especially not EOL Windows.
Hey, it's not nice to call Linux users "e-waste."
Why would big tech do this when customers bring it upon themselves by building JavaScript slop?
Big tech isn't running their stack on JS.
Maybe, but their cloud customers certainly are.
All the large cloud-hosted infra I've encountered in my career was written in JIT- or AOT-compiled languages (C#, Java, Golang, etc.). This is basically necessary at any sort of scale.
Cloud usage is dominated by larger companies with much older codebases that predate modern JS backend development.
As long as the customer pays, why wouldn't they promote an option which makes them more profit?