On my desktop system, most of my problems with swap come from dealing with the aftermath of an out-of-control process eating all my RAM. In this case, the offending program demands memory so quickly that everything belonging to legitimate programs gets swapped out. Those programs then run poorly for anywhere from several minutes to an hour depending on usage, since the OS only swaps pages back in once they are referenced, even when there is plenty of free RAM that isn't even being used for disk cache.
Eventually I wrote a small script that does the equivalent of "sudo swapoff -a && sudo swapon -a" to eagerly flush everything to RAM, but I was surprised by how many people seemed to think there's no legitimate reason to ever want to do so.
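For reference, a minimal sketch of what such a script can look like (hypothetical, and assuming the used swap actually fits back into free RAM - otherwise swapoff will thrash or fail):

    #!/bin/sh
    # Move everything in swap back into RAM, then re-enable swap.
    # In procps free -m output, "available" is field 7 of the Mem:
    # line and "used" is field 3 of the Swap: line.
    avail_mb=$(free -m | awk '/^Mem:/ {print $7}')
    swap_mb=$(free -m | awk '/^Swap:/ {print $3}')
    if [ "$swap_mb" -lt "$avail_mb" ]; then
        sudo swapoff -a && sudo swapon -a
    else
        echo "not enough free RAM to absorb ${swap_mb} MiB of swap" >&2
        exit 1
    fi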
That works if there is enough memory after the "bad" process has been killed. The question is, is it necessary? Many systems can live with processes performing a little bit poorly for some minutes and I wouldn't do it.
It's fine that "many systems" can. But there's no easy remedy for when the user or system can't. Flushing back to RAM is slow - that's not controversial. So it would help if there were a way to do this in advance of the need, for the programs where that matters.
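You mean like vmtouch and madvise?
I use vmtouch all the time to preload or even lock certain data/code into RAM.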
> The question is, is it necessary? Many systems can live with processes performing a little bit poorly for some minutes and I wouldn't do it.
The outage ain't resolved until things are back to operating normally.
If things aren't back to 100% healthy, could be I didn't truly find the root cause of the problem - in which case I'll probably be woken up again in 30 minutes when the problem comes back.
The article doesn't mention memory compression, which many Linux distributions enable by default, as an alternative to swap.
On the other hand, these days the latest SSDs are way faster than memory compression, even with LUKS encryption on and even when the compression uses LZ4. Plus, modern SSDs don't suffer from frequent writes the way older ones did, so on my laptop I disabled memory compression, and then all the reasoning from the article applies again.
On a development laptop running compilations/containers/VMs/a browser, vm.swappiness does not seem to matter that much if one has enough memory. So I no longer tune it to 100 or more and leave it at the default of 60.
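> these days the latest SSDs are way faster than memory compression
That's a really provocative claim. Any benchmarks to support this?
I wish people would actually read TFA instead of reflexively repeating nonsensical folk remedies.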
I've been telling people about this since the days when there were operating systems still around that actually did swapping (16-bit OS/2, old Unix, Standard Mode DOS+Windows) rather than paging (32-bit OS/2, 386 Enhanced Mode DOS+Windows, Windows NT). I wrote a Frequently Given Answer about it in 2007, having had to repeat the point so many times since the mid-1990s; and I was far from alone even then.
* http://jdebp.uk./FGA/dont-throw-those-paging-files-away.html
The erroneous folk wisdom is widespread. It often seems to lack any mention of the concepts of a resident set and a working set, and is always mixed in with a wishful thinking idea that somehow "new" computers obviate this, when the basic principles of demand paging are the same as they were four decades ago, Parkinson's Law can still be observed operating in the world of computers, and the "new" computers all of those years ago didn't manage to obviate paging files either.
The swapfile.sys in Windows 8+ is used for process swapping (moving the entire private working set out of memory to disk), but only for UWP applications.
The recognition that older Linux swap strategies were sometimes unhelpful, which this piece of writing offers, validates our past sense that it wasn't working well. Regaining trust takes time.
Sometimes I think if backing store and swap were more clearly delineated we might have got to decent algorithms sooner. Having a huge amount of swap pre-emptively claimed was making it look like starvation, when it was just a runtime planning strategy. It's also confusing how top and vmstat report things.
Also, as a BSD mainly person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD.
Ancient model: twice as much swap as memory
Old model: same amount of swap as memory
New model: amount of swap your experience tells you this job mix demands to manage memory pressure fairly, which is a bit of a tall ask sometimes, but basically pick a number up to memory size.
> Also, as a BSD mainly person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD.
BSD allocators simply return errors if no more memory is available; for backwards compatibility reasons Linux is stuck with a fatally flawed API that doesn't.
You can trivially disable overcommit on Linux (vm.overcommit_memory=2) to get allocation failures instead of OOMs. But you will find yourself spending a lot more money on RAM :)
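And debug many tools which still ignore the fact that malloc can fail.
I assumed the same, but just discovered that FreeBSD has vm.overcommit too. I'm not sure how it works, though.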
Overcommit is subtle. If you allocate a bunch of address space and don't touch it, that's one thing.
If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later.
My understanding (which could very well be wrong) is Linux overcommit will continue to allocate address space when asked regardless of memory pressure; but FreeBSD overcommit will refuse allocations when there's too much memory pressure.
I'm pretty sure I've seen FreeBSD's OOM killer, but it needs a specific pattern of memory use; it's much more likely for an application to get a failed allocation and exit, freeing memory, than for all the applications to have unused allocations that they then use.
All that said, I prefer to run with a small swap, somewhere around 0.5-2GB. Memory pressure is hard to measure (although recent Linux has a measure, PSI, that I haven't used), but swap % and swap i/o are easy to measure. If your swap grows quickly, you might not have time to do any operations to fix it, but your stats should tell the tale. If your swap grows slowly enough, you can set thresholds and analyze the situation. If you have a lot of swap i/o, that provides a measure of urgency.
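As a sketch of that kind of threshold check (field positions assume procps free; the 50% threshold is an arbitrary example):

    # Warn when swap is more than 50% used.
    used=$(free -m | awk '/^Swap:/ {print $3}')
    total=$(free -m | awk '/^Swap:/ {print $2}')
    if [ "$total" -gt 0 ] && [ $((100 * used / total)) -gt 50 ]; then
        echo "warning: swap ${used}/${total} MiB in use"
    fi
    # The si/so columns are swap-in/swap-out rates - the urgency signal.
    vmstat 5 3
    # Recent kernels also expose the pressure-stall measure (PSI):
    cat /proc/pressure/memory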
For modern Linux servers with large amounts of RAM, my rule of thumb is between 1/8 and 1/32 of RAM, depending on what the machine is for.
For example, one of my database servers has 128GB of RAM and 8GB of swap. It tends to stabilize around 108GB of RAM and 5GB of swap usage under normal load, so I know that a 4GB swap would have been less than optimal. A larger swap would have been a waste as well.
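I no longer use disk swap for servers, instead opting for Zram with a maximum of 50% of RAM capacity and a high swappiness value.
It'd be cool if Zram could apply to the RAM itself (like macOS) rather than needing a fake swap device.
Lookie lookie! Isn't it spooky?
https://github.com/CachyOS/CachyOS-Settings/blob/master/usr/...
Resulting in https://i.postimg.cc/hP37vvpJ/screenieshottie.png
Good enough...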
Yeh. I haven't yet figured out how to get zram to apply transparently to containers though, anything in another memory cgroup will never get compressed unless swap is explicitly exposed to it.
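zswap
https://docs.kernel.org/admin-guide/mm/zswap.html
The cgroup accounting also now works in zswap.
Zswap requires a backing disk swap, Zram does not.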
The proper rule of thumb is to make the swap large enough to keep all inactive anonymous pages after the workload has stabilized, but not too large to cause swap thrashing and a delayed OOM kill if a fast memory leak happens.
Another rule of thumb is that performance degradation due to the active working set spilling into the swap is exponential - 0.1% excess causes 2x degradation, 1% - 10x degradation, 10% - 100x degradation (assuming 10^3 difference in latency between RAM and SSD).
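The arithmetic behind those numbers, assuming a uniform access pattern and the stated 10^3 latency ratio: if a fraction f of accesses fall into swap, the average slowdown is roughly

    slowdown ≈ (1 - f) + f × (t_SSD / t_RAM) ≈ 1 + 1000f

so f = 0.1% gives ≈2x, f = 1% gives ≈11x, and f = 10% gives ≈101x - in line with the rule of thumb above.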
I would approach the issue from the other direction. Start by buying enough RAM to contain the active working set for the foreseeable future. Afterward, you can start experimenting with different swap sizes (swapfiles are easier to resize, and they perform exactly as well as swap partitions!) to see how many inactive anonymous pages you can safely swap out. If you can swap out several gigabytes, that's a bonus! But don't take that for granted. Always be prepared to move everything back into RAM when needed.
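A sketch of that experiment with a swapfile (sizes are illustrative; note that on btrfs a swapfile needs extra setup, such as disabling copy-on-write):

    # Create and enable a 4 GiB swapfile.
    sudo fallocate -l 4G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile

    # Resizing later: take it offline, grow it, re-initialize, re-enable.
    sudo swapoff /swapfile
    sudo fallocate -l 8G /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile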
I am testing a distributed database-like system at work that makes heavy use of swap. At startup, we read a table from S3 and compute a recursive materialized view over it. This needs about 4TB of “memory” per node while computing, which we provide as 512GB of RAM + 3900GB of NVMe zswap-enabled swap devices. Once the computation is complete, we’re left with a much smaller working-set index (about 400GB) we use to serve queries. For this use case, swap serves as a performant and less labor-intensive alternative to manually spilling the computation to disk in application code (although there is some mlock going on; it’s not entirely automatic). This is like a very extreme version of the initialization-only pages idea discussed in the article.
The warm-up computation does take about 1/4 the time if it can live entirely in RAM, but using NVMe as “discount RAM” reduces the United States dollar cost of the system by 97% compared to RAM-only.
The problem with heavy swapping on NVMe (or other flash memory) is that it wears out the flash storage very quickly, even for seemingly "reasonable" workloads. In a way, the high performance of NVMe can work against you. Definitely something you want to check out via SMART or similar wearout stats.
Let’s say, hypothetically, we’re spending $1 million on hardware with the swap setup.
At that price point, either we use swap and let the kernel engineers move data from RAM to disk and back, or we disable swap and need user-space code to move the same data to disk and back. We’d need to price out writing & maintaining the user-space implementation (mmap perhaps?) for it to be a fair price comparison.
To avoid SSD wear and tear, we could spend $29 million a year more to put the data in RAM only. Not worth!
(We rent EC2 instances from AWS, so SSD wear is baked into the pricing)
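While what you stated is overall not true, who cares with a 97% cost savings vs RAM? Just pop in another NVMe when one fails.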
Not an issue for the commenter – since they have mentioned S3, they are either using AWS EBS or instance-attached scratch NVMes, which the vendor (AWS) takes care of.
The AWS control plane will detect an ailing SSD backing the EBS volume and will proactively evacuate the data before the physical storage goes pear-shaped.
If it is an EC2 instance with an instance attached NVMe, the control plane will issue an alert that can be automatically acted upon, and the instance can be bounced with a new EC2 instance allocated from a pool of the same instance type and get a new NVMe. Provided, of course, the design and implementation of the running system are stateless and can rebuild the working set upon a restart.
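Have you considered having one box with 4TB of RAM to do the computation, then sending it around to all the other nodes?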
Each node handles an independent ~4TB shard of data in horizontal scale-out fashion. Perhaps we could try some complex shenanigans where we rent 4TB RAM nodes, compute, send to 512GB RAM nodes then terminate the 4TB nodes but that’s a bunch of extra complexity for not much of a win.
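What's the reduction of cost measured in Euros though?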
As I understand this article, swap is useful for cases where many long-lived programs (daemons) allocate a lot of memory, but almost never access it. But wouldn't it be better to avoid writing such programs? And how much memory can such daemons consume? A couple of hundred megabytes total? Is it really that much on modern systems?
My experience with swap shows that it only makes things worse. When I program, my application may sometimes allocate a lot of memory due to some silly bug. In such a case the whole system practically stops working - even the mouse cursor can't move. If I am lucky, the OOM killer will eventually kill my buggy program, but after that it's not over - almost all used memory is now in swap and the whole system works snail-slow, presumably because the kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.
In a hypothetical case without swap, this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program and all other programs just continue working as before.
I think that overall reliance on swap is nowadays just a legacy of old times when main memory was scarce, and back then it maybe was useful to have swap. OS kernels should be redesigned to work without swap; this will make system behavior smoother, and kernel code may be simpler (all this swapping code may be removed) and thus faster.
You may benefit by reducing your swap size significantly.
The old rule of thumb of 1-2x your ram is way too much for most systems. The solution isn't to turn it off, but to have a sensible limit. Try with half a gig of swap and see how that does. It may give you time to notice the system is degraded and pick something to kill yourself and maybe even debug the memory issue if needed. You're not likely to have lasting performance issues from too many things swapped out after you or the OOM killer end the memory pressure, because not much of your memory will fit in swap.
> As I understand this article, swap is useful for cases where many long-lived programs (daemons) allocate a lot of memory, but almost never access it. But wouldn't it be better to avoid writing such programs?
Ideally yes, but is that something you keep in mind when you write software? Do you ever consider freeing memory just because it hasn't been used in a while? How do you decide when to free it? This is all handled automatically when you have swap enabled, and at a much finer granularity than you could practically implement by hand.
I write mostly C++ or Rust programs. In these languages memory is freed as soon as it's no longer in use (thanks to destructors). So, usually this shouldn't need to be actively kept in mind. The only exceptions are cases like caches, but long-running programs should use caching carefully - limit cache size and free cache entries after some amount of time.
Programs which allocate large amounts of memory without strict necessity are just a consequence of swap's existence. "Thanks" to swap they were never properly tested in low-memory conditions, and thus the necessary optimizations were never done.
You'll also need to consider that the allocator you're using may not immediately return freed memory to the system. That memory is free to be reused by your application, but it still counts as memory mapped into your process.
Anyway, it's easy to discuss best practices but people actually following them is the actual issue. If you disable swap and the software you're running isn't optimized to minimize idle memory usage then your system will be forced to keep all of that data in RAM.
You are both confusing swap and the memory overcommit policy. You can disable swap by compiling the kernel with `CONFIG_SWAP=n`, but it won't change the memory overcommit policy, and programs would still be able to allocate more memory than is available on the system. There is no problem in allocating virtual memory - if it isn't used, it never gets mapped to physical memory. The problem is when a program tries to use more memory than the system has, and you will get OOMs even with swap disabled. You can disable memory overcommit, but this is only going to result in malloc() failing early while you still have tons of memory.
A side note: stack memory is usually not physically returned to the OS. When (de)allocating on the stack, only the stack pointer is moved within the pages preallocated by the OS.
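To see the distinction in practice, a sketch using the standard sysctls and /proc/meminfo fields:

    # Mode 2 = strict accounting: malloc() starts failing once CommitLimit
    # is reached, even if RAM is still full of reclaimable page cache.
    sudo sysctl vm.overcommit_memory=2
    sudo sysctl vm.overcommit_ratio=80   # CommitLimit = swap + 80% of RAM
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo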
> In these languages memory is freed as soon as it's no longer in use (thanks to destructors).
Unless you have an almost pathological attention to detail, that is not true at all. And even if you do precisely scope your destructors, the underlying allocator won't return the memory to the OS (what matters here) immediately.
> Programs which allocate large amounts of memory without strict necessity are just a consequence of swap's existence. "Thanks" to swap they were never properly tested in low-memory conditions, and thus the necessary optimizations were never done.
Who told you this? It's not remotely true.
Here's an article about this subject that you might want to read:
https://chrisdown.name/2018/01/02/in-defence-of-swap.html
And were you aware that freeing memory only allows it to be reallocated within your process but doesn't actually release it from your process? State-of-the-art general-purpose allocators are actually still kind of shit.
> In a hypothetical case without swap, this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program
That's not how it works in practice. What happens is that program pages (and read-only data pages) get gradually evicted from memory and the system still slows to a crawl (to the point where it becomes practically unresponsive) because every access to program text outside the current 4KB page now potentially involves a swap-in. Sure, eventually, the memory-hungry task will either complete successfully or the OOM killer will be called, but that doesn't help you if you care about responsiveness first and foremost (and in practice, desktop users do care about that - especially when they're trying to terminate that memory hog).
Why not just always preserve program code in memory? It's usually not that much - a typical executable is several megabytes in size, and many processes can share the same code memory pages (especially with shared libraries).
> It's usually not that much - a typical executable is several megabytes in size, and many processes can share the same code memory pages (especially with shared libraries)
Have a look at Chrome. Then have a look at all the Electron "desktop" apps, which all ship with a different Chrome version and different versions of shared libraries, which all can't share memory pages, because they're subtly different. You find similar patterns across many, many other workloads.
Or modern languages, like Rust and Go, which have decided that runtime dependencies are too hard and instead build enormous static binaries for everything.
Programs and shared libraries (pages with VM_EXEC attribute) are kept in the memory if they are actively used (have the "accessed" bit set by the CPU) and are least likely to be evicted.
> Why not just always preserve program code in memory?
Because the code is never required in its entirety – only «currently» active code paths need to be resident in memory, the rest can be discarded when inactive (or never even gets loaded into memory to start off with) and paged back into memory on demand. Since code pages are read only, the inactive code pages can be just dropped without any detriment to the application whilst reducing the app's memory footprint.
> […] a typical executable is several megabytes
Executable size != the size of the actually running code.
In modern operating systems with advanced virtual memory management systems, the actual resident code size can go as low as several kilobytes (or, rather, a handful of pages). This, of course, depends on whether the hot paths in the code have a close affinity to each other in the linked executable.
> But wouldn't it be better to avoid writing such programs?
Think long-term recording applications, such as audio or studio situations where you want to "fire and forget" reliable recording systems of large amounts of data consistently from multiple streams for extended durations, for example.
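Why wouldn't you write that data to disk? Holding it all in RAM isn't exactly a reliable way of storing data.
What do you think is happening with swap, exactly?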
> When I program, my application may sometimes allocate a lot of memory due to some silly bug
I had one of those cases a few years ago when a program I was working on was leaking 12 MP raw image buffers in a drawing routine. I set it off running and went browsing HN/chatting with friends. A few minutes later I was like "this process is definitely taking too long" and when I went to check on it, it was using up 200+ GB of RAM (on a 16 GB machine) which had all gone to swap.
I hadn't noticed a thing! Modern SSDs are truly a marvel... (this was also on macOS rather than Linux, which may have a better swap implementation for desktop purposes)
> When I program, my application may sometimes allocate a lot of memory due to some silly bug. In such a case the whole system practically stops working [...]
You can limit resource usage per process, so your buggy application could be killed long before the system grinds to a crawl. See your shell's entry on its limit/ulimit built-in, or use:
man prlimit(1) - get and set process resource limits
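For example (limits here are illustrative; capping the address space makes a runaway allocation fail with ENOMEM instead of dragging the whole machine into swap):

    # Shell built-in: cap virtual memory at 4 GiB (value in KiB) for this
    # shell and everything launched from it.
    ulimit -v 4194304
    ./my-leaky-program    # hypothetical buggy program

    # Or adjust an already-running process by PID (value in bytes):
    prlimit --pid 12345 --as=4294967296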
Programs run from program text, program text is mapped in as named pages (disk cache). They are evictable! And without swap, they will get evicted on high memory pressure. Program text thrashing is worse than having swap.
The problem is not the existence of swap, but that people are unaware that the disk cache is equally important for performance.
VM_EXEC pages are explicitly deprioritized from the reclaim by the kernel. Unlike any other pages, they are put into the active LRU on the first use and remain in the active LRU if they are active.
Loading program code from disk on demand is yet more legacy cruft. Nowadays it's easier to just load the whole executable into memory and keep it there.
Kinda, basically. Swap is a cost optimization for "bad" programs.
Having more RAM is always better performance, but swap allows you to skimp out on RAM in certain cases for almost identical performance but lower cost (of buying more RAM), if you run programs that allocate a lot of memory that it subsequently doesn't use. I hear Java is notoriously bad at this, so if you run a lot of heavy enterprise Java software, swap can get you the same performance with half the RAM.
(It is also a "GC strategy", or stopgap for memory leaks. Rather than managing memory, you "could" just never free memory, and allocate a fat blob of swap and let the kernel swap it out.)
> But wouldn't it be better to avoid writing such programs?
Yes, indeed, the world would be a better place if we had just stopped writing Java 20 years ago.
> And how much memory can such daemons consume? A couple of hundred megabytes total?
Consider the average Java or .net enterprise programmer, who spends his entire career gluing together third-party dependencies without ever understanding what he's doing: Your executable is a couple hundred megabytes already, then you recursively initialize all the AbstractFactorySingletonFactorySingletonFactories with all their dependencies monkey patched with something worse for compliance reasons, and soon your program spends 90 seconds simply booting up and sits at two or three dozen gigabytes of memory consumption before it has served its first request.
> Is it really that much on modern systems?
If each of your Java/.net business app VMs needs 50 or so gigabytes to run smoothly, you can only squeeze ten of them into a 1U pizza box with a mere half terabyte of RAM; while modern servers allow you to cram in multiple terabytes, do you really want to spend several tens of thousands of dollars on extra RAM when swap storage is basically free?
Cloud providers do the same math, and if you look at e.g. AWS, swap on EBS costs as much per month as the same amount of RAM costs per hour. That's almost three orders of magnitude cheaper.
> When I program, my application may sometimes allocate a lot of memory due to some silly bug.
Yeah, that's on you. Many, many mechanisms let you limit per-process memory consumption.
But as TFA tries to explain, dealing with this situation is not the purpose of swap, and never has been. This is a pathological edge case.
> almost all used memory is now in swap and the whole system works snail-slow, presumably because the kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.
This requires multiple conditions to be met:
- the broken program is allocating a lot of RAM, but not quickly enough to trigger the OOM killer before everything has been swapped out
- you have a lot of swap (do you follow the 1990s recommendation of having 1-2x the RAM amount as swap?)
- the broken program sits in the same cgroup as all the programs you want to keep working even in an OOM situation
Condition 1 can't really be controlled, since it's a bug anyway.
Condition 2 doesn't have to be met unless you explicitly want it to. Why do you?
Condition 3 is realistically always met on desktop environments: despite years of messing around with flatpaks and snaps and all that nonsense, they're still not making it easy for users to isolate the programs they run that haven't been pre-containerized.
But simply reducing swap to a more realistic size (try 4GB, see how far it gets you) will make this problem much less dramatic, as only parts of the RAM have to get flushed back.
> In a hypothetical case without swap, this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program and all other programs just continue working as before.
And now you're wasting RAM that could be used for caching file I/O. Have you benchmarked how much time you're wasting through that?
> I think that overall reliance on swap is nowadays just a legacy of old times when main memory was scarce, and back then it maybe was useful to have swap.
No, you just still don't understand the purpose of swap.
Also, "old times"? You mean today? Because we still have embedded environments, we have containers, we have VMs, almost all software not running on a desktop is running in strict memory constraints.
> and kernel code may be simpler (all this swapping code may be removed)
So you want to remove all code for file caching? Bold strategy.
Swapping (or, rather, paging – I don't think there is an operating system in existence today that swaps out entire processes) does not make modern systems slower – that is a delusion and an urban legend that originated in the sewers of the intertubes, based on uninformed opinion rather than an understanding of how virtual memory systems work. It has been regurgitated to death, and the article explains really well why it is a delusion.
20-30 years ago, heavy paging often crippled consumer Intel-based PCs[0] because paging went to slow mechanical hard disks on PATA/IDE, a parallel device bus (until circa 2005) which had little parallelism and initially no native command queuing; SCSI drives did offer features such as tagged command queuing and efficient scatter-gather, but were uncommon on desktops, let alone laptops. Today the bottlenecks are largely gone – abundant RAM, switched interconnects such as PCIe, SATA with NCQ/AHCI, and solid-state storage, especially NVMe, provide low-latency, highly parallel I/O – so paging still signals memory pressure yet is far less punishing on modern laptops and desktops.
Swap space today has a quieter benefit: lower energy use. On systems with LPDDR4/LPDDR5, the memory controller can place inactive banks into low-power or deep power-down states; by compressing memory and paging out cold, dirty pages to swap, the OS reduces the number of banks that must stay active, cutting DRAM refresh and background power. macOS on Apple Silicon is notably aggressive with memory compression and swap and works closely with the SoC power manager, which can contribute to the strong battery life of Apple laptops compared with competitors, albeit this is only one factor amongst several.
[0] RISC workstations and servers have had switched interconnects since day 1.
In my humble experience, if you run out of memory in Linux you are f... up, irrespective of swap present and/or OOM getting in.
On the other side, a Raspberry Pi froze unexpectedly (not due to low memory) until a very small swap file was enabled. It was almost never used, but the freezes stopped. Fun swap stories.
> There's also a lot of misunderstanding about the purpose of swap – many people just see it as a kind of "slow extra memory" for use in emergencies, but don't understand how it can contribute during normal load to the healthy operation of an operating system as a whole.
That's the long-standing defect that needs to be corrected then, there should be no dependence on swap existing whatsoever as long as you have more than enough memory for the entire workload.
The way I learned it, swap is basically the inverse of file caching: in much the way that extra memory can be used to cache more frequently-used files, then evicted when a "better" use for that memory comes around; swap can be used to save rarely-used anonymous memory so that you can "evict" them when there are other things you'd rather have in memory, then pull them back into memory if they ever become relevant again.
Or to look at it from another perspective... it lets you reclaim unprovable memory leaks.
The author puts the abstract idea of "page reclamation" ahead of the things people actually want - performance, reliability, and controllable service degradation - because he believes it is the one and only solution to them, and then defends swap because it is good for reclamation.
No, this is just plain wrong. There are very specific problems which happen when there is not enough memory.
1. File-backed page reads causing more disk reads, eventually ending with "programs being executed from disk" (shared libraries are also mmapped), which feels like a system lockup. This does not need any "egalitarian reclamation" abstraction or swap, and swap does not solve it. It can be solved simply by reserving some minimal amount of memory for buf/cache, with which the system stays responsive.
2. Eventually, failure to allocate more memory for some process. Any solution like "page reclamation" that pushes unused pages to some swap can only increase the maximum amount of memory that can be used before this happens, from one finite value to a bigger finite value. When there is no memory to free without losing data, some process must be killed. Swap does not solve this. The least bad solution would be to warn the user in advance and let them choose processes to kill.
Neither executables nor shared libraries are going to be evicted if they are in active use and have the "accessed" bit set in their page tables. This code has been present in the kernel mm/vmscan.c at least since 2012.
This wasn't on Linux, but on one of the old-school commercial Unixes - a customer had memory leaks in some of their daemon processes. They couldn't fix them for some reason.
So they invested in additional swap space, let the processes slowly grow, swap out leaked stuff and restart them all over the weekend...
I wrote a chat server back in the 2000s which would gradually use more and more memory over a period of months. After extensive debugging, I couldn't find any memory leak and concluded the problem was likely inside glibc or caused by memory fragmentation. Solution was to have a cron job that ran every 3 months and rebooted the machine.
> 6. Disabling swap doesn't prevent pathological behaviour at near-OOM, although it's true that having swap may prolong it. Whether the global OOM killer is invoked with or without swap, or was invoked sooner or later, the result is the same: you are left with a system in an unpredictable state. Having no swap doesn't avoid this.
This is the most important reason I try to avoid having a large swap. The duration of pathological behavior at near-OOM is proportional to the amount of swap you have. The sooner your program is killed, the sooner your monitoring system can detect it ("Connection refused" is much more clear cut than random latency spikes) and reboot/reprovision the faulty server. We no longer live in a world where we need to keep a particular server online at all cost. When you have an army of servers, a dead server is preferable to a misbehaving server.
OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention. This does not match my experience. It takes ages even to log in to a machine that is thrashing hard, let alone run any serious commands on it. The sooner you just let it crash, the sooner you can restore the system to a working state and inspect the logs in a more comfortable environment.
That assumes the OOM killer kills the right thing. It may well choose to kill something ancillary, which causes your OOM program to just hang or misbehave wildly.
The real danger in all of this, swap or no, is the shitty OOMKiller in Linux.
The OOM killer will be just as shitty whether you have swap or not. But the more swap you have, the longer your program will be allowed to misbehave. I prefer a quick and painless death.
> OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention.
I didn't get that impression. My read was that OP was arguing for user-space process killers so the system doesn't get to the point where the system becomes unresponsive due to thrashing.
> With swap: ... We have more visibility into the instigators of memory pressure and can act on them more reasonably, and can perform a controlled intervention.
But of course if you're doing this kind of monitoring, you can probably just check your processes' memory usage and curb them long before they touch swap.
Maybe I'm just insane, but if I'm on a machine with ample memory, and a process for some reason can't allocate resources, I want that process to fail ASAP. Same thing with high memory pressure situations, just kill greedy/hungry processes, please.
Like something is going very wrong if the system is in that state, so I want everything to die immediately.
sysctl vm.overcommit_memory=2. However, programs for *nix-based systems usually expect overcommit to be on, for example to support fork(). This is a stark contrast with the Windows NT model, where an allocation will fail if it doesn't fit in the remaining memory+swap.
People disable memory overcommit, expecting to fix OOMs, and then they get surprised when their programs start failing mallocs while there are still tons of discardable page cache in the system.
I enable oom_kill on sysrq, so I can hit alt+sysrq+f to invoke the OOM killer; in /etc/sysctl.d/10-magic-sysrq.conf I have `kernel.sysrq = 240` (ie. 128+64+32+16, 64 being the bit that enables signalling processes, which covers the f key).
Welcome to the wonderful world of Java programs. When your tomcat abomination pulls in 500 dependencies for one method call each, and 80% of the methods aren't even called in regular use except to perform dependency injection mumbo jumbo during the 90 seconds your tomcat needs to start up, you easily end up with 70% of your application's anon pages being completely useless, but if you can't banish them to swap, they'll prevent the code on the hot path from having any memory left over for file caching.
So even if you never run into OOM situations, adding a couple gigabytes of swap lets you free up that many gigabytes of RAM for file caching, and suddenly your application is on average 5x faster - but takes 3 seconds longer to service that one obscure API call that needs to dig all those pages back up. YMMV if you prefer consistently poor performance over inconsistent but usually much better performance.
Java's performance for cold code is bad period. This doesn't really have to do with code being paged out (that very rarely happens) but due to the JIT compiler not having warmed up the appropriate execution paths so that it runs in interpreted mode, often made worse as static object initialization happening when the first code that needs that particular class runs, and if you're unlucky with how the system was designed that may introduce cascading class initialization.
Though any halfway competent Java developer following modern best practices will know to build systems that don't have these characteristics.
I've always created swap of 1.5x-4x RAM size on every Linux computer I've had to manage and never had any issues with it. That's the rule I learned many years ago, follow to this day, and will keep following.
Worst case, I left 5% of my SSD unused, which will actually be used for garbage collection and other stuff. That's OK.
What I don't understand is why modern Linux is so shy of touching swap. With old kernels, Linux happily pushed unused pages to a swap, so even if you don't eat memory, your swap will be filled with tens or hundreds MB of memory and that's a great thing. Modern kernel just keeps swap usage at 0, until memory is exhausted.
> What I don't understand is why modern Linux is so shy of touching swap. With old kernels, Linux happily pushed unused pages to a swap, so even if you don't eat memory, your swap will be filled with tens or hundreds MB of memory and that's a great thing. Modern kernel just keeps swap usage at 0, until memory is exhausted.
On my desktop system, most of my problems with swap come from dealing with the aftermath of an out-of-control process eating all my RAM. In this case, the offending program demands memory so quickly that everything from legitimate programs gets swapped out. These programs proceed to run poorly for the next several minutes to an hour depending on usage, since the OS only swaps pages back in once they are referenced, even if there is plenty of free space not even being used in the disk cache.
Eventually I wrote a small script that does the equivalent of "sudo swapoff -a && sudo swapon -a" to eagerly flush everything to RAM, but I was surprised by how many people seemed to think there's no legitimate reason to ever want to do so.
That works if there is enough memory after the "bad" process has been killed. The question is, is it necessary? Many systems can live with processes performing a little bit poorly for some minutes and I wouldn't do it.
It's fine that "many systems" can. But there is no easy way when the user or system can't. Flushing back to RAM is slow - that's not controversial. So it would help if there was a way to do this in advance of the need for the programs where that matters.
You mean like vmtouch and madvise?
I use vmtouch all the time to preload or even lock certain data/code into RAM.
> The question is, is it necessary? Many systems can live with processes performing a little bit poorly for some minutes and I wouldn't do it.
The outage ain't resolved until things are back to operating normally.
If things aren't back to 100% healthy, could be I didn't truly find the root cause of the problem - in which case I'll probably be woken up again in 30 minutes when the problem comes back.
The article has not mentioned memory compression as an alternative to swap which many Linux distributions enable by default.
On the other hand these days latest SSD are way faster than memory compression even with LUKS encryption on and even when compression uses LZ4 compression. Plus modern SSDs do not suffer from frequent writes as before so on my laptop I disabled the memory compression and then all reasoning from the article applies again.
Then on a development laptop running compilations/containers/VMs/browser vm.swappines does not seems matter that much if one has enough memory. So I no longer tune it to 100 or more and leave at the default 60%.
> these days latest SSD are way faster than memory compression
That's a really provocative claim. Any benchmarks to support this?
I wish people would actually read TFA instead of reflexively repeating nonsensical folk remedies.
I've been telling people about this since the days when there were operating systems still around that actually did swapping (16-bit OS/2, old Unix, Standard Mode DOS+Windows) rather than paging (32-bit OS/2, 386 Enhanced Mode DOS+Windows, Windows NT). I wrote a Frequently Given Answer about it in 2007, I had had to repeat the point so many times since the middle 1990s; and I was far from alone even then.
* http://jdebp.uk./FGA/dont-throw-those-paging-files-away.html
The erroneous folk wisdom is widespread. It often seems to lack any mention of the concepts of a resident set and a working set, and is always mixed in with a wishful thinking idea that somehow "new" computers obviate this, when the basic principles of demand paging are the same as they were four decades ago, Parkinson's Law can still be observed operating in the world of computers, and the "new" computers all of those years ago didn't manage to obviate paging files either.
The swapfile.sys in Windows 8+ is used for process swapping (moving the entire private working set out of memory to disk), but only for UWP applications.
Recognition that older linux swap strategies were unhelpful sometimes, which this piece of writing does, validates out past sense it wasn't working well. Regaining trust takes time.
Sometimes I think if backing store and swap were more clearly delineated we might have got to decent algorithms sooner. Having a huge amount of swap pre-emptively claimed was making it look like starvation, when it was just a runtime planning strategy. It's also confusing how top and vmstat report things.
Also, as a BSD mainly person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD.
Ancient model: twice as much swap as memory
Old model: same amount of swap as memory
New model: amount of swap your experience tells you this job mix demands to manage memory pressure fairly, which is a bit of a tall ask sometimes, but basically pick a number up to memory size.
> Also, as a BSD mainly person, I think the differences stand out. I haven't noticed an OOM killer approach on BSD.
BSD allocators simply return errors if no more memory is available; for backwards compatibility reasons Linux is stuck with a fatally flawed API that doesn't.
You can trivially disable overcommit on Linux (vm.overcommit_memory=2) to get allocation failures instead of OOMs. But you will find yourself spending a lot more money on RAM :)
And debug many tools which still ignore the fact that malloc could fail.
I assumed the same, but just discovered that FreeBSD has vm.overcommit too. But I'm not sure about its working.
Overcommit is subtle. If you allocate a bunch of address space and don't touch it, that's one thing.
If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatifyable page fault later.
My understanding (which could very well be wrong) is Linux overcommit will continue to allocate address space when asked regardless of memory pressure; but FreeBSD overcommit will refuse allocations when there's too much memory pressure.
I'm pretty sure I've seen FreeBSD's OOM killer, but it needs a specific pattern of memory use, it's much more likely for an application to get a failed allocation and exit, freeing memory; than for all the applications to have unused allocations that they then use.
All that said, I prefer to run with a small swap, somewhere around 0.5-2GB. Memory pressure is hard to measure (although recent linux has a measure that I haven't used), but swap % and swap i/o are easy to measure. If your swap grows quickly, you might not have time to do any operations to fix it, but your stats should tell the tale. If your swap grows slowly enough, you can set thresholds and analyze the situation. If you have a lot of swap i/o that provides a measure of urgency.
For modern Linux servers with large amounts of RAM, my rule of thumb is between 1/8 and 1/32 of RAM, depending on what the machine is for.
For example, one of my database servers has 128GB of RAM and 8GB of swap. It tends to stabilize around 108GB of RAM and 5GB of swap usage under normal load, so I know that a 4GB swap would have been less than optimal. A larger swap would have been a waste as well.
I no longer use disk swap for servers, instead opting for Zram with a maximum is 50% of RAM capacity and a high swapiness value.
It'd be cool if Zram could apply to the RAM itself (like macOS) rather than needing a fake swap device.
Lookie lookie! Isn't it spooky?
https://github.com/CachyOS/CachyOS-Settings/blob/master/usr/...
Resulting in https://i.postimg.cc/hP37vvpJ/screenieshottie.png
Good enough...
Yeh. I haven't yet figured out how to get zram to apply transparently to containers though, anything in another memory cgroup will never get compressed unless swap is explicitly exposed to it.
zswap
https://docs.kernel.org/admin-guide/mm/zswap.html
The cgroup accounting also now works in zswap.
Zswap requires a backing disk swap, Zram does not.
The proper rule of thumb is to make the swap large enough to keep all inactive anonymous pages after the workload has stabilized, but not too large to cause swap thrashing and a delayed OOM kill if a fast memory leak happens.
Another rule of thumb is that performance degradation due to the active working set spilling into the swap is exponential - 0.1% excess causes 2x degradation, 1% - 10x degradation, 10% - 100x degradation (assuming 10^3 difference in latency between RAM and SSD).
I would approach the issue from the other direction. Start by buying enough RAM to contain the active working set for the foreseeable future. Afterward, you can start experimenting with different swap sizes (swapfiles are easier to resize, and they perform exactly as well as swap partitions!) to see how many inactive anonymous pages you can safely swap out. If you can swap out several gigabytes, that's a bonus! But don't take that for granted. Always be prepared to move everything back into RAM when needed.
I am testing a distributed database-like system at work that makes heavy use of swap. At startup, we read a table from S3 and compute a recursive materialized view over it. This needs about 4TB of “memory” per node while computing, which we provide as 512gb of RAM + 3900GB of NVMe zswap enabled swap devices. Once the computation is complete, we’re left with a much smaller working set index (about 400gb) we use to serve queries. For this use-case, swap serves as a performant and less labor intensive approach to manually spilling the computation to disk in application code (although there is some mlock going on; it’s not entirely automatic). This is like a very extreme version of the initialization-only pages idea discussed in the articule.
The warm up computation does take like 1/4 the time if it can live entirely in RAM, but using NVMe as “discount RAM” reduces the United States dollar cost of the system by 97% compared to RAM-only.
The problem with heavy swapping on NVMe (or other flash memory) is that it wears out the flash storage very quickly, even for seemingly "reasonable" workloads. In a way, the high performance of NVMe can work against you. Definitely something you want to check out via SMART or similar wearout stats.
Let’s say we’re spending $1 million on hardware hypothetically with the swap setup.
At that price point, either we use swap and let the kernel engineers move data from RAM to disk and back, or we disable swap and need user space code to move the same data to disk and back. We’d need to price out writing & maintaining the user space implementation (mmap perhaps?) for it to be fair price comparison.
To avoid SSD wear and tear, we could spend $29 million a year more to put the data in RAM only. Not worth!
(We rent EC2 instances from AWS, so SSD wear is baked into the pricing)
While what you stated is overall not true, who cares with a 97% cost savings vs RAM? Just pop in another NVMe when one fails.
Not an issue for the commenter – since they have mentioned S3, they are either using AWS EBS or instance attached scratch NVMe's which the vendor (AWS) takes care of.
The AWS control plane will detect an ailing SSD backing up the EBS and will proactively evacuate the data before the physical storage goes pear shaped.
If it is an EC2 instance with an instance attached NVMe, the control plane will issue an alert that can be automatically acted upon, and the instance can be bounced with a new EC2 instance allocated from a pool of the same instance type and get a new NVMe. Provided, of course, the design and implementation of the running system are stateless and can rebuild the working set upon a restart.
Have you considered having one box with 4TB of RAM to do the computation, then sending it around to all the other nodes?
Each node handles an independent ~4TB shard of data in horizontal scale-out fashion. Perhaps we could try some complex shenanigans where we rent 4TB RAM nodes, compute, send to 512GB RAM nodes then terminate the 4TB nodes but that’s a bunch of extra complexity for not much of a win.
What's the reduction of cost measured in Euros though?
As I understand this article, swap is useful for cases where many long-lived programs (daemons) allocate a lot of memory, but almost never access it. But wouldn't it be better to avoid writing such programs? And how many memory such daemons can consume? A couple of hundred megabytes total? Is it really that much on modern systems?
My experience with swap shows, that it only makes things worse. When I program, my application may sometimes allocate a lot of memory due to some silly bug. In such case the whole system practically stops working - even mouse cursor can't move. If I am happy, OOM killer will eventually kill my buggy program, but after that it's not over - almost all used memory is now in swap and the whole system works snail-slow, presumably because kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.
I a hypothetical case without swap this case isn't so painful. When main system memory is almost fully consumed, OOM killer kills the most memory hungry program and all other programs just continue working as before.
I think that overall reliance on swap is noways just a legacy of old times when main memory was scarce and back than it maybe was useful to have swap. OS kernels should be redesigned to work without swap, this will make system behavior smoother and kernel code may be simpler (all this swapping code may be removed) and thus faster.
You may benefit by reducing your swap size significantly.
The old rule of thumb of 1-2x your ram is way too much for most systems. The solution isn't to turn it off, but to have a sensible limit. Try with half a gig of swap and see how that does. It may give you time to notice the system is degraded and pick something to kill yourself and maybe even debug the memory issue if needed. You're not likely to have lasting performance issues from too many things swapped out after you or the OOM killer end the memory pressure, because not much of your memory will fit in swap.
> As I understand this article, swap is useful for cases where many long-lived programs (daemons) allocate a lot of memory, but almost never access it. But wouldn't it be better to avoid writing such programs?
Ideally yes, but is that something you keep in mind when you write software? Do you ever consider freeing memory just because it hasn't been used in a while? How do you decide when to free it? This is all handled automatically when you have swap enabled, and at a granularity that is much higher than you can practically manually implement it.
I write mostly C++ or Rust programs. In these languages memory is freed as soon as it's no longer in use (thanks to destructors). So, usually this shouldn't be actively kept in mind. The only exception are cases like caches, but long-running programs should use caching carefully - limit cache size and free cache entries after some amount of time.
Programs, which allocate large amounts of memory without strict necessity to do so, are just a consequence of swap existence. "Thanks" to swap they weren't properly tested in low-memory conditions and thus no necessary optimization were done.
You'll also need to consider that the allocator you're using may not immediately free memory to the system. That memory is free to be used by your application but considered as used memory mapped to your program.
Anyway, it's easy to discuss best practices but people actually following them is the actual issue. If you disable swap and the software you're running isn't optimized to minimize idle memory usage then your system will be forced to keep all of that data in RAM.
You are both confusing swap and memory overcommit policy. You can disable swap by compiling the kernel with `CONFIG_SWAP=no`, but it won't change the memory overcommit policy, and programs would still be able to allocate more memory than available on the system. There is no problem in allocating the virtual memory - if it isn't used, it never gets mapped to the physical memory. The problem is when a program tries to use more memory than the system has, and you will get OOMs even with the swap disabled. You can disable memory overcommit, but this is only going to result in malloc() failing early while you still have tons of memory.
A side note, stack memories are usually not physically returned to the OS. When (de)allocating on stack, only the stack pointer is moved within the pages preallocated by the OS.
> In these languages memory is freed as soon as it's no longer in use (thanks to destructors).
Unless you have an almost pathological attention to detail, that is not true at all. And even if you do precisely scope your destructors, the underlying allocator won't return the memory to the OS (what matters here) immediately.
> Programs, which allocate large amounts of memory without strict necessity to do so, are just a consequence of swap existence. "Thanks" to swap they weren't properly tested in low-memory conditions and thus no necessary optimization were done.
Who told you this? It's not remotely true.
Here's an article about this subject that you might want to read:
https://chrisdown.name/2018/01/02/in-defence-of-swap.html
And were you aware that freeing memory only allows it to be reallocated within your process but doesn't actually release it from your process? State-of-the-art general-purpose allocators are actually still kind of shit.
> I a hypothetical case without swap this case isn't so painful. When main system memory is almost fully consumed, OOM killer kills the most memory hungry program
That's not how it works in practice. What happens is that program pages (and read-only data pages) get gradually evicted from memory and the system still slows to a crawl (to the point where it becomes practically unresponsive) because every access to program text outside the current 4KB page now potentially involves a swap-in. Sure, eventually, the memory-hungry task will either complete successfully or the OOM killer will be called, but that doesn't help you if you care about responsiveness first and foremost (and in practice, desktop users do care about that - especially when they're trying to terminate that memory hog).
Why not just always preserving program code in memory? It's usually not that much - typical executable is usually several megabytes in size and many processes can share the same code memory pages (especially with shared libraries).
> It's usually not that much - typical executable is usually several megabytes in size and many processes can share the same code memory pages (especially with shared libraries)
Have a look at Chrome. Then have a look at all the Electron "desktop" apps, which all ship with a different Chrome version and different versions of shared libraries, which all can't share memory pages, because they're subtly different. You find similar patterns across many, many other workloads.
Or modern languages, like Rust and Go, which have decided that runtime dependencies are too hard and instead build enormous static binaries for everything.
Programs and shared libraries (pages with VM_EXEC attribute) are kept in the memory if they are actively used (have the "accessed" bit set by the CPU) and are least likely to be evicted.
> Why not just always preserving program code in memory?
Because the code is never required in its entirety – only «currently» active code paths need to be resident in memory, the rest can be discarded when inactive (or never even gets loaded into memory to start off with) and paged back into memory on demand. Since code pages are read only, the inactive code pages can be just dropped without any detriment to the application whilst reducing the app's memory footprint.
> […] typical executable is usually several megabytes
Executable size != the size of the actually running code.
In modern operating systems with advanced virtual memory management systems, the actual resident code size can go as low as several kilobytes (or, rather, a handful of pages). This, of course, depends on whether the hot paths in the code have a close affinity to each other in the linked executable.
> But wouldn't it be better to avoid writing such programs?
Think long-term recording applications, such as audio or studio situations where you want to "fire and forget" reliable recording systems of large amounts of data consistently from multiple streams for extended durations, for example.
Why wouldn't you write that data to disk? Holding it all in RAM isn't exactly a reliable way of storing data.
What do you think is happening with swap, exactly?
> When I program, my application may sometimes allocate a lot of memory due to some silly bug
I had one of those cases a few years ago when a program I was working on was leaking 12 MP raw image buffers in a drawing routine. I set it off running and went browsing HN/chatting with friends. A few minutes later I was like "this process is definitely taking too long" and when I went to check on it, it was using up 200+ GB of RAM (on a 16 GB machine) which had all gone to swap.
I hadn't noticed a thing! Modern SSDs are truly a marvel... (this was also on macOS rather than Linux, which may have a better swap implementation for desktop purposes)
> When I program, my application may sometimes allocate a lot of memory due to some silly bug. In such case the whole system practically stops working [...]
You can limit resource usage per process, so a buggy application gets killed long before the system slows to a crawl. See your shell's entry on its limit/ulimit built-in, or use prlimit(1) to get and set process resource limits.
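A minimal sketch (the limits, program name, and PID here are illustrative):

    # cap this shell and its children at ~4 GiB of virtual address space (ulimit -v takes KiB)
    ulimit -v 4194304
    ./my-leaky-program    # allocations past the cap now fail instead of stalling the system

    # or clamp an already-running process by PID (prlimit --as takes bytes)
    sudo prlimit --pid 12345 --as=4294967296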
Programs run from program text, program text is mapped in as named pages (disk cache). They are evictable! And without swap, they will get evicted on high memory pressure. Program text thrashing is worse than having swap.
The problem is not the existence of swap, but that people are unaware that the disk cache is equally important for performance.
VM_EXEC pages are explicitly deprioritized for reclaim by the kernel. Unlike other pages, they are put on the active LRU on first use and stay there for as long as they remain referenced.
Loading program code from disk on demand is just more legacy cruft. Nowadays it would be easier to load the whole executable into memory and keep it there.
Kinda, basically. Swap is a cost optimization for "bad" programs.
Having more RAM is always better performance, but swap allows you to skimp out on RAM in certain cases for almost identical performance but lower cost (of buying more RAM), if you run programs that allocate a lot of memory that it subsequently doesn't use. I hear Java is notoriously bad at this, so if you run a lot of heavy enterprise Java software, swap can get you the same performance with half the RAM.
(It is also a "GC strategy", or stopgap for memory leaks. Rather than managing memory, you "could" just never free memory, and allocate a fat blob of swap and let the kernel swap it out.)
> But wouldn't it be better to avoid writing such programs?
Yes, indeed, the world would be a better place if we had just stopped writing Java 20 years ago.
> And how much memory can such daemons consume? A couple of hundred megabytes total?
Consider the average Java or .net enterprise programmer, who spends his entire career gluing together third-party dependencies without ever understanding what he's doing: Your executable is a couple hundred megabytes already, then you recursively initialize all the AbstractFactorySingletonFactorySingletonFactories with all their dependencies monkey patched with something worse for compliance reasons, and soon your program spends 90 seconds simply booting up and sits at two or three dozen gigabytes of memory consumption before it has served its first request.
> Is it really that much on modern systems?
If each of your Java/.net business app VMs needs 50 or so gigabytes to run smoothly, you can only squeeze ten of them into a 1U pizza box with a mere half terabyte of RAM; while modern servers allow you to cram in multiple terabytes, do you really want to spend several tens of thousands of dollars on extra RAM, when swap storage is basically free?
Cloud providers do the same math, and if you look at e.g. AWS, swap on EBS costs as much per month as the same amount of RAM costs per hour. That's almost three orders of magnitude cheaper.
> When I program, my application may sometimes allocate a lot of memory due to some silly bug.
Yeah, that's on you. Many, many mechanisms let you limit per-process memory consumption.
But as TFA tries to explain, dealing with this situation is not the purpose of swap, and never has been. This is a pathological edge case.
> almost all used memory is now in swap and the whole system works snail-slow, presumably because the kernel doesn't think it should really unswap previously swapped memory and does so only on demand, page by page.
This requires multiple conditions to be met:
- the broken program is allocating a lot of RAM, but not quickly enough to trigger the OOM killer before everything has been swapped out
- you have a lot of swap (do you follow the 1990s recommendation of having 1-2x the RAM amount as swap?)
- the broken program sits in the same cgroup as all the programs you want to keep working even in an OOM situation
Condition 1 can't really be controlled, since it's a bug anyway.
Condition 2 doesn't have to be met unless you explicitly want it to. Why do you?
Condition 3 is realistic on desktop environments: despite years of messing around with flatpaks and snaps and all that nonsense, they're still not making it easy for users to isolate programs that haven't been pre-containerized.
But simply reducing swap to a more realistic size (try 4GB, see how far it gets you) will make this problem much less dramatic, as only parts of the RAM have to get flushed back.
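If you want to try that, a swap file is easy to add and resize later; something along these lines (size and path illustrative; dd rather than fallocate, since fallocate-created swap files aren't supported on every filesystem):

    sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab   # persist across reboots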
> In a hypothetical case without swap this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program and all other programs just continue working as before.
And now you're wasting RAM that could be used for caching file I/O. Have you benchmarked how much time you're wasting through that?
> I think that overall reliance on swap is nowadays just a legacy of old times when main memory was scarce, and back then it maybe was useful to have swap.
No, you just still don't understand the purpose of swap.
Also, "old times"? You mean today? Because we still have embedded environments, we have containers, we have VMs, almost all software not running on a desktop is running in strict memory constraints.
> and kernel code may be simpler (all this swapping code may be removed)
So you want to remove all code for file caching? Bold strategy.
Swapping (or, rather, paging – I don't think there is an operating system in existence today that swaps out entire processes) does not make modern systems slower – that is a delusion and an urban legend that originated in the sewers of the intertubes, based on uninformed opinion rather than an understanding of how virtual memory systems work. It has been regurgitated to death, and the article explains really well why it is a delusion.
20-30 years ago, heavy paging often crippled consumer Intel-based PCs[0] because paging went to slow mechanical hard disks on PATA/IDE, a parallel device bus (until circa 2005) with little parallelism and initially no native command queuing; SCSI drives did offer features such as tagged command queuing and efficient scatter-gather, but were uncommon on desktops, let alone laptops. Today those bottlenecks are largely gone – abundant RAM, switched interconnects such as PCIe, SATA with NCQ/AHCI, and solid-state storage, especially NVMe, provide low-latency, highly parallel I/O – so paging still signals memory pressure, yet is far less punishing on modern laptops and desktops.
Swap space today has a quieter benefit: lower energy use. On systems with LPDDR4/LPDDR5, the memory controller can place inactive banks into low-power or deep power-down states; by compressing memory and paging out cold, dirty pages to swap, the OS reduces the number of banks that must stay active, cutting DRAM refresh and background power. macOS on Apple Silicon is notably aggressive with memory compression and swap and works closely with the SoC power manager, which can contribute to the strong battery life of Apple laptops compared with competitors, albeit this is only one factor amongst several.
[0] RISC workstations and servers have had switched interconnects since day 1.
In my humble experience, if you run out of memory in Linux you are f... up, irrespective of swap being present and/or the OOM killer stepping in.
On the other hand, a Raspberry Pi of mine froze unexpectedly (not due to low memory) until a very small swap file was enabled. It was almost never used, but the freezes stopped. Fun swap stories.
>There's also a lot of misunderstanding about the purpose of swap – many people just see it as a kind of "slow extra memory" for use in emergencies, but don't understand how it can contribute during normal load to the healthy operation of an operating system as a whole.
That's the long-standing defect that needs to be corrected then, there should be no dependence on swap existing whatsoever as long as you have more than enough memory for the entire workload.
Nothing about zram/zswap? I know that zram is more performant but I wonder how it holds up under high memory pressure compared to zswap.
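For anyone who wants to compare the two, a zram swap device can be set up by hand in a few lines (size and priority illustrative; most distros ship a zram generator that does this for you):

    sudo modprobe zram
    echo 4G | sudo tee /sys/block/zram0/disksize
    sudo mkswap /dev/zram0
    sudo swapon -p 100 /dev/zram0   # higher priority than any disk-backed swap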
The way I learned it, swap is basically the inverse of file caching: in much the way that extra memory can be used to cache more frequently-used files, then evicted when a "better" use for that memory comes around; swap can be used to save rarely-used anonymous memory so that you can "evict" them when there are other things you'd rather have in memory, then pull them back into memory if they ever become relevant again.
Or to look at it from another perspective... it lets you reclaim unprovable memory leaks.
It's crazy to me that even Fedora disables swap to disk by default now. It really speaks to how broadly misunderstood swap is.
The author pushes the abstract idea of "page reclamation" ahead of what people actually want - performance, reliability, and controllable service degradation - because he believes it is the one and only solution to those problems; and then he defends swap because it is good for it.
No, this is just plain wrong. There are very specific problems which happen when there is not enough memory.
1. File-backed page reads causing more disk reads, eventually ending with "programs being executed from disk" (shared libraries are also mmapped), which feels like a system lockup. This does not need any "egalitarian reclamation" abstraction or swap, and swap does not solve it. It can be solved simply by reserving some minimal amount of memory for buf/cache, with which the system stays responsive.
2. Eventual failure to allocate more memory for some process. Solutions like "page reclamation" that push unused pages out to swap can only raise the maximum amount of memory usable before this happens, from one finite value to a bigger finite value. When there is no memory left to free without losing data, some process must be killed. Swap does not solve this. The least bad solution would be to warn the user in advance and let them choose processes to kill.
See also https://github.com/hakavlad/prelockd
Neither executables nor shared libraries are going to be evicted if they are in active use and have the "accessed" bit set in their page tables. This code has been present in the kernel's mm/vmscan.c since at least 2012.
This wasn't on Linux, but on one of the old-school commercial Unixes - a customer had memory leaks in some of their daemon processes. They couldn't fix them for some reason.
So they invested in additional swap space, let the processes slowly grow, swap out leaked stuff and restart them all over the weekend...
I wrote a chat server back in the 2000s which would gradually use more and more memory over a period of months. After extensive debugging, I couldn't find any memory leak and concluded the problem was likely inside glibc or caused by memory fragmentation. Solution was to have a cron job that ran every 3 months and rebooted the machine.
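Presumably something along these lines in root's crontab (schedule illustrative):

    # m h dom mon dow: reboot at 04:00 on the 1st of every third month
    0 4 1 */3 * /sbin/shutdown -r now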
> 6. Disabling swap doesn't prevent pathological behaviour at near-OOM, although it's true that having swap may prolong it. Whether the global OOM killer is invoked with or without swap, or was invoked sooner or later, the result is the same: you are left with a system in an unpredictable state. Having no swap doesn't avoid this.
This is the most important reason I try to avoid having a large swap. The duration of pathological behavior at near-OOM is proportional to the amount of swap you have. The sooner your program is killed, the sooner your monitoring system can detect it ("Connection refused" is much more clear cut than random latency spikes) and reboot/reprovision the faulty server. We no longer live in a world where we need to keep a particular server online at all cost. When you have an army of servers, a dead server is preferable to a misbehaving server.
OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention. This does not match my experience. It takes ages even to log in to a machine that is thrashing hard, let alone run any serious commands on it. The sooner you just let it crash, the sooner you can restore the system to a working state and inspect the logs in a more comfortable environment.
That assumes the OOM killer kills the right thing. It may well choose to kill something ancillary, which causes your OOM program to just hang or misbehave wildly.
The real danger in all of this, swap or no, is the shitty OOMKiller in Linux.
You can apply memory quotas to the individual processes with cgroups. You can also adjust how likely a process is to be killed.
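For example (names and numbers illustrative; assumes cgroup v2 with systemd):

    # run a command in a transient cgroup with a hard memory ceiling and no swap
    systemd-run --scope -p MemoryMax=2G -p MemorySwapMax=0 ./leaky-build.sh

    # bias the OOM killer towards (positive) or away from (negative) a process
    echo 1000 | sudo tee /proc/12345/oom_score_adj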
Nowadays, the OOM killer always chooses the largest process in the system/cgroup by default.
The OOM killer will be just as shitty whether you have swap or not. But the more swap you have, the longer your program will be allowed to misbehave. I prefer a quick and painless death.
> OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention.
I didn't get that impression. My read was that OP was arguing for user-space process killers so the system doesn't get to the point where the system becomes unresponsive due to thrashing.
From the article:
> With swap: ... We have more visibility into the instigators of memory pressure and can act on them more reasonably, and can perform a controlled intervention.
But of course if you're doing this kind of monitoring, you can probably just check your processes' memory usage and curb them long before they touch swap.
Amen to failing fast.
A machine that is responding just enough to keep a circuit breaker closed is the scourge of distributed systems.
Maybe I'm just insane, but if I'm on a machine with ample memory, and a process for some reason can't allocate resources, I want that process to fail ASAP. Same thing with high memory pressure situations, just kill greedy/hungry processes, please.
Like something is going very wrong if the system is in that state, so I want everything to die immediately.
sysctl vm.overcommit_memory=2. However, programs for *nix-based systems usually expect overcommit to be on, for example to support fork(). This is in stark contrast to the Windows NT model, where an allocation fails if it doesn't fit in the remaining memory+swap.
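Roughly like this (the ratio is illustrative):

    # mode 2 = strict accounting: commit limit = swap + overcommit_ratio% of RAM
    sudo sysctl vm.overcommit_memory=2
    sudo sysctl vm.overcommit_ratio=80
    grep -i commit /proc/meminfo    # compare CommitLimit vs Committed_AS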
People disable memory overcommit, expecting to fix OOMs, and then they get surprised when their programs start failing mallocs while there are still tons of discardable page cache in the system.
https://unix.stackexchange.com/q/797835/1027 https://unix.stackexchange.com/q/797841/1027
systemd-oomd does this.
The kernel oom killer is concerned with kernel survival, not user space performance.
`sudo apt-get install earlyoom`
Configure it to fire at like 5% and forget it.
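On Debian-style installs that's something like the following (the config file location may vary by distro):

    # /etc/default/earlyoom
    EARLYOOM_ARGS="-m 5"    # start killing when available memory drops below 5%

    sudo systemctl restart earlyoom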
I've never seen the OOM killer do its dang job, with or without swap.
If you've tried `systemd-oomd`, I'm curious what your thoughts are: https://www.freedesktop.org/software/systemd/man/latest/syst...
In my environment, `systemd-oomd` does nothing with the default settings.
I enable oom_kill on sysrq, so I can hit alt+sysrq+f to invoke the OOM killer; in /etc/sysctl.d/10-magic-sysrq.conf I have `kernel.sysrq = 240` (i.e. 128+64+32+16; the f key falls under bit 64, which enables signalling of processes).
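Applied without a reboot via something like:

    sudo sysctl --system    # re-reads /etc/sysctl.d/*.conf
    # alt+sysrq+f now asks the kernel to run the OOM killer once, immediately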
> Under no/low memory contention
> on some workloads this may represent a non-trivial drop in performance due to stale, anonymous pages taking space away from more important use
WTF?
Welcome to the wonderful world of Java programs. When your tomcat abomination pulls in 500 dependencies for one method call each, and 80% of the methods aren't even called in regular use except to perform dependency injection mumbo jumbo during the 90 seconds your tomcat needs to start up, you easily end up with 70% of your application's anon pages being completely useless, but if you can't banish them to swap, they'll prevent the code on the hot path from having any memory left over for file caching.
So even if you never run into OOM situations, adding a couple gigabytes of swap lets you free up that many gigabytes of RAM for file caching, and suddenly your application is on average 5x faster - but takes 3 seconds longer to service that one obscure API call that needs to dig all those pages back up. YMMV if you prefer consistently poor performance over inconsistent but usually much better performance.
Java's performance for cold code is bad, period. This doesn't really have to do with code being paged out (that very rarely happens) but with the JIT compiler not having warmed up the appropriate execution paths, so the code runs in interpreted mode - often made worse by static object initialization happening when the first code that needs a particular class runs, and if you're unlucky with how the system was designed, that may trigger cascading class initialization.
Though any halfway competent Java developer following modern best practices will know to build systems that don't have these characteristics.
> Though any halfway competent Java developer following modern best practices will know to build systems that don't have these characteristics.
I'll let you know if I ever meet any. Until then, another terabyte of RAM for tomcat.
Java generally performs much better when it isn't given huge amounts of memory to work with.
swap.avi is its own damning defence
I've always created swap of 1.5x-4x RAM size on every Linux computer I've had to manage, and never had any issues with it. That's the rule I learned many years ago, follow to this day, and will keep following.
Worst case: I've left 5% of my SSD unused, which will actually be used for garbage collection and other stuff. That's OK.
What I don't understand is why modern Linux is so shy of touching swap. With old kernels, Linux happily pushed unused pages to a swap, so even if you don't eat memory, your swap will be filled with tens or hundreds MB of memory and that's a great thing. Modern kernel just keeps swap usage at 0, until memory is exhausted.
> What I don't understand is why modern Linux is so shy of touching swap. With old kernels, Linux happily pushed unused pages to a swap, so even if you don't eat memory, your swap will be filled with tens or hundreds MB of memory and that's a great thing. Modern kernel just keeps swap usage at 0, until memory is exhausted.
The article has the answer.
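For what it's worth, the knob that biases this tradeoff is still there; raising it makes the kernel swap out cold anonymous pages more eagerly again (60 is the usual default):

    sudo sysctl vm.swappiness=100
    echo 'vm.swappiness = 100' | sudo tee /etc/sysctl.d/99-swappiness.conf   # persist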
> I've always created swap of 1.5x - 4x RAM size on every Linux computer I've had to manage and never had any issues with it.
That's a couple of terabytes of swap on servers these days, and even on laptops I wouldn't want to deal with 300-ish GB of swap.
I haven't used swap for 15 years. You have to be judicious about heavy app usage with only 16GiB. With 32GiB, I've never triggered OOM.