Hacker News

My feeling on swap is this:

1) If you're ok with one machine dropping out of your system, you don't need swap.

2) You should never build a system where losing a single machine is a problem.

3) Therefore, you should never need swap.

4) Perhaps there is an exception for a desktop machine, since it doesn't fit rule 2.



Tend to agree.

A bit of a side ramble: Unfortunately, sometimes regarding rule 2, you already have a system where losing a single machine is a problem, and it will take time and resources to improve or replace it to the point where losing a single machine isn't a problem, so "in the meantime" you have to accept and support this.

Also, sometimes "the meantime" is very long. :-(

Also, by the time the system is improved to be more resilient, maybe you'll be working somewhere else or on something else, and, presto, you'll uncover some other horrible legacy system in your dependency chain that isn't resilient either. It seems as if at every organization that has had computers for long enough, there is an infinite supply of legacy systems.

Point being: unless you only work with brand-new things that themselves only work with brand-new things, you can't avoid getting decent at managing services that aren't properly "any single machine can disappear" resilient.


Sure, dealing with legacy systems might mean messing with swap.

However, as pointed out elsewhere, if you're hitting swap your performance will be so bad you might as well have lost the machine.


Doesn't that risk cascading failures?

A cluster of a few machines experiences a bunch of requests that trigger pathological memory usage. One machine OOMs, drops out. Now the rest of the cluster has to take more load, needs more memory, and increases the likelihood that the other machines also run out of memory.


> A cluster of a few machines experiences a bunch of requests that trigger pathological memory usage. One machine OOMs, drops out. Now the rest of the cluster has to take more load, needs more memory, and increases the likelihood that the other machines also run out of memory.

A performance cliff (as you'd inevitably see while swapping) also puts you at risk of cascading failure. It might actually be better to completely drop out if the restart time is reasonably low. This is similar to GC thrashing with Java servers: many people prefer to configure their servers to suicide when GC time is over some threshold rather than try to go on as long as possible. I'm one of those people.
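A minimal sketch of that "suicide rather than thrash" policy, in Python for brevity (the function name, sample format, and thresholds are all illustrative; a real JVM deployment would read pause stats from the runtime, e.g. via GC logs):

```python
from collections import deque

def should_restart(pause_history, window=10, max_gc_fraction=0.5):
    """Decide whether to kill the process based on recent GC behavior.

    pause_history: (wall_seconds, gc_seconds) samples, most recent last.
    Restart when GC consumed more than max_gc_fraction of wall time over
    the last `window` samples -- a clean death and fast restart beats
    limping along at a fraction of normal throughput.
    """
    recent = list(pause_history)[-window:]
    wall = sum(w for w, _ in recent)
    gc = sum(g for _, g in recent)
    return wall > 0 and gc / wall > max_gc_fraction
```

The windowing matters: a single long pause shouldn't trigger a restart, but sustained thrashing should, because a process stuck in GC is effectively the performance cliff described above.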

Better ways to avoid cascading failure are overprovisioning (RAM is pretty cheap for servers) and load shedding / graceful degradation at the application layer, coupled with care in client-side retry logic: avoiding retries that accidentally multiply load on an already struggling cluster, and using exponential backoff on any retry.



