Based upon outages when a single AWS zone goes offline, a whole hell of a lot mo...

pojzon · on April 20, 2022

At my previous job we were often able to saturate all similar instance types in a given region.

All AWS had to say was “we are working hard on providing additional capacity”.

The same often happens on Black Friday, when companies scale up their platforms ahead of time, just in case, because there might be no capacity left on AWS.

oneplane · on April 21, 2022

We have that 'issue' with many of our Spark workloads, where none of our desired capacity is available as spot. But we have a baseline of instances reserved up front for anything realtime anyway, so with a bit of planning it's a non-issue.

It does cost money, but then again, so does not running certain processes. The trick becomes calculating the point at which the costs outweigh the benefits, and that calculation applies everywhere.

oneplane · on April 21, 2022

If you're in a single AZ you're not in multiple AZs. Migrating between AZs isn't multi-zone either. Running at 130% capacity in three AZs, that is multi-AZ (to us, in our availability configuration). If an AZ goes down (which in some regions we use has happened 0 times) we lose about 30% capacity, but since that's our margin of scaling anyway we can keep going as-is, even if there were no 15% additional capacity available in the remaining AZs.
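To put rough numbers on that kind of margin, here's a minimal sketch. It assumes capacity is spread evenly across AZs and is expressed as a fraction of the load you actually need to serve (1.0 = 100%). The function names are made up for illustration; nothing here is an AWS API.

```python
def remaining_capacity(provisioned: float, zones: int, failed: int = 1) -> float:
    """Capacity left, as a fraction of required load, after `failed` zones drop out.

    Assumes `provisioned` is spread evenly over `zones` (an assumption,
    real fleets are rarely perfectly balanced).
    """
    if zones <= 0 or not 0 <= failed <= zones:
        raise ValueError("need zones > 0 and 0 <= failed <= zones")
    return provisioned * (zones - failed) / zones


def required_provisioning(zones: int, failed: int = 1) -> float:
    """Provisioned capacity needed so the surviving zones still cover 100% of load."""
    if zones <= 0 or not 0 <= failed < zones:
        raise ValueError("need zones > 0 and 0 <= failed < zones")
    return zones / (zones - failed)


# 130% of required load over three AZs, one AZ lost
print(round(remaining_capacity(1.30, zones=3), 3))
# Provisioning needed to ride out one AZ loss with no scale-up at all
print(required_provisioning(zones=3))
```

Under those assumptions, losing one of three AZs takes out a third of whatever is provisioned, so how comfortable "keep going as-is" feels depends on how much of the baseline 100% is already headroom and how much extra the remaining AZs can actually hand you.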

Some sort of manual active-standby configuration really doesn't require AWS or a cloud. That stuff is the same '90s implementation it has always been, and it practically boils down to moving your RAID1 USB HDDs from one PC to another PC and booting that bad boy up as 'failover'. (Yes, that's an example, and yes, it's an extreme one.)

If you have capacity planning, and you plan accordingly, you take service provider limits into account, just like you would with anything else. Having two power feeds into a distribution warehouse doesn't help much if neither can handle 100% of the load in an industrial park. So while having two feeds might seem 'redundant' to a single tenant or customer, it's only really redundant if either can supply all the demand of all connected customers.

The same applies to fiber connections: plenty of fake-redundant connections that are suggested by customers to be 'redundant' turn out to end up at the same PoP, and if the PoP goes down your redundant fibers are worthless. In the same logistics distribution scenario, your trucks can't deliver goods if the destination warehouse itself is offline, and now you need redundant warehouses.

That's obviously a weird thing to do at smaller scales, but the fact remains that AWS having an AZ go down is only a small piece of the puzzle, and only really a problem if you didn't plan for it appropriately.