I'd be interested to know more about this as I've been curious for a while about how people do this stuff.
No matter how many instances you have, surely you'll still be hosed if they all go down at the same time? Or if there's rolling downtime taking out instances faster than you can bring the restarted instances up to speed?
So if you're replicated across three availability zones you're not truly prepared for any instance to go down at any time - you're only prepared for two thirds of your instances to go down at a time?
There are lots of ways to set it up. I should note first that most interesting datastores you'll run in the cloud will end up needing instance stores for performance reasons anyway--you want sequential read perf, you know?--and so this is really just extending it to other nodes that, if you're writing twelve-factor apps, should pop back up without a hitch anyway. (If you're not writing twelve-factor apps...why not?)
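By "pop back up without a hitch" I mean the twelve-factor sense: the node keeps nothing precious on local disk and takes its configuration from the environment, so the replacement instance the ASG spins up is indistinguishable from the one that died. A minimal sketch of that idea, with made-up environment variable names:

```java
// Minimal twelve-factor-style bootstrap: all config comes from the environment,
// nothing precious lives on local disk, so a freshly provisioned replacement
// node behaves exactly like the one it replaced. Variable names are made up
// for illustration.
public final class NodeConfig {
    final String clusterSeeds;  // e.g. comma-separated peer addresses
    final int port;

    private NodeConfig(String clusterSeeds, int port) {
        this.clusterSeeds = clusterSeeds;
        this.port = port;
    }

    static NodeConfig fromEnvironment() {
        String seeds = System.getenv("CLUSTER_SEEDS");  // hypothetical variable
        String port = System.getenv("PORT");            // hypothetical variable
        if (seeds == null || port == null) {
            throw new IllegalStateException("CLUSTER_SEEDS and PORT must be set");
        }
        return new NodeConfig(seeds, Integer.parseInt(port));
    }

    public static void main(String[] args) {
        NodeConfig config = NodeConfig.fromEnvironment();
        System.out.println("joining cluster via " + config.clusterSeeds
                + " on port " + config.port);
    }
}
```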
Straight failover, with 1:1 mirroring on all nodes? You're massively degraded, unless you've significantly overprovisioned in the happy case, but you have all your stuff. Amazon will (once it unscrews itself from the thrash) start spinning up replacement machines in healthy AZs to replace the dead machines, and if you've done it right they can come up and rejoin the cluster, getting synced back up. (Building that part, auto-scaling groups and replacing dead instances, is probably the hardest part of this whole thing, even with a provisioner like Chef or Puppet.)

If you're using a quorum for leader election or you're replicating shard data, being in three AZs actually only protects you from a single AZ interruption. Amazon has lost (or partially lost, I wasn't doing AWS at the time so I'm a little fuzzy) two AZs simultaneously before, and so if you're that sensitive to the failure case you want five AZs (quorum/sharding of 3, so you can lose two). I generally go with three, because in my estimation the likelihood of two AZs going down is low enough that I'm willing to roll the dice, but reasonable people can totally differ there.
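To make the arithmetic concrete (this is just standard majority-quorum math, nothing AWS-specific): with one voting replica per AZ, a majority of N members is floor(N/2)+1, so the number of AZs you can lose is N minus that.

```java
// Standard majority-quorum arithmetic: with one voting replica per AZ,
// how many AZs can you lose and still elect a leader / keep a quorum?
public final class QuorumMath {
    static int tolerableFailures(int replicas) {
        int majority = replicas / 2 + 1;   // smallest majority of `replicas`
        return replicas - majority;        // members you can lose and keep quorum
    }

    public static void main(String[] args) {
        System.out.println("3 AZs -> can lose " + tolerableFailures(3)); // prints 1
        System.out.println("5 AZs -> can lose " + tolerableFailures(5)); // prints 2
    }
}
```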
If Amazon goes down entirely, yes, you're hosed, and you need to restore from your last S3 backup. While that is possible, I consider it the least likely case (though you should have DR procedures for bringing it back, and you should test them). You have to figure out your acceptable level of risk; for mine, "total failure" is a low enough likelihood, and if it does happen the rest of the Internet is likely to be so completely boned that I should have time to come back online.
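Mechanically, "restore from your last S3 backup" is just: find the newest object under your backup prefix and pull it down. A rough sketch with the AWS SDK for Java (bucket name, prefix, and restore path are placeholders, and real DR tooling would also verify checksums and walk paginated listings):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.io.File;

// Pulls down the most recent object under a backup prefix. Bucket, prefix,
// and target path are hypothetical; a real restore script would also verify
// integrity and page through listObjects for large backup histories.
public final class RestoreLatestBackup {
    public static void main(String[] args) {
        AmazonS3 s3 = new AmazonS3Client(); // credentials from the default chain
        ObjectListing listing = s3.listObjects("my-backup-bucket", "backups/");

        S3ObjectSummary latest = null;
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            if (latest == null
                    || summary.getLastModified().after(latest.getLastModified())) {
                latest = summary;
            }
        }
        if (latest == null) {
            throw new IllegalStateException("no backups found under backups/");
        }

        s3.getObject(new GetObjectRequest("my-backup-bucket", latest.getKey()),
                     new File("/var/restore/" + new File(latest.getKey()).getName()));
        System.out.println("Restored " + latest.getKey());
    }
}
```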
Thing is, EBS is not a panacea for any of this; I am pretty sure that a fast rolling bounce would leave most people not named Netflix dead on the ground and not really any better off for recovery than somebody who has to restore a backup.
> (Building that part, auto-scaling groups and replacing dead instances, is probably the hardest part of this whole thing, even with a provisioner like Chef or Puppet.)
There are totally ways to do it, but they involve a good bit of work. I like Archaius backed by ZooKeeper for configs (though to make it work with Play, as I have a notion to do, I have a bunch of work ahead of me...).
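Roughly the wiring I have in mind, going from memory on the archaius-zookeeper module, with a placeholder connect string and znode path:

```java
import com.netflix.config.ConfigurationManager;
import com.netflix.config.DynamicPropertyFactory;
import com.netflix.config.DynamicStringProperty;
import com.netflix.config.DynamicWatchedConfiguration;
import com.netflix.config.source.ZooKeeperConfigurationSource;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Rough sketch of Archaius reading dynamic config out of ZooKeeper via Curator.
// The ZooKeeper connect string and the config root znode are placeholders.
public final class ZkConfigBootstrap {
    public static void main(String[] args) throws Exception {
        CuratorFramework curator = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        curator.start();

        ZooKeeperConfigurationSource zkSource =
                new ZooKeeperConfigurationSource(curator, "/config/myapp");
        zkSource.start();

        // Expose the watched ZooKeeper source through DynamicPropertyFactory.
        ConfigurationManager.install(new DynamicWatchedConfiguration(zkSource));

        DynamicStringProperty greeting = DynamicPropertyFactory.getInstance()
                .getStringProperty("myapp.greeting", "hello");
        System.out.println(greeting.get());
    }
}
```

The point being that the property picks up changes to the znode in place, no redeploy needed.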