Hacker News

As someone who was inconvenienced by the outage, and had no mitigation strategy in place, I DON'T blame Heroku. The blame falls squarely on me (the lone tech in our company) for not having researched how to distribute services across providers alongside Heroku, or fail over to something else, or whatever the proper term is.

I've been googling like mad since this morning, finding a few mostly-unanswered StackOverflow questions and a smattering of blog posts, but I haven't learned much. The only clear-cut answers I've seen are:

1. Hire a sysadmin who knows more than you do (But the whole point is that I want to learn this myself!).

2. Pay for a service that will host in multiple geographic locations for you, and do the switchover (recovery? fallback? I don't know my terms here) for you.

3. A few mentions of "load balancers" and "heartbeat monitors". Sounds self-explanatory, and these are my current search terms.
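For what it's worth, a heartbeat monitor can be as simple as a loop that polls a health-check URL and triggers some failover action after several consecutive failures. A minimal sketch, assuming a plain HTTP health endpoint (the URL, threshold, and failover hook below are all illustrative, not any particular product's API):

```python
import time
import urllib.request

def check(url, timeout=5):
    """Return True if the health-check URL responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def monitor(url, on_failure, threshold=3, interval=30,
            check_fn=check, sleep_fn=time.sleep):
    """Poll `url` every `interval` seconds; call `on_failure` once after
    `threshold` consecutive failed checks. A single failed check is
    ignored so transient blips don't trigger a failover."""
    failures = 0
    while True:
        if check_fn(url):
            failures = 0
        else:
            failures += 1
            if failures >= threshold:
                on_failure()
                return
        sleep_fn(interval)
```

In practice `on_failure` would be your DNS switch or a pager alert; the `check_fn`/`sleep_fn` parameters just make the loop easy to test.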

Any suggestions on where to start acquiring this sort of skill? I'm prepared to teach myself anything, but the problem is not knowing the terms for what I want to learn.

EDIT: Well, just watching this thread is helping a bit.



How do you not blame Heroku? You are paying them a good chunk of money to handle not only hosting, but failover strategies, multiple locations, etc. If you have to worry about any of those things they aren't doing their job.

That's like choosing MySQL as your database and then, when an update with a huge bug breaks your site, saying "totally my fault that I don't have a version of the site that uses PostgreSQL."


To clarify part of this, Heroku doesn't have the same position as (say) a router manufacturer because they are offering an all-in-one 'platform'. And unlike MySQL, they are charging you a decent rate for using it.


Failover across geographically distributed datacenters is a challenge that doesn't get talked about all that much.

As a small company you probably can't easily get your own IP block allocated (that I know of), so BGP [0] isn't really an option, and the best you can do is probably DNS switching. Use a good DNS provider and set your TTLs to something low, like 30 seconds or 1 minute. Then, when you have an outage, change the DNS entry to point to a secondary datacenter, which would serve a static error page or a reduced-functionality site. There's some debate about whether low DNS TTLs increase users' request times, but we haven't seen it.
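The switchover itself usually amounts to one API call to your DNS provider to repoint an A record. A rough sketch of what that looks like against Amazon Route 53's change-batch format (the hostname, IP, and zone ID are placeholders; the actual call would go through boto3's `change_resource_record_sets`):

```python
def failover_change_batch(hostname, secondary_ip, ttl=30):
    """Build a Route 53 change batch that repoints `hostname` at the
    secondary datacenter's IP, keeping the TTL short so the next
    switch propagates quickly."""
    return {
        "Comment": "failover to secondary datacenter",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": hostname,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": secondary_ip}],
            },
        }],
    }

# Applied with boto3, roughly:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZEXAMPLE",
#     ChangeBatch=failover_change_batch("www.example.com.", "203.0.113.10"))
```

Other providers (Dyn, etc.) expose equivalent record-update APIs; the shape differs but the idea is the same.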

There are some companies that will handle the monitoring and switchover for you (Dyn comes to mind), but we prefer to switch over manually for the time being. We have a Big Red Button sinatra app that reports the status of the site and lets you fail over to the secondary and recover when the primary returns; I'm planning on open-sourcing it once it gets some documentation.
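There isn't much behind a button like that: at its core it's a tiny bit of state plus whatever performs the DNS switch. A minimal sketch of that core, with the switch functions left as injectable placeholders (this is an illustration of the idea, not the actual app mentioned above):

```python
class FailoverSwitch:
    """Tracks which datacenter is live and flips between them.
    `point_to_primary`/`point_to_secondary` would wrap the DNS API
    calls; here they are plain callables so the logic stays testable."""

    def __init__(self, point_to_primary, point_to_secondary):
        self.point_to_primary = point_to_primary
        self.point_to_secondary = point_to_secondary
        self.status = "primary"

    def fail_over(self):
        """Repoint traffic at the secondary (no-op if already there)."""
        if self.status != "secondary":
            self.point_to_secondary()
            self.status = "secondary"

    def recover(self):
        """Repoint traffic back at the primary once it returns."""
        if self.status != "primary":
            self.point_to_primary()
            self.status = "primary"
```

A web UI (Sinatra, Flask, whatever) then just exposes `status`, `fail_over`, and `recover` behind a status page and two buttons.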

I think the reason failover doesn't get talked about as much in the startup world is just because it's hard to do and the costs are disproportionately high for a small company unless availability is really critical to you. For most people, just using multiple availability zones on EC2 is probably sufficient.

[0] http://ajohnstone.com/achives/high-availability-across-multi...


Even vendors that promise that, e.g. Amazon, aren't infallible.


The gist I'm getting from a few places seems to be: have separate hosts/service providers, and do the load balancing yourself, or make the switch yourself when one fails. I've yet to find many detailed examples, though, as most similar articles deal with load balancing within your own locally-managed network or co-located set of machines.

The more generalized "Cloud plus Dedicated" fallback/load-balancing setup seems fairly involved, and raises a lot of other questions, but at least I've got a path to follow now. It would also be more expensive, as a backup server might sit around doing nothing much of the time.

Then again, it would pay for itself in satisfied customers after just a single event.


The cost of running your own infrastructure at this level is slower development and ongoing hassle, vs. Heroku.

Unless your application absolutely must have higher availability than Heroku provides, it's probably not worth the effort. The easiest thing to do is to put something like Cloudflare in front of Heroku, so at least when Heroku is down you can serve a static page informing customers of the problem and the estimated time to fix.



