Hacker News

As someone who was inconvenienced by the outage, and had no mitigation strategy in place, I DON'T blame Heroku. The blame falls squarely on me (the lone tech in our company) for not having researched how to distribute services across providers alongside Heroku, or fail over to something else, or whatever the proper term is.

I've been googling like mad since this morning, finding a few mostly-unanswered StackOverflow questions and a smattering of blog posts, but I haven't learned much. The only clear-cut answers I've seen are:

1. Hire a sysadmin who knows more than you do (But the whole point is that I want to learn this myself!).

2. Pay for a service that will host in multiple geographic locations for you, and do the switchover (recovery? fallback? I don't know my terms here) for you.

3. A few mentions of "load balancers" and "heartbeat monitors". Sounds self-explanatory, and these are my current search terms.
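For what it's worth, a heartbeat monitor can be as simple as a loop that polls a health-check URL and triggers some failover action after several consecutive failures. A minimal sketch, assuming a plain HTTP health endpoint (the URL, threshold, and failover hook below are all illustrative, not any particular product's API):

```python
import time
import urllib.request

def check(url, timeout=5):
    """Return True if the health-check URL responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def monitor(url, on_failure, threshold=3, interval=30,
            check_fn=check, sleep_fn=time.sleep):
    """Poll `url` every `interval` seconds; call `on_failure` once after
    `threshold` consecutive failed checks. A single failed check is
    ignored so transient blips don't trigger a failover."""
    failures = 0
    while True:
        if check_fn(url):
            failures = 0
        else:
            failures += 1
            if failures >= threshold:
                on_failure()
                return
        sleep_fn(interval)
```

In practice `on_failure` would be your DNS switch or a pager alert; the `check_fn`/`sleep_fn` parameters just make the loop easy to test.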

Any suggestions on where to start acquiring this sort of skill? I'm prepared to teach myself anything, but the problem is not knowing the terms for what I want to learn.

EDIT: Well, just watching this thread is helping a bit.



How do you not blame Heroku? You are paying them a good chunk of money to handle not only hosting, but failover strategies, multiple locations, etc. If you have to worry about any of those things they aren't doing their job.

That's like choosing MySQL as your database and then, when an update with a huge bug breaks your site, saying "totally my fault that I don't have a version of the site that uses PostgreSQL."


To clarify part of this, Heroku doesn't have the same position as (say) a router manufacturer because they are offering an all-in-one 'platform'. And unlike MySQL, they are charging you a decent rate for using it.


Failover across geographically distributed datacenters is a challenge that doesn't get talked about all that much.

As a small company you probably can't easily get your own IP block allocated (that I know of), so BGP [0] isn't really an option, and the best you can do is probably DNS switching. Use a good DNS provider and set your TTLs to something low, like 30 seconds or 1 minute. Then, when you have an outage, change the DNS entry to point to a secondary datacenter, which would serve a static error page or a reduced-functionality site. There's some debate about whether low DNS TTLs increase users' request times, but we haven't seen it.
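The switchover itself usually amounts to one API call to your DNS provider to repoint an A record. A rough sketch of what that looks like against Amazon Route 53's change-batch format (the hostname, IP, and zone ID are placeholders; the actual call would go through boto3's `change_resource_record_sets`):

```python
def failover_change_batch(hostname, secondary_ip, ttl=30):
    """Build a Route 53 change batch that repoints `hostname` at the
    secondary datacenter's IP, keeping the TTL short so the next
    switch propagates quickly."""
    return {
        "Comment": "failover to secondary datacenter",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": hostname,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": secondary_ip}],
            },
        }],
    }

# Applied with boto3, roughly:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZEXAMPLE",
#     ChangeBatch=failover_change_batch("www.example.com.", "203.0.113.10"))
```

Other providers (Dyn, etc.) expose equivalent record-update APIs; the shape differs but the idea is the same.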

There are some companies that will handle the monitoring and switchover for you (Dyn comes to mind), but we prefer to switch over manually for the time being. We have a Big Red Button sinatra app that reports the status of the site and lets you fail over to the secondary and recover when the primary returns; I'm planning on open-sourcing it once it gets some documentation.
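There isn't much behind a button like that: at its core it's a tiny bit of state plus whatever performs the DNS switch. A minimal sketch of that core, with the switch functions left as injectable placeholders (this is an illustration of the idea, not the actual app mentioned above):

```python
class FailoverSwitch:
    """Tracks which datacenter is live and flips between them.
    `point_to_primary`/`point_to_secondary` would wrap the DNS API
    calls; here they are plain callables so the logic stays testable."""

    def __init__(self, point_to_primary, point_to_secondary):
        self.point_to_primary = point_to_primary
        self.point_to_secondary = point_to_secondary
        self.status = "primary"

    def fail_over(self):
        """Repoint traffic at the secondary (no-op if already there)."""
        if self.status != "secondary":
            self.point_to_secondary()
            self.status = "secondary"

    def recover(self):
        """Repoint traffic back at the primary once it returns."""
        if self.status != "primary":
            self.point_to_primary()
            self.status = "primary"
```

A web UI (Sinatra, Flask, whatever) then just exposes `status`, `fail_over`, and `recover` behind a status page and two buttons.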

I think the reason failover doesn't get talked about as much in the startup world is just because it's hard to do and the costs are disproportionately high for a small company unless availability is really critical to you. For most people, just using multiple availability zones on EC2 is probably sufficient.

[0] http://ajohnstone.com/achives/high-availability-across-multi...


Even vendors that promise that, e.g. Amazon, aren't infallible.


The gist I'm getting from a few places seems to be: have separate hosts/service providers, and do the load balancing yourself, or make the switch yourself when one fails. I've yet to find many detailed examples, though, as most similar articles deal with load balancing within your own locally-managed network or co-located set of machines.

The more generalized "Cloud plus Dedicated" fallback/load-balancing setup seems fairly involved, and raises a lot of other questions, but at least I've got a path to follow now. It would also be more expensive, as a backup server might sit around doing nothing much of the time.

Then again, it would pay for itself in satisfied customers after just a single event.


The cost of running your own infrastructure at this level is slower development and ongoing hassle, vs. Heroku.

Unless your application absolutely must have higher availability than Heroku provides, it's probably not worth the effort. The easiest thing to do is to put something like Cloudflare in front of Heroku, so at least when Heroku is down you can serve a static page informing customers of the problem and the estimated time to fix.



