
I have to ask: is there even a practical purpose for this? Is there even some remote screwball application that requires one machine to handle 1,000,000 connections? One of the things I like about coming to HN is that the items on the front page are often actionable pieces of advice or clever and interesting hacks. I don't feel that "$LANG can do $LARGE_NUMBER of things" fits the bill. For example:

  C - Handling 1 Million Concurrent Connections 
  Java - Handling 1 Million Concurrent Connections 
  Javascript - Handling 1 Million Concurrent Connections 
  Go - Handling 1 Million Concurrent Connections
If it were 10^10 connections, then we'd be talking about some clever hacks to get that to work.


WhatsApp had over 2 million users connected to their (Erlang) server last year.

http://blog.whatsapp.com/index.php/2012/01/1-million-is-so-2...

Here is the same in Erlang for reference (from a few years ago; I would be interested to see whether there is a more efficient way now):

http://www.metabrew.com/article/a-million-user-comet-applica...
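For reference, a minimal sketch of what the Ruby side of "lots of mostly idle connections" can look like; this assumes the eventmachine gem and is not taken from either article:

  require 'eventmachine'   # assumes the eventmachine gem; not from the article

  module HoldOpen
    def receive_data(data)
      send_data "ok\n"     # the point is holding idle sockets open, not doing work
    end
  end

  EM.run do
    # each connection is a small object owned by the reactor; the practical
    # limits are file descriptors (ulimit -n) and per-connection memory
    EM.start_server '0.0.0.0', 8080, HoldOpen
  end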


Thank you so much for that last link especially. Interesting stuff :-)


Start with Ruby as a backend for online games...

Continue with Ruby as a backend for audio/video chats...

Consider Ruby for streaming podcasts...

etc. etc.


What happens when one of the fully loaded 1-million-connection nodes goes bang? That's potentially a million users getting a poor experience.

Re-establishing a million connections at once is going to be hard on the network: the original million were built up gradually over time, yet now they're all being re-established Big Bang style.
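One common way clients soften that (just a sketch of the usual technique, not something from the article) is to reconnect with exponential backoff plus random jitter rather than all at once:

  require 'socket'

  # hypothetical client-side reconnect loop: exponential backoff plus jitter,
  # so a million dropped clients don't all hit the server in the same second
  def reconnect(host, port)
    delay = 1
    begin
      TCPSocket.new(host, port)
    rescue SystemCallError
      sleep(delay * rand)              # jitter spreads the retries out
      delay = [delay * 2, 300].min     # cap the backoff at five minutes
      retry
    end
  end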


For any given user, the probability of the one machine with everyone on it going bang is similar to the probability of failure of the particular server they would be connected to in a horizontally scaled scenario. However, the cost of redundancy may be higher if it means replicating 100% of the main system; on the other hand, a big system may be designed for high uptime.


Would the probability not be less in this case? In general, fewer moving parts = less chance of outage. E.g. if a device is rated for 300,000 hours MTBF and you have 2 of them, their individual MTBF remains the same, but your chance of experiencing an outage in at least one of them has roughly doubled, because there are two that can fail.
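Rough numbers for that (assuming independent failures, and approximating the chance of failing within a year as exposure/MTBF):

  hours_per_year = 24 * 365
  mtbf   = 300_000.0                  # hours, from the example above
  p_one  = hours_per_year / mtbf      # ~0.03: one box failing within a year
  p_any  = 1 - (1 - p_one) ** 2       # ~0.06: at least one of two boxes failing
  puts p_one, p_any                   # roughly double, as described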

It's more the impact side of the risk equation I'm thinking of than the probability.

EDIT: typo


Depends whether you look at it from the ops point of view or the end-user point of view. You expressed concern about 1 million customers simultaneously having a bad experience. For a given end user, if the hardware is equally reliable, the odds of something happening are the same whether they are sharing with 1 million others or 100,000 (or even have the server to themselves). On the ops side there is more to go wrong and failures will be more frequent, but each one affects fewer end users.

The positive in the one-big-machine scenario is that you have the potential to put serious effort into keeping it reliable. The advantage in the many-machines scenario is that there is a better chance you have well-tested failover solutions.

It is the combination of impact and risk that I am discussing.
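Rough numbers on that trade-off (made-up failure rate, purely illustrative):

  users         = 1_000_000
  fails_per_box = 1.0                  # made-up: each box fails once a year
  one_big_box   = fails_per_box * users                 # 1 incident/year, 1,000,000 users hit
  ten_small     = (fails_per_box * 10) * (users / 10)   # 10 incidents/year, 100,000 users each
  puts one_big_box == ten_small        # => true: same expected user-impact per year,
                                       # it just arrives as one big lump or ten small ones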



