Scaling Facebook Chat to 70 Million Active Users Almost Overnight (facebook.com)
93 points by edw519 on May 15, 2008 | hide | past | favorite | 42 comments


As previous news articles state, Facebook has some implementation of XMPP going on. XMPP was designed from the ground up to deal with exactly the issues he highlights, and is the ideal real-time implementation for any system where everyone is expected to be aware of the status of everyone else on the network (versus the traditional "poll the server every x seconds" approach).

Even if Facebook isn't using XMPP per se, they certainly have full access to its implementations and source code for internal use.

Granted, Facebook _does_ have the "slight" challenge of having 70 million active users, next to which nearly everyone else's IM/XMPP networks are a mere pittance; but the core framework and algorithms are wholly addressed and implemented in the XMPP standard.

It's one thing to make a more efficient implementation of an already-existing standard that scales damn decently, versus _designing_ a whole new system to serve their needs.

Note that the article doesn't once mention XMPP though.
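To make the contrast concrete, here is a toy push-based presence registry in Python. It's purely illustrative (the class and method names are made up, and this is nothing like any real XMPP server's internals), but it shows the subscribe-then-push shape that XMPP standardizes, as opposed to every client polling the server:

```python
from collections import defaultdict

class PresenceHub:
    """Toy push-based presence registry (hypothetical sketch, not any
    real XMPP server's implementation)."""

    def __init__(self):
        self.subscribers = defaultdict(set)   # user -> set of watchers
        self.inbox = defaultdict(list)        # watcher -> delivered updates

    def subscribe(self, watcher, user):
        # Like an XMPP presence subscription: watcher asks to see user's status.
        self.subscribers[user].add(watcher)

    def set_status(self, user, status):
        # Push the change to every subscriber immediately -- no polling loop.
        for watcher in self.subscribers[user]:
            self.inbox[watcher].append((user, status))

hub = PresenceHub()
hub.subscribe("alice", "bob")
hub.set_status("bob", "online")
print(hub.inbox["alice"])   # [('bob', 'online')]
```

The key point is that the server does work only when a status actually changes, instead of answering "any news?" requests from every client every few seconds.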


Err... as far as I understood their Jabber/XMPP announcement, it's used for interoperability and integration with third-party products only, not for the internal implementation.

So it's only natural that the article doesn't mention XMPP.


I can't say that I prefer in-browser chats, but I can appreciate the complexity of the solution. Scalability is the new manual memory management. I can't help but think that somebody's going to come up with the scalability equivalent of a garbage collector and make our lives a lot easier.


That's Erlang in a nutshell, though, isn't it?


Erlang still has a lot of warts. Just wait until we have cheap threads in the JVM.


The JVM had them a long time ago. They're gone, and it will be very hard to bring them back.


I think instant scalability will come from an application server, something like GlassFish for example.

Deploy your application to GlassFish and let your AS scale it.


That doesn't really handle data access / storage bottlenecks, unless I'm missing something.


Or Google App Engine.


The naive implementation of sending a notification to all friends whenever a user comes online or goes offline

Did I miss it, or does the note not mention how they actually implemented the notification?
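For a sense of why the naive scheme hurts at this scale, here's a back-of-the-envelope calculation. The 70 million figure is from the article; the friend and session counts below are assumptions for illustration, not Facebook's numbers:

```python
# Back-of-the-envelope cost of the naive presence fan-out, with made-up
# averages (only the 70M active-user figure comes from the article).
active_users = 70_000_000
avg_friends = 100            # assumed average friend count
sessions_per_user_day = 2    # assumed sign-ins per user per day

# Each session produces two presence events: one online, one offline.
events_per_day = active_users * sessions_per_user_day * 2
# The naive scheme sends one notification per friend per event.
notifications_per_day = events_per_day * avg_friends

print(notifications_per_day)                   # 28000000000 (~28 billion/day)
print(round(notifications_per_day / 86_400))   # 324074 messages/sec on average
```

Even with these conservative guesses, that's hundreds of thousands of presence messages per second just for logins and logouts, before a single chat message is sent.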


I didn't see an explanation for an alternative solution. I've been trying to think of one all morning and can't!

Maybe he just means all friends whether they are logged in or not?


No, they're not sending notifications at every event. As far as I can tell, they're using an asynchronous algorithm that lazily propagates events and provides no responsiveness guarantees (sort of ultra-mushy, stretchy, unreal-time guarantees).


Sorry, but how is that different from sending all notification events to all users? You're still sending all notification events to all users, whether you do it lazily or not!


I guess they meant they don't send a notification per message, but rather batch them somehow.


Exactly! This means people end up with information about the availability of others who have just logged on or off that is inconsistent with reality!
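A guess at what "batch them somehow" might look like: a coalescing buffer that keeps only the latest status per user and flushes on a timer. This is a hypothetical sketch, not Facebook's actual design, but the coalescing is exactly what makes the information temporarily inconsistent with reality:

```python
class BatchedPresence:
    """Toy coalescing batcher -- a guess at 'batch them somehow',
    not the real design."""

    def __init__(self):
        self.pending = {}   # user -> latest status; later events overwrite earlier

    def record(self, user, status):
        # Someone who logs on and off within one flush interval ("flapping")
        # only contributes their final state -- cheaper, but stale for watchers.
        self.pending[user] = status

    def flush(self):
        # One bulk message per interval instead of one message per event.
        batch, self.pending = self.pending, {}
        return batch

b = BatchedPresence()
b.record("bob", "online")
b.record("bob", "offline")    # bob flapped; only the final state survives
b.record("carol", "online")
print(b.flush())              # {'bob': 'offline', 'carol': 'online'}
```

The trade-off is explicit here: three events became one batch of two entries, and anyone watching bob never learned he was briefly online.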


The dark launch idea is neat:

"The secret for going from zero to seventy million users overnight is to avoid doing it all in one fell swoop. We chose to simulate the impact of many real users hitting many machines by means of a "dark launch" period in which Facebook pages would make connections to the chat servers, query for presence information and simulate message sends without a single UI element drawn on the page. With the "dark launch" bugs fixed, we hope that you enjoy Facebook Chat now that the UI lights have been turned on"

Oh, they spilled the beans on using Erlang two weeks ago:

http://news.ycombinator.com/item?id=179064


This is quite a remarkable piece of software engineering, very impressive stuff. I'm really glad they're open enough about it to share their techniques.

Also, I'd never heard of doing a "dark launch" before, but it sounds like a fantastic way to get early feedback from users.


It sounds like the "dark launch" wasn't visible to users; it was simply to test the capacity of their servers. It's a very interesting idea.

They did roll out the UI over the course of several hours though.


Also, they did stage the launch over a few weeks. I noticed it appear on my facebook page (because I'm in the Stanford network) well before it appeared on most of my friends'.


I think the takeaway here is the "dark launch" mentioned in the last paragraph, not necessarily the behind-the-scenes tech. Although it's a nice win for Erlang here.

First time I have heard a company publicly mention pushing features behind the scenes and testing in real time. Ajax makes this kind of functionality possible nowadays.

Good job, Facebook.


Dark launching is cool, but Google did it last year with Gmail Chat: http://video.google.com/videoplay?docid=6202268628085731280


Very interesting stuff. I wish more companies would post details like this.


That indeed is amazing. Who would have thought Zuckerberg's team would be this open about their innovations? They are usually quite secretive about their future plans. I think having this kind of conversation with their user/developer community is great. More companies need to do this and get their hands dirty with the technical stuff, not just give a high-level talk.


This development story is so awesome it saddens me that it's such a terrible idea for a product.

The world really doesn't need another proprietary chat standard, especially one that locks people into a website.



I correctly guessed at the use of Erlang for the web servers; persistent connections and pushing are a must, and Apache is hardly designed for so many persistent processes. Thrift was also a pretty easy call considering it's a FB project; I still want to check that out, too.

The information wasn't incredibly in-depth, but it's very cool and useful nonetheless to read about implementations like this on such a large scale. The chances of me ever creating something with the scale and resources that FB requires are pretty slim, but it's gratifying to know I've at least got a rough idea of some good ways to do it.

Now, if we could just get Twitter to do the same, perhaps someone could give them a few pointers... ;)

"Did I miss it, or does the note not mention how they actually implemented the notification?"

No, it doesn't go over that implementation, though it piqued my curiosity nonetheless. I would assume it's a time-based check on status rather than a real-time representation.


As I look at this article now, it reads "userbase" not "active users." Did an earlier version try to claim facebook has 70 million active users?


Considering they have 10,000+ servers, I don't think "scaling" is that big a feat.

Say every active user is on at the same time (70 million) and they have 10,000 servers. That's only 7,000 users per server??? Not great, IMHO.

Maybe if they only had 1,000 servers, it'd be a little more impressive.


Coordinating something that large is a feat. At that scale, everything is more difficult: deploying updates to 10,000 machines, redundancy, upgrading, etc. Not to mention avoiding bandwidth, memory, and process limitations, load balancing, and testing it all, which required a clever solution.


I'm sure a large number of these servers is used for data crunching/warehousing and ad serving -- not just for rendering pages.


Yeah, 10,000 servers is a serious number -- is that actually an accurate figure?


"Facebook does not disclose the number of servers it operates. But research firm Data Center Knowledge puts the tally at about 10,000. The slug of cash will help Facebook buy approximately 50,000 more servers"

60,000 servers? Jesus Christ. Are they planning to scale to accommodate alien users or something?


What I don't understand is why Facebook feels slow each time I log on despite these 10K servers -- it's often one of the slowest sites I visit!

If they intend to have 60K servers, they'd better keep a slug of that cash around to pay the electric bill!


I realized recently that Facebook is trying to rebuild the entire Web inside their site. Home pages, check. Email, check. IM, calendaring, photo sharing, dating, check. Next they will add VoIP, photo editing, an office suite...

So how many servers do you need to replace the entire Web? 60,000 doesn't sound like enough.


The 10k number was spilled a few weeks ago at a conference, along with numbers from some other groups like MySQL and others.


Also, the three guys in a garage don't have the luxury of access to so many servers and so much money...


The only reason this is considered impressive is because so many other services have set the bar so low. This isn't rocket science, it just requires thinking ahead and designing for scale.


What would you call rocket science, then? This is pretty impressive.

Erlang does a lot of the heavy lifting but even if it did everything out of the box (and it didn't), gluing it all together is no small feat.


A chat server? Yawn. And it's not for 70 million people at once, that's user base, not simultaneous logins.

AIM, Bonjour, Gadu-Gadu, Google Talk, GroupWise, ICQ, IRC, MSN, QQ, SILC, SIMPLE, SameTime, XMPP, Yahoo, this is a solved problem.


Agreed. I find it ironic, since a piece like this is likely written at least partially in an effort to attract programmers to Facebook. To me it reads: "Come work for Facebook and re-invent online chat. Again." Now, if the article had been about how they pushed the state of the art, I would be pretty interested, but none of this is new technology.

As for the dark launch thing, it is a fancy trick, but there are ways of doing load testing in an automated system by having test servers simulate the load from real users. That can usually give you much better data without wasting bandwidth, slowing down users' experiences, etc.


Not to mention many of those services do have 70 million people on at once.


Yes, it has been done before... But I don't think it has been done in an environment where you need to scale up quickly.

They had a problem, they solved it in a very cool fashion using the best tool for the job (Erlang), they thought of a very good way of testing the application, and they were gracious enough to share their experiences with us, other developers...

Seems like a win all around, so I'm not going to complain.



