Hacker News

> especially around resharding. Foursquare had a 17 hour downtime in 2010 due to hitting a sharding edge case

So I opened the link:

http://highscalability.com/blog/2010/10/15/troubles-with-sha...

> What Happened? The problem went something like: Foursquare uses MongoDB to store user data on two EC2 nodes, each of which has 66GB of RAM

There's your problem. Using Mongo sharding in production in 2010.



> Users, written to the database, did not distribute evenly over the two shards. One shard grew to 67GB, larger than RAM, and the other to 50GB

Main pro tip: think very hard about how you're sharding your data: which key you shard on, whether it's hash-based or range-based, and how you intend to rebalance or split it more granularly should the need arise.
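A minimal sketch of why that choice matters, in plain Python (the shard count, split point, and user IDs are made up for illustration). Hash sharding spreads keys evenly; range sharding with a poorly chosen split point can overload one shard, which is the kind of imbalance the Foursquare incident describes:

```python
import hashlib

NUM_SHARDS = 2

def hash_shard(user_id: str) -> int:
    # Hash-based: even spread, but range scans must touch every shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard(user_id: str, split: str) -> int:
    # Range-based: keeps adjacent keys together, but a split point
    # chosen for yesterday's data can skew badly as data grows.
    return 0 if user_id < split else 1

users = [f"user{i:05d}" for i in range(10_000)]

hash_counts = [0, 0]
for u in users:
    hash_counts[hash_shard(u)] += 1

range_counts = [0, 0]
for u in users:
    range_counts[range_shard(u, split="user08000")] += 1

print("hash-based: ", hash_counts)   # roughly even
print("range-based:", range_counts)  # 8000 vs 2000: one shard overloaded
```

Rebalancing is the hard part in both schemes: with ranges you can move a split point (but must migrate the keys it covers), while with a bare `% NUM_SHARDS` hash, changing the shard count remaps almost every key, which is why consistent hashing or pre-split chunks are usually preferred.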


Nice example. I have personally seen Hadoop jobs which would only scale by using larger VM instance types.


What are Hadoop jobs? MapReduce? Does anyone still use those, and why?


Why not? Using commodity servers for long-running compute tasks doesn't seem like a trend that's fading away anytime soon.

There are more solutions than M/R now, though (Spark, Presto, etc.), but they require high-performance servers (Presto's suggested RAM amount is 128GB).


Can you name a single type of job that MapReduce does better (faster, or using fewer resources) than Spark? In my experience, even for the simplest tasks (just reading the data, changing it slightly without shuffling, and writing it back), Spark is faster than MapReduce under the same resource limits, and it's much more efficient for heavy jobs with joins and the like. And of course there are other things, like the API, which in MapReduce is a nightmare to deal with compared to Spark.



