> Users, written to the database, did not distribute evenly over the two shards. One shard grew to 67GB, larger than RAM, and the other to 50GB
Main pro-tip: think very hard about how you're sharding your data: which key you shard on, whether the sharding is hash-based or range-based, and how you intend to rebalance or get more granular should the need arise.
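To make the hash-based vs range-based trade-off concrete, here's a minimal sketch (plain Python, hypothetical function names, not any real sharding API): hashing spreads keys roughly evenly, while a range split with a poorly chosen boundary can dump every key onto one shard, which is essentially the skew described in the linked post.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 2

def shard_for_hash(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based: keys spread ~evenly, but range scans hit every shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def shard_for_range(key: str, boundaries=("m",)) -> int:
    """Range-based: contiguous key ranges stay together, but a skewed
    keyspace (here, every key starts with 'u') piles onto one shard."""
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

keys = [f"user{n:04d}" for n in range(1000)]

print(Counter(shard_for_hash(k) for k in keys))   # roughly even split
print(Counter(shard_for_range(k) for k in keys))  # everything lands on shard 1
```

The boundary value "m" here is an illustrative bad choice; the point is that rebalancing a range-based scheme means picking new boundaries and physically moving data, whereas hash-based schemes avoid hot ranges but give up cheap range queries.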
Why not? Using commodity servers to run long-running compute tasks doesn't seem like a trend that's fading away anytime soon.
There are more solutions than M/R now, though (Spark, Presto, etc.), but they require high-performance servers (Presto's suggested RAM amount is 128GB).
Can you name a single type of job that MapReduce can do better (faster, or using fewer resources) than Spark?
In my experience, even for the simplest tasks, like when you just need to read the data, change it slightly without shuffling, and write it back, Spark is faster than MapReduce under the same resource limits, and it's much more efficient for heavy jobs with joins etc.
And of course there are other things too, like the API, which in MapReduce is a nightmare to deal with compared to Spark.
so I opened the link
http://highscalability.com/blog/2010/10/15/troubles-with-sha...
> What Happened? The problem went something like: Foursquare uses MongoDB to store user data on two EC2 nodes, each of which has 66GB of RAM
There's your problem. Using Mongo sharding in production in 2010.