Isn't the MapReduce sort N log N in the number of entries? That would make (1e13 log 1e13) / (1e10 log 1e10) * 68s / 4 = 6.1h. So they were actually faster than the possible scaling? Doesn't sound right...
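Spelling that estimate out as a quick sketch (the entry counts and the 4x hardware factor are just the numbers from the comment above):

```python
import math

# N log N scaling from the 68 s / 1 TB run to the 1 PB run,
# divided by 4 for the extra hardware.
n_1tb, n_1pb = 1e10, 1e13
t_1tb = 68.0           # seconds for the 1 TB sort
hardware_ratio = 4.0

ratio = (n_1pb * math.log(n_1pb)) / (n_1tb * math.log(n_1tb))   # ~1300
print(f"{t_1tb * ratio / hardware_ratio / 3600:.1f} h")          # ~6.1 h
```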
I was assuming that the sort was actually a partial sort - returning the top "x" (admittedly, even then it isn't strictly linear). However, I was also assuming x << n, which would almost certainly be the case with a large data set. It's a fairly common approach, but I'm not sure why I automatically assumed it.
If the sort is complete, then yeah, that changes things.
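For what it's worth, the usual top-x trick is a bounded min-heap, which is O(N log x) and so effectively linear in N when x << N. A minimal sketch (the function name is mine):

```python
import heapq

def top_x(stream, x):
    """Return the x largest items from an iterable.

    Keeps a min-heap of size x, so the cost is O(N log x) rather than
    the O(N log N) of a full sort -- effectively linear when x << N.
    """
    heap = []
    for item in stream:
        if len(heap) < x:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)

# Equivalent to heapq.nlargest(x, stream).
```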
Sounds dead on to me. The typical sort is O(N log N). The constant factor can be slightly different between the 1TB runs and the 1PB runs, e.g. slightly different latency or a different pattern of disk failures. Every sort is sensitive to the data distribution. There is probably an interesting distribution of results across the runs.
We were writing it to 48,000 hard drives (we did not use the full capacity of these disks, though), and every time we ran our sort, at least one of our disks managed to break (this is not surprising at all given the duration of the test, the number of disks involved, and the expected lifetime of hard disks).
I'm a bit at a loss here. Does that mean that one in ~50K hard disks fails within six hours of usage?
1 failure per 50K in 6 hours is the same as 1 per 34 in a year, or a 3% annual failure rate. I think this is the same number Google reported in their disk failure paper.
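The conversion, spelled out (rounding the 48,000 drives to 50K and using ~8,766 hours per year):

```python
# 1 failure per ~50,000 drives in a 6-hour run, converted to an
# annual failure rate (AFR).
drives = 50_000
run_hours = 6
hours_per_year = 8_766  # 365.25 days

runs_per_year = hours_per_year / run_hours          # ~1,461
drive_years_per_failure = drives / runs_per_year    # ~34 drive-years
afr = 1 / drive_years_per_failure                   # ~2.9%, i.e. ~3%
print(f"1 failure per ~{drive_years_per_failure:.0f} drive-years "
      f"=> AFR ~{afr:.1%}")
```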
Actually, I did this math before I posted my comment.
What I don't understand is this:
With an x% annual failure rate, are the failures uniformly distributed over time? I would expect not - more likely a distribution where there are few to no failures in the beginning, and the failures keep increasing as the drives age.
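To put a rough number on why the shape matters: calibrate a constant-hazard (exponential) model and an increasing-hazard Weibull (shape k=2, an arbitrary choice of mine, purely illustrative) to the same ~3% first-year failure probability, and a 6-hour window over 48,000 drives looks very different depending on which one applies:

```python
import math

drives = 48_000
window_years = 6 / 8_766          # a 6-hour run, in years
first_year_fail_prob = 0.03       # ~3% AFR from the thread

# Exponential (constant hazard): drive age doesn't matter.
lam = -math.log(1 - first_year_fail_prob)            # per year
p_exp = 1 - math.exp(-lam * window_years)

# Weibull with shape k=2 (increasing hazard), calibrated to the same
# 3% first-year probability -- brand-new drives almost never fail early.
k = 2.0
scale = (1 / lam) ** (1 / k)      # so that (1 year / scale)^k = lam
p_wbl = 1 - math.exp(-(window_years / scale) ** k)

print(f"expected failures, constant hazard:        {drives * p_exp:.2f}")  # ~1
print(f"expected failures, all-new drives (k=2):   {drives * p_wbl:.4f}")  # ~0
```

Under a constant hazard you'd expect roughly one failure per 6-hour run, which matches what the post describes; 48,000 brand-new drives with an increasing hazard would almost never fail that early.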
It's probable that they didn't use 48,000 brand new hard drives, but reused whatever they already had in their datacenters. Which would have a mixture of different drive ages, I guess.
But the fact that the article specifically mentions drive failures, and that they use some sort of RAID/replication to cope with them, makes me wonder if they were using old disks.
It's probably easier to get all new drives and use them, isn't it?
On second thought though, it might be much easier to tell GFS to make an extra mirror than to make sure that all the disks are brand new.
So, I buy your line of thinking :)
Speaking of electricity, at one point Google claimed that during a Web search your PC uses more energy sending the request, waiting, and displaying the results than Google uses to perform the search. So the next time you need to sort 1PB of data it may be cheaper to send it to Google. :-)
(68 sec * 1024) / 60 / 60 / 4 --- 68 sec for 1TB, x1024 to scale up to 1PB, /60/60 to get hours, and /4 as there are 4x as many computers....
So ideally, taking the 1TB sort to 1PB on 4x the hardware would take about 4.8 hrs.
Google's 1PB sort took 6 hours - so that's fairly close to linear. Impressive.
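The same arithmetic as a snippet, with the observed time alongside (1024 TB/PB is the binary convention; use 1000 if you prefer decimal):

```python
t_1tb = 68.0           # seconds for the 1 TB sort
scale = 1024           # TB per PB (binary)
hardware_ratio = 4.0   # the 1 PB run used ~4x as many machines

t_linear = t_1tb * scale / hardware_ratio
print(f"linear extrapolation: {t_linear / 3600:.1f} h")   # ~4.8 h
print("observed:             ~6 h")
```

The gap between ~4.8 h and ~6 h is roughly the extra log factor from the N log N estimate earlier in the thread.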