Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

As the author notes, that's mainly due to the (according to him, but also the only explanation I can think of) lookup table implementation.

Something like this should be done in a hash table for constant lookup regardless of the number of entries (as entries are not being added "in real time" non-stop, the cost of bucket resizing shouldn't be too great) or with a trie for the best storage properties for many URLs for the same domain (the case author notes). If done right (correct hashing algorithm, good implementation, decent collision handling for the hash table; or any decently-performing in terms of space/time for the trie) this shouldn't be a problem.



It's always going to be a problem; even the best hash tables of this size will suffer performance problems as they age. As the table increases in size, less of it will fit in memory/various caches, and more of it will involve (expensive, potentially numerous) disk seeks.

Once you take caches into account, hash tables are not ammortized O(1). Having said that, those graphs show delays of well over 100 milliseconds, and that sounds excessive. It's possible the delay is primarily due to a poor implementation and not so much due to inherent limitations.


For something like link coloring over a long history, a Bloom filter would seem to be ideal for reducing the number of true hash table lookups you'd need per page.


Large bloom filters have exactly the same problem. And since they're fixed size, you'd need a potentially huge bloom filter to avoid huge numbers of false positives; more likely you'd need to periodically regenerate it based on the original data.

This is a really tricky optimization because on a positive hit you've introduced more random I/O! After all, you've got the bloom filter and then the hash table lookup. False positives are also bad - so you only save something on true negatives. Is it worth it? Only if you get the tuning just right.


Thank you so much on for this comment. I'm currently teaching a lesson on hash tables and this happens to be a great example of their real-world applications and implications.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: