Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Show HN: BoomFilters – Probabilistic data structures for processing streams (github.com/tylertreat)
70 points by tylertreat on April 11, 2015 | hide | past | favorite | 13 comments


// The InverseBloomFilter may report a false negative but can never report a false positive.

This is actually not true and not possible. There is a good explanation here: http://cstheory.stackexchange.com/a/14455/43

There will be a low probability of false positives ~ 2^32*k. (Also it's not an inverse-bloom filter, it's a cache).


Well, I'm looking at the code and it looks like it doesn't report false positives. I understand that you expect it to hash the content down and store only the hash, but it doesn't do that.

EDIT: I suppose my point was just that there is a difference between "I think you chose a bad name for your class" and "This is actually not true and not possible". It seems both true and possible (and arguably inappropriately named).


Then it doesn't do it in constant space. Constant space is the property that makes it not a true inverse bloom filter, and instead is just a cache with an interesting hashed based eviction method.

That said, this is just a silly definition issue. You can provide a data structure with no false positives, and some false negatives, but you can't do it in constant space. I'm not aware of if you can absolutely bound the rate of false negatives.


I guess it's storing the pointer to the data? That sounds pretty bad.

At any-rate it's one of two things, either you store the whole item or you store a hash. If you're storing the whole item it's not a filter and you don't get constant space guarantees as you mentioned. If you store a hash - it's possible to have false positives.

You can always get a bound for false-negatives (or none at all) same as it with any hash-table. If you wanted to, you could Cuckoo hash before eviction and you should be able to get a pretty-good theoretical bound for false-negatives (http://infoweekly.blogspot.com/2010/02/cuckoo-hashing.html).


His answer hints at an interesting solution: For a small element store it. For a big element do nothing.


I have a hash thing I'm working on that spills well over the Ln caches, so hitting the alternative buckets (its a linear prober) is a non-starter. Turns out the probabilities for an item ending up in an alternative bucket is in the 0.0n% range.

3 bits end up doing what would require a multiword bloom filter with 30 k hashes.



If it's a cache, there is no chance of false positives.


That InverseBloomFilter sounds like an LRU cache of hashes. A true inverse bloom filter can be proven to be impossible, sadly.


With enough data, a traditional Bloom filter "fills up", after which it has a false-positive probability of 1.

I don't think this correct. The false-positive probability wouldn't become 1, it's just that every query would return positive, regardless if it's a true or false positive. I think what you meant to say was that the positive probability is 1, not the false-positive probability.


Why does the readme use Boom and Bloom interchangeably? Is it just a typo that for some reason is randomly scattered throughout the document (and in the repo name itself)? Or is there meant to be a difference? I don't see anywhere that the difference is explained, if it's intentional.


Looks consistently used to me. The project is a collection of stream-optimized bloom filters and the project or its filters, as a whole, are referred to as "Boom Filters" in comparison to the "Classic Bloom Filters" referenced in the paragraph directly above. Basically, use "bloom" except when referring to this exact source code or library name.

See also: references section at the bottom of the readme for background on bloom filters, examples, etc.


You should checkout Tyler's blog. Good solid content for the concurrency geeks.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: