Why we didn't use a bloom filter (dr-josiah.blogspot.com)
102 points by DrJosiah on March 17, 2012 | hide | past | favorite | 36 comments


"Have you ever used the STL? I did for about 6 months professionally, and I'm glad that I could go back to C and Python. More seriously, object orientation in C++ doesn't come for free. Aside from the pain of the syntax of using C++ templates (and difficult to parse error messages), all of that object orientation hides the complexity of method dispatch in what are known as vtables. If your system has to call them to dispatch, you are wasting time in that processing when you could be doing something else."

This fragment made me WTF, so either I missed something in the C++ development of the last few years, or...

a) How can you make standard-OOP-like method dispatch faster than vtables?

b) Since when does the STL use so many virtual functions anyway? Last time I checked, it avoided any kind of run-time polymorphism, for speed reasons. (And it doesn't really need to: C++ templates are nice tools that can make a function pointer and a function object work in the same place, just because you can stick "()" to the right of the passed parameter and it will compile, with no run-time work needed.)

c) (it seems to be implied, though not explicitly stated). Going faster than STL... in Python?


Right. When I read that paragraph, my thought was that this guy has no clue what he's talking about.

STL is anything but object-oriented; Stepanov is one of the greatest critics of OO. Also, STL goes to great lengths to resolve everything at compile-time through, e.g., tag dispatch. No indirect calls are involved unless you introduce them yourself. Bounds-checking may be enabled in debug-mode in some implementations, but it can always be disabled, and ways of doing so are well-documented.


> a) How can you make standard-OOP-like method dispatch faster than vtables?

The JVM optimizes this by using (P)ICs, which are faster than regular vtables.

http://en.wikipedia.org/wiki/Inline_caching


> This fragment made me WTF

Well, he said he's used the STL professionally for 6 months, so he's a raw beginner by his own admission.


I disagree with the "raw beginner" description. If you use something professionally, as in a day-to-day job, for 6 months, you're not going to be a beginner. It's a standard template library, not some crazy theory requiring a PhD to apply. Unless you're trying to say that the STL is so messed up you can't learn the basics of it in half a year?


In response to c, PyPy has been getting ridiculously fast as of late. In some very specific (i.e. pointless) benchmarks, it can outperform C [1]. And it's only getting faster [2].

[1] http://morepypy.blogspot.com/2011/08/pypy-is-faster-than-c-a...

[2] http://speed.pypy.org/


Every language has got at least one benchmark saying "We're faster than C".


I was about to post that. Someone once joked, "a language isn't mature until it has a benchmark showing that it is faster than C".


It's good to know that, thanks :). I'll look into PyPy if I ever start doing some serious work in Python.


I've updated the article to reflect the reasons why I pointed at vtables, which, as you say, are very likely incorrect.

And no, the article does not say that Python was performing faster than the STL. In the first sentence in the "What I Built" section, "Using C, I wrote a handful of lines of C code that took a sorted sequence of 4-byte integers that represented a Twitter user's followers, and I intersected that sequence against thousands of other similar sequences for other Twitter users." I mention C twice, despite it being awful grammar. I wanted to make it clear up front that this wasn't a "Python vs. X" post.


  a) How can you make standard-OOP-like method dispatch 
  faster than vtables?
He doesn't claim you can: he just points out that it still adds overhead.


It only adds overhead when using virtual member functions, in which case the equivalent in C is to "roll your own" vtables. You don't get away from that overhead.

If you don't use polymorphism, you don't need virtual member functions, in which case you don't pay the cost of a vtable.

This is one of the key tenets of C++: You pay for what you use, and only what you use.


Actually you can do it using template but it is quite convoluted. (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.130....)


Really an unfair article. The premise, title, and bulk have nothing to do with what actually happened.

With his custom software he's solving a completely different problem with a different algorithm that was only possible based on observations unique to his dataset:

I originally started with code very similar to the C++ STL set intersection code and was dissatisfied with the results. By observing that the longer sequences will skip more items than the shorter sequences, I wrote a 2-level loop that skips over items in the longer sequence before performing comparisons with the shorter sequence. That got me roughly a 2x performance advantage over the STL variant.

Let's call this apples. Then he talks about how using C++ STL wouldn't do the trick since he doesn't care about the data, only the count (oranges). And how bloom filters applied naïvely to the raw data would still take longer (zebras).

He's solving a completely different problem, and lambasting perfectly viable technologies for taking longer to solve a completely different problem.

Now obviously there's a good blog post and moral in there: don't solve the wrong problem. Generic algorithms can't be both generic and special at once - they're not always going to give you what you're looking for and only what you're looking for, and that's a cost to be taken into consideration at any time that you're trying to choose a solution. But don't criticize them for being slow, and then in a postscript say "and I don't need the actual results of this algorithm, anyway."


I don't understand your response. Whom or what was he unfair to? What technology was he 'lambasting'?

  But don't criticize them for being slow, and then in a 
  postscript say "and I don't need the actual results of
  this algorithm, anyway."
I don't see him criticising any solution for 'being slow' in general. Only for 'being too slow for this specific problem'.

It's very common to use an (already available) algorithm to calculate a result related to what you actually want, and then infer the answer from that result. E.g. calculate the intersection of two sets and then count the members, instead of intersecting and counting simultaneously and never having the intersection available. It's usually wise to use the first approach: the performance penalty is acceptable, and the advantage of code reuse (and not having to write and test a new algorithm) is larger. His point was that in this case, it wasn't.


I didn't find it unfair at all. Comparing some general-purpose approaches to a more specialized one is not an apples-and-oranges comparison. It's the stuff that many orders-of-magnitude-type optimizations are made of.

Also, keeping cache friendliness in mind in the design and choice of datastructures and algorithms is a good idea, which is the main point of his article.

What he says about vtables is fluffy though. The STL doesn't use polymorphism much.


The way I understand it, the article was not meant as a general criticism of bloom filters, but as a reply to some people commenting that he should have used bloom filters when he described his custom algorithm in an earlier article.


It isn't undue or unfair criticism to say "I looked at these solutions to my problem, here are the issues with those solutions with respect to my problems".

If I have been using a screwdriver and deck screws to attach things to my project, and some guy says "hey, why don't you just use duct tape", and I explain that duct tape is very useful, but in this case it's ugly and doesn't hold stuff on right, am I being unfair to duct tape? What about some guy saying "just use a hammer and nails"? If I mention that nails are a decent solution, but screws solve a problem introduced by the nails, am I being unfair to hammers?

In both cases, the answer is "No". When explaining technical choices, there must be some discussion of why solutions are better or worse. When one of the criteria is speed, calling a rejected solution too slow is not unfair, just fact. It doesn't even speak to other applicability of the solution. Heck, in this case he even explicitly states he likes those tools, meaning that pointing to specific case where they aren't good is probably the opposite of "unfair".


Let's say that you need to solve problem X. So you solve it with solution Z.

People then come by later and say, "you should use A", or "B is better in that case", or even "C solves everything".

The article was pointing out why A, B, or C would not solve problem X as well as solution Z. A, B, and C are generic solutions for similar problems, but Z was purpose built for X.

Also, at no point did I say that the algorithms were slow, only not as fast as what we needed.


I actually like both this and the "1000 fold performance improvement" articles. There is a problem, and then there is a solution, which works, and is faster in a spectacular way.

However, both articles mention that a C++/STL solution would bring in a vtable runtime penalty. This is just not true - quite the opposite: using C functions a la qsort(3) requires you to explicitly pass in a pointer to the comparison function. Now: this is the vtable approach, but in plain sight. A C++ version would quite likely not call any functions via pointers.

I prepared a small test which shows that in simple cases C++ can actually be faster than C by a factor of up to 3: http://radiospiel.org/sorting-in-c-3-times-faster-than-c


Interesting article, but I find it hard to believe that STL code would use virtual methods. Certainly it makes sense to use a custom algorithm when only a count is required, but was the claim about vtables actually tested?

Also, if speed is so important, why use qsort? It requires an indirect function call for every comparison; C++ sort or any inline implementation is faster.


The impression I got is that the overall algorithm requires O(n) qsorts and O(n^2) intersections, so the speed of the sorting is not important.


The presence of the "generic C++ hater" type comment about vtables, when talking about STL algorithms, was a huge red flag making me doubt the guy's argument, for the reasons you gave.


The intersection problem they are solving is pretty trivial. I don't know why you would even consider using a bloom filter for it, that's not exactly the best domain for it.

Also, I would have just used the C++ set_intersection function. It seems unlikely that the 2x speedup matters since they already got it down from 7 seconds to 6ms.


With the work we were performing, the 2x improvement may not have been necessary, but it was useful. We could run everything on 1 box and keep up with roughly 30-50% system utilization continuously. Had we used the STL version, we would have needed another machine, and may not have been able to use mmaps to share data between cores.


Performing the intersection of sorted lists is a well studied problem in information retrieval, so nothing new or surprising here...

The performance can be improved even further by, for example, using a compressed integer list with an embedded skip list, or by using compressed bitmaps. Using sorted int32 lists is the naive solution :-)


It's not that simple. Those solutions often require certain types of distributions to keep lookup easy, for example to use PFOR/PFOR-Delta-style compression. If the compression isn't 1-to-1 in some way, lookup becomes a lot harder.


Is this guy serious or joking?

"Have you ever used the STL? I did for about 6 months professionally, and I'm glad that I could go back to C and Python. More seriously, object orientation in C++ doesn't come for free. Aside from the pain of the syntax of using C++ templates (and difficult to parse error messages), all of that object orientation hides the complexity of method dispatch in what are known as vtables."

Seriously?

1. Python is faster than C++?

2. All OO features (method invocations) in C++ use the vtable?

IMHO, just using the right iterator abstraction together with the STL algorithms should keep you within performance limits.


"Using C, I wrote a handful of lines of C code that took a sorted sequence of 4-byte integers that represented a Twitter user's followers, and I intersected that sequence against thousands of other similar sequences for other Twitter users."

In this case, a custom C algorithm is very fast. I didn't implement the C++ version, but I theorized that the C++ version would be slower because I'd already implemented the C++ algorithm first, and it was slower than what I ended up with.

I've also updated the article to remove the vtable stuff.


It doesn't say Python is faster than C++, just that it's easier to write.


Not strictly related, but in Redis, intersections between sets encoded as arrays of fixed-length 16-, 32-, or 64-bit numbers (we use this encoding for small sets) can be performed very fast because they are ordered: you can simply take two pointers into the two sets and advance them, keeping only the common elements.

Of course there is a tension between fast intersections (which require a sorted data structure) and O(1) existence tests of elements. It depends on what you are trying to accomplish.


But it's only for limited sized sets, 512 entries with the default configuration. Our sets were millions of entries.


Sure, but we may implement this sooner or later so that small sets will be intersected very fast.


With caching, 512 intset entries, and 64 bit intset values, that's under 1 microsecond to intersect using the naive binary search algorithm for intersection.

I don't believe that small sets are an issue for performance.

For me, the real question is whether there are ways of getting good performance and lower memory overhead across the entire range of object sizes in Redis.


Talks about nanoseconds and ultra-fast code... but doesn't say a word about cache locality?


You may want to reread the article. One of his main points is how his custom algorithm accesses the dataset sequentially and thereby massively outperforms bloom filters which need random access.



