
Of course it raises the question: if the keys are sorted, what do we need an index for? A simple halving method (binary search) would trivially do it, with B-tree-like performance and an infinitely better index size (zero).
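Roughly what that halving looks like over the sorted key array (a sketch; the key type and lower-bound semantics are my assumptions, not from the article):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Plain "halving" over the sorted keys: zero extra index space,
    // ~log2(n) probes. Returns the position of the first element >= key.
    std::size_t lower_bound_probe(const std::vector<uint64_t>& keys, uint64_t key) {
        std::size_t lo = 0, hi = keys.size();
        while (lo < hi) {
            std::size_t mid = lo + (hi - lo) / 2;  // midpoint without overflow
            if (keys[mid] < key)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;  // each iteration may touch a cold cache line or page
    }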

Maybe they have made an improvement here, trading some space for even better lookup times? In that case, the 83x space improvement over B-tree indexes is certainly possible - given that an infinite improvement is possible too.



Binary search is quite slow on modern hardware, particularly for this use, where you would need to fault in a page for each probe. With a billion records, that is about 30 probes. They get much better than log2(n).

This is a lot like how you look up words in the dictionary. It is roughly radix-like, but the closer you get, the better the fit. If you are looking up a word that starts with S, you adjust for how broad the S section is as you home in.
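If I read the analogy right, that is essentially interpolation search: guess a position from where the key sits between the current bounds, then refine. A rough sketch (assumes a non-empty, sorted, roughly uniformly distributed key array; this illustrates the idea, not the article's actual method):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Interpolation search: estimate the position from where 'key' sits
    // between the bounds, then shrink the bounds.
    std::size_t interpolation_search(const std::vector<uint64_t>& keys, uint64_t key) {
        std::size_t lo = 0, hi = keys.size() - 1;
        while (lo < hi && key >= keys[lo] && key <= keys[hi]) {
            if (keys[hi] == keys[lo]) break;  // all keys equal in this range
            // Linear estimate of where 'key' falls between the two bounds.
            std::size_t pos = lo + static_cast<std::size_t>(
                static_cast<double>(key - keys[lo]) /
                static_cast<double>(keys[hi] - keys[lo]) * (hi - lo));
            if (keys[pos] == key) return pos;
            if (keys[pos] < key)  lo = pos + 1;
            else                  hi = pos - 1;  // safe: pos > lo here
        }
        return lo;  // nearest position if the key itself is absent
    }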


This is on in-memory data, so binary search would seem reasonable, except that you can't do inserts, updates, or deletes efficiently in an ordered array. That inevitably leads to using a B-tree or trie structure.


"In-memory" doesn't mean so much as it once did. Faulting a page into cache from RAM takes hundreds or even thousands of cycles. Same for faulting an index page, if the whole index doesn't fit in cache.

A B-tree replaces a log2(N) binary search over the whole range with a K-ary search taking log-base-K(N) probes, but adds a search through up to K keys per tree node, all of which have to be brought into cache.
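Rough numbers for that trade-off, assuming a fanout of K = 256 (my assumption, just a typical order of magnitude):

    #include <cmath>
    #include <cstdio>

    // Worked numbers for the comparison above; K = 256 is only an assumed fanout.
    int main() {
        double n = 1e9, k = 256;
        double binary_probes = std::log2(n);                 // ~30 cold probes
        double btree_levels  = std::log2(n) / std::log2(k);  // ~3.7 node fetches
        std::printf("binary search: ~%.0f probes; B-tree: ~%.1f levels, "
                    "plus a search over up to %.0f keys inside each node\n",
                    binary_probes, btree_levels, k);
    }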

Even once the K keys have been brought into cache, a binary search in the index page is quite slow on modern hardware because the iterations involve randomly mispredicted branches.
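One common mitigation is a branch-free search within the node, where the comparison feeds a conditional select that compilers typically lower to a conditional move, so there is no unpredictable branch per iteration. A sketch (assumes n >= 1; not from the article):

    #include <cstddef>
    #include <cstdint>

    // Branch-free lower bound over one node's keys.
    std::size_t branchless_lower_bound(const uint64_t* keys, std::size_t n, uint64_t key) {
        const uint64_t* base = keys;
        while (n > 1) {
            std::size_t half = n / 2;
            base = (base[half] < key) ? base + half : base;  // select, not branch
            n -= half;
        }
        return static_cast<std::size_t>(base - keys) + (base[0] < key ? 1 : 0);
    }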

A great advantage of PGM indexes seems to be that the whole index can be held in (L3) cache. Faulting a line from L3 to L1 cache takes only ~30 cycles. Once the index has been walked, you are close to the desired record and can walk forward or back a few steps to locate it.
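A stripped-down sketch of that lookup pattern, with a single linear segment standing in for the PGM index's many piecewise segments (the struct, field names, and epsilon handling are my illustration, not the paper's API; assumes a non-empty key array):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One linear segment standing in for the index: it predicts a position,
    // and the build-time error bound epsilon limits how far the prediction
    // can be off, so only a small window of the data is searched.
    struct LinearSegment {
        double slope;
        double intercept;
        std::size_t epsilon;  // max prediction error, guaranteed when the segment is fit
    };

    std::size_t pgm_style_lookup(const std::vector<uint64_t>& keys,
                                 const LinearSegment& seg, uint64_t key) {
        double guess = seg.slope * static_cast<double>(key) + seg.intercept;
        std::size_t pos = guess < 0 ? 0 : static_cast<std::size_t>(guess);
        if (pos >= keys.size()) pos = keys.size() - 1;
        // The true position lies within +/- epsilon of the prediction.
        std::size_t lo = pos > seg.epsilon ? pos - seg.epsilon : 0;
        std::size_t hi = std::min(pos + seg.epsilon + 1, keys.size());
        return static_cast<std::size_t>(
            std::lower_bound(keys.begin() + lo, keys.begin() + hi, key) - keys.begin());
    }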

If you have to handle insertions, it is often better to keep an overflow table that is searched first (or last), so you can batch-rebuild during downtime. Deletions may be handled by marking dead entries, and updates are easy. Most multi-gigabyte tables don't see much churn relative to their size.
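A sketch of that layout as I understand it (container choices and names are mine, not from the article): inserts and deletes go to a small overflow structure that is checked first, and a batch rebuild folds it back into the big sorted array.

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <vector>

    // Big static sorted array plus a small overflow map that is checked first.
    // Inserts and deletes only touch the overflow; a batch rebuild folds it in.
    struct StaticWithOverflow {
        std::vector<uint64_t> main_keys;     // large, sorted, rebuilt during downtime
        std::map<uint64_t, bool> overflow;   // recent changes; false = tombstoned delete

        void insert(uint64_t key) { overflow[key] = true; }
        void erase(uint64_t key)  { overflow[key] = false; }  // mark dead, don't move data

        bool contains(uint64_t key) const {
            auto it = overflow.find(key);
            if (it != overflow.end()) return it->second;  // overflow overrides the array
            return std::binary_search(main_keys.begin(), main_keys.end(), key);
        }

        // Batch rebuild: merge live overflow keys in, drop tombstoned ones.
        void rebuild() {
            std::vector<uint64_t> merged;
            merged.reserve(main_keys.size() + overflow.size());
            for (uint64_t k : main_keys) {
                auto it = overflow.find(k);
                if (it == overflow.end() || it->second) merged.push_back(k);
            }
            for (const auto& [k, live] : overflow)
                if (live && !std::binary_search(main_keys.begin(), main_keys.end(), k))
                    merged.push_back(k);
            std::sort(merged.begin(), merged.end());
            main_keys = std::move(merged);
            overflow.clear();
        }
    };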


Thanks, that's a very good point!



