Show HN: Fast Rolling Quantiles for Python

alexhutcheson · on March 2, 2021

This is pretty cool. The title would be a bit more descriptive if it were “Fast Rolling Quantile Filters for Python”, since the high-pass/low-pass filter functionality seems to be the focus.

The README mentions it uses binary heaps - if you’re willing to accept some (bounded) approximation, then it should be possible to reduce memory usage and somewhat reduce runtime by using a sketching data structure like Dunning’s t-digest: https://github.com/tdunning/t-digest/blob/main/docs/t-digest....

There is an open source Python implementation, although I haven’t used it and can’t vouch for its quality: https://github.com/CamDavidsonPilon/tdigest

cb321 · on March 2, 2021

You can also just get/use exact answers with the `accumulation_tree` package that Python `tdigest` impl uses, though for large windows a B-tree is faster than a red-black ranked/instrumented binary tree. Either tree also scales better than heap pairs to multiple quantiles per moving window (like 5%, 50%, 95%). Python's `blist` package may not have the exact API you need to build from, but it is very close. (`blist` also does not optimize for the "histogramming case" where there can be many duplicate values, IIRC. Instead each value gets its own independent slot in the data structure.)

myrl · on March 2, 2021

Thanks, I'll be sure to check these out! It's a really neat idea behind that paper.

aldanor · on March 2, 2021

Might be worth mentioning: `bottleneck` package, implemented in C, with various moving window functions - https://bottleneck.readthedocs.io/en/latest/reference.html#m...

SatvikBeri · on March 2, 2021

Awesome! Rolling quantiles were a major bottleneck in my work a couple years ago, so much so that I ended up removing them and using approximations instead. It'll be good to try this if I ever go back to that code.

AyrtonB · on March 2, 2021

What's the advantage of this over Pandas' standard rolling quantiles?

import numpy as np

import pandas as pd

s_input = pd.Series(np.random.randn(1000))

s_p10 = s_input.rolling(10).quantile(0.1)

nvilcins · on March 2, 2021

See the "benchmarks" section: https://github.com/marmarelis/rolling-quantiles#benchmarking...

xiphias2 · on March 2, 2021

Wouldn’t it make more sense to speed up the current APIs used by lots of programs over introducing a new API?

myrl · on March 2, 2021

Perhaps, but as far as I'm aware, none of them are as flexible as my interface (to support e.g. streaming, pipelining.) I have not looked at all of pandas' time-series functionality that closely.

xiphias2 · on March 2, 2021

I see, it's interesting that ypu wrote it all in C, pandas in Python