
Floating points compress poorly, though.


Jake from TileDB, Inc. here. wenc and srean are right. Techniques such as those used in zfp and fpzip, which wenc mentioned, are also used to compress real-world LAS file (point cloud) datasets. For the moment we are only focused on lossless compression (scientists are paranoid about losing data), but there is definitely room to explore integration with lossy compression as well. Machine learning applications often do not need full precision, so intelligent forms of lossy compression are useful.

Another cool research application of TileDB that extends the storage library with the VP9 codec can be seen here: https://homes.cs.washington.edu/~magda/papers/haynes-sigmod1...


You can get very good lossless compression with floating-point numbers; Facebook's Gorilla paper comes to mind. I usually use it for delta-of-delta encoding, which provides very high compression for time series. While that won't really help in your case, their floating-point encoding could help compress matrices quite efficiently.

http://www.vldb.org/pvldb/vol8/p1816-teller.pdf [page 5]
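Roughly, the value encoding works by XORing each double's bit pattern with the previous one; slowly changing series then yield mostly-zero results that pack down well. A toy sketch of just that preprocessing step (the real format packs leading/trailing zero counts at bit granularity, which is omitted here, and the function name is mine):

    import struct

    def xor_deltas(values):
        """XOR each double's raw bit pattern with its predecessor's.

        Slowly changing series produce results full of zero bits, which a
        bit-level packer (as in the Gorilla paper) can store very compactly.
        """
        prev = 0
        out = []
        for v in values:
            bits = struct.unpack("<Q", struct.pack("<d", v))[0]  # raw 64-bit pattern
            out.append(bits ^ prev)
            prev = bits
        return out

    # A slowly drifting reading: after the first value, note the long runs
    # of leading zeros in the XORed bit patterns.
    series = [20.0 + 0.001 * i for i in range(5)]
    for x in xor_deltas(series):
        print(f"{x:064b}")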


There have been a few responses along the lines of 'not always', but what you say is indeed largely true.

There is another thing worth considering: the algorithms (and even the theory) that work well for compression of discrete sources are not well suited for compressing real numbers (floating-point numbers aren't reals, but they are the poor man's reals). On the theory side, this bothered Claude Shannon enough that he decided to revisit it later in his career and create rate-distortion theory; he knew there was some unfinished business in information theory.

We do have sort of a chicken-and-egg problem here, especially when we want to store a lot of floating-point data for an ML workload. Learning how to compress and learning the underlying distribution are equivalent problems. If we have already learned the model, then yes, we could compress the data well. But when we haven't, then by definition we don't have the knowledge to do a good job of storing the data in a well-compressed form. And after we have acquired the knowledge to compress well, we don't really need the compressed data anymore to learn the model; we already have it. One way to address this would be to do both incrementally and simultaneously.
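To make the equivalence concrete: an ideal entropy coder spends -log2 p(x) bits on a value, where p is the probability the model assigns to it, so a better-fitting model directly means fewer bits. A toy numerical illustration (the Gaussian model, the quantization resolution, and the function name are made up for the example; an actual coder such as an arithmetic coder would be needed to realize these lengths):

    import math
    import random

    def gaussian_bits(x, mu, sigma, resolution=1e-3):
        """Approximate bits to encode x (quantized to `resolution`) under N(mu, sigma)."""
        p = (resolution / (sigma * math.sqrt(2 * math.pi))) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
        return -math.log2(p)

    random.seed(0)
    data = [random.gauss(5.0, 0.1) for _ in range(10_000)]

    # The better the model matches the true distribution, the shorter the code.
    well_fit = sum(gaussian_bits(x, 5.0, 0.1) for x in data) / len(data)
    poorly_fit = sum(gaussian_bits(x, 0.0, 10.0) for x in data) / len(data)
    print(f"bits/value, well-fit model:   {well_fit:.1f}")
    print(f"bits/value, poorly-fit model: {poorly_fit:.1f}")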


Not always [1].

Also, in many time-series applications involving sensors whose readings don't fluctuate much, process historians often apply deadband compression (i.e., a new sample is only stored if it falls outside a certain band around the last stored value). This type of compression is lossy and sometimes a bit controversial for high-fidelity uses, but it often results in efficient storage.
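A minimal sketch of that idea (the function name and the `band` threshold are just for illustration; real historians typically use fancier variants such as swinging-door compression):

    def deadband_compress(samples, band):
        """Store a (timestamp, value) pair only when the value moves more than
        `band` away from the last stored value (lossy, as noted above)."""
        stored = []
        last_kept = None
        for t, v in samples:
            if last_kept is None or abs(v - last_kept) > band:
                stored.append((t, v))
                last_kept = v
        return stored

    # A sensor that barely fluctuates: almost every sample gets dropped.
    readings = [(t, 100.0 + 0.01 * (t % 3)) for t in range(20)]
    print(deadband_compress(readings, band=0.05))  # -> [(0, 100.0)]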

[1] zfp, fpzip: https://computation.llnl.gov/projects/floating-point-compres...


Not always poorly -- especially if you preprocess by calculating running diffs of consecutive values.
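A quick sketch of that preprocessing (delta-encode, then hand the bytes to a generic compressor; zlib is just a convenient stand-in here):

    import struct
    import zlib

    def delta_encode(values):
        """Replace each value with its difference from the previous value."""
        prev = 0.0
        out = []
        for v in values:
            out.append(v - prev)
            prev = v
        return out

    # A steadily increasing series: the raw doubles are all distinct byte
    # patterns, but the deltas are (nearly) constant and compress far better.
    series = [0.5 * i for i in range(100_000)]
    raw = struct.pack(f"<{len(series)}d", *series)
    deltas = struct.pack(f"<{len(series)}d", *delta_encode(series))
    print("raw bytes compressed:  ", len(zlib.compress(raw)))
    print("delta bytes compressed:", len(zlib.compress(deltas)))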


> running diffs of consecutive values

https://en.wikipedia.org/wiki/Delta_encoding



