Third, faster and cheaper storage devices mean that it is better to use faster decoding schemes to reduce computation costs than to pursue more aggressive compression to save I/O bandwidth. Formats should not apply general-purpose block compression by default because the bandwidth savings do not justify the decompression overhead.
I'm not sure I agree with that. I have a situation right now where I'm bottlenecked by I/O and not compute.
My point is nearly the opposite. Data formats should apply lightweight compression, such as LZ4, by default, because it can be beneficial even when the data is read from RAM.
Last I checked, you can't get much better than 1.5 GB/s per core with LZ4 (from RAM), with a maximum ratio of less than 3:1, and multicore decompression is not really possible unless you manually tweak the compression.
Benchmarks claiming more than that are usually misleading, because they assume no dependence between blocks, which is unrealistic. In real scenarios, blocks need to be parsed, they depend on previous blocks, and you need to carry that context around.
My RAM can deliver close to 20 GB/s and my SSD 7 GB/s, and that is all commodity hardware.
Meaning that unless you have quite slow disks, you're better off without compression.
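If you want to sanity-check the per-core figure yourself, here is a minimal sketch using the lz4 Python package. The synthetic input is my own assumption and compresses more kindly than real data, so treat the output as a ballpark, not a benchmark:

    # Rough single-core check of LZ4 decompression speed and ratio.
    # The synthetic data (half random bytes, half zero runs) is an arbitrary
    # stand-in for "reasonably distributed" data.
    import os
    import time

    import lz4.frame

    chunk = os.urandom(32 * 1024) + bytes(32 * 1024)   # half incompressible, half runs
    data = chunk * 1024                                 # ~64 MiB total

    compressed = lz4.frame.compress(data)
    ratio = len(data) / len(compressed)

    start = time.perf_counter()
    for _ in range(10):
        lz4.frame.decompress(compressed)
    elapsed = time.perf_counter() - start

    gb = 10 * len(data) / 1e9
    print(f"ratio {ratio:.2f}:1, single-core decompression ~{gb / elapsed:.2f} GB/s")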
> you can partition your dataset and process each partition on separate core, which will produce some massive XX or even XXX GB/s?
Yes, but as I mentioned:
> multicore decompression is not really possible unless you manually tweak the compression
That is, there is no stable implementation out there that does it. You would have to do it manually, and painfully. In which case you're opening the door to exotic/niche compression and decompression schemes, and there are better alternatives than LZ4 if you're in that niche.
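To make the partition-per-core idea concrete, here is a minimal sketch of what that manual work looks like. It only decompresses in parallel because each partition is compressed as an independent LZ4 frame up front, which is exactly the layout tweaking I mean; the partition count and data are made up for illustration, and it uses the lz4 Python package rather than any standard file format:

    # Parallel decompression only works here because the partitions are
    # compressed independently; there is no shared context between them.
    from multiprocessing import Pool
    import os

    import lz4.frame

    def make_partitions(data: bytes, n_parts: int) -> list:
        # Compress n_parts slices independently so they can be decoded in parallel.
        step = len(data) // n_parts
        return [lz4.frame.compress(data[i * step:(i + 1) * step]) for i in range(n_parts)]

    def decompress_partition(frame: bytes) -> int:
        # Each worker decodes its own frame; a real reader would mmap the file
        # and hand out offsets instead of pickling bytes across processes.
        return len(lz4.frame.decompress(frame))

    if __name__ == "__main__":
        data = (os.urandom(32 * 1024) + bytes(32 * 1024)) * 1024   # ~64 MiB synthetic input
        partitions = make_partitions(data, n_parts=8)

        with Pool(processes=8) as pool:
            sizes = pool.map(decompress_partition, partitions)

        print(f"decompressed {sum(sizes) / 1e6:.0f} MB across {len(partitions)} partitions")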
> this is obviously depends on your data pattern. If it is some low cardinality IDs, they can be compressed by ratio 100 easily.
Everything is possible in theory, but we have to agree on what a reasonable expectation is. A compression factor of around 3:1 is, in my experience, what you get at a reasonable compression speed on reasonably distributed data.
This is extremely common in genomics settings, and in the past I have spent far more time allocating disk IOPS, network bandwidth, and memory for the various pipeline stages than I have spent on CPUs in this space. Muck up and launch 30x as many processes as your compute node has cores, and it's fairly fixable, but muck up the RAM allocation and disk I/O and you may not be able to fix it in any reasonable time. And if you misallocate your network storage, that can bring the entire cluster to a halt, not just a few nodes.
I think the idea is that you should design tools and pipelines to take advantage of current hardware. Individual nodes have more CPU cores, more RAM, and more and faster local storage than they used to. Instead of launching many small jobs that compete for shared resources, you should have large jobs that run the entire pipeline locally, using network and network storage only when it's unavoidable.
That is exactly right, and optimizing for the current distribution of hardware has always been the goal; however, most interesting problems still do not fit on a single node. For example, large LLMs whose training data, or sometimes even the model itself, does not fit on a single node. Many of the same allocation principles show up again.
You mentioned genomics, and that's a field where problems have not grown much over time. You may have more of them, but individual problems are about the same size as before. Most problems have a natural size that depends on the size of the genome. Genomics tools never really embraced distributed computing, because there was no need for the added complexity.
Sure, a 30x human WGS resequencing analysis has gotten pretty trivial over the past decade, but now we also have thousands or millions of them, plus expression data sets, Hi-C, and so on, and the question of how to combine them. There may not be compute clusters in genomics labs anymore, because funding agencies will only pay for cloud and not hardware, but there are lots of people working on large-scale computation that doesn't fit on a single node.
We actually got a new cluster recently. ~25 nodes with 128 or 192 physical cores, 2 TB RAM, and >10 TB local scratch space each. And most issues arise from the old-school practice of running many small jobs to make scheduling easier. But if you restructure your work to copy the data to local storage, run for a number of hours without accessing the network, and copy the results back to network storage, the issues tend to go away.
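For what it's worth, the restructuring is simple enough that a sketch fits in a few lines. This is a hypothetical stage-in / compute / stage-out wrapper; the paths and the pipeline_tool command are placeholders, and a real job script would also handle cleanup and failures:

    # Stage-in / compute / stage-out pattern: the network is touched only at the edges.
    import shutil
    import subprocess
    from pathlib import Path

    NETWORK_INPUT = Path("/shared/project/sample_data")   # hypothetical network storage
    NETWORK_OUTPUT = Path("/shared/project/results")
    SCRATCH = Path("/local/scratch/job42")                # fast node-local disk

    def run_job() -> None:
        # Stage in: one bulk copy from network storage to local scratch.
        local_input = SCRATCH / "input"
        shutil.copytree(NETWORK_INPUT, local_input)

        # Compute phase: hours of work touching only local storage.
        local_output = SCRATCH / "output"
        local_output.mkdir(parents=True)
        subprocess.run(
            ["pipeline_tool", "--in", str(local_input), "--out", str(local_output)],
            check=True,
        )

        # Stage out: one bulk copy of the results back to network storage.
        shutil.copytree(local_output, NETWORK_OUTPUT / SCRATCH.name)

    if __name__ == "__main__":
        run_job()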
Compared to the cluster I was using a decade ago, individual nodes are an order of magnitude faster, they can run an order of magnitude bigger jobs, and local storage is two orders of magnitude faster. Meanwhile, increases in network bandwidth have been modest. I/O has become cheap relative to compute, while network has become a scarce resource.
I struggle to imagine being bandwidth-limited in this day and age. Kioxia makes some mean SSDs for not a wild price, and a 1U server can fit dozens of these monsters easily.
Nice to see the methodology here. Ideally LanceDB's Lance v2 and Nimble would both be represented as well. It feels like there's a huge appetite to do better than Parquet, and work like this should help inform where we go next.
Nimble (formerly Alpha) is a complicated story. We worked with the Velox team for over a year to open-source and extend it, in collaboration with Meta + CWI + Nvidia + Voltron, but the plans got stymied by legal. We decided to go a separate path because the Nimble code has no spec/docs and is too tightly coupled with Velox/Folly.
Given that, we are working on a new file format. We hope to share our ideas/code later this year.
Honored to have your reply, wonderful view of the scene, thanks Andy.
My 2c, with zero horses in this race: I was surprised by how few encodings were in Nimble at release. The skeleton superficially seemed fine, I guess, but there wasn't much meat on the bones. Without nice, interesting, optimized encodings, the container for them doesn't feel compelling. Then again, starting with a few inarguable, clear options makes some sense too, as a tactic.
They claim they're trying to figure out a path to decoupling from Velox/Folly, so hopefully that can come about. I tend to believe so, godspeed.
The "implementation not specification" does seem really scary though, isn't how we usually get breakout industry-changimg successes.
I wish I had the savvy to contrast Lance (v2) vs. Nimble a little better. Both seem to be containerizing systems, allowing streams to define their own encodings. Your comment about metadata + encodings makes me feel like there are dimensions to the puzzle I haven't identified yet (mostly after chugging through VeloxCon talks).
(Thanks for everything Andy, you're doing the good work (practicing and informing). Very, very excited to see y'all's alternative!!)
You're correct, but for additional context, this paper will actually be presented at VLDB 2024 [0].
> All papers published in this issue will be presented at the 50th International Conference on Very Large Data Bases, Guangzhou, China, 2024.
And that's because of this note in the submission guidelines [1]:
> The last three revision deadlines will be May 15, June 1, and July 15, 2023. Note that the June deadline is on the 1st instead of the 15th, and it is the final revision deadline for consideration to present at VLDB 2023; submissions received after this deadline will roll over to VLDB 2024.
So whether it is (2023) or (2024) is a little ambiguous.