Small File Archives in MinIO (min.io)
57 points by edogrider on Feb 16, 2022 | 24 comments


So this looks a lot like what Hadoop did with .har files[1] on HDFS (e.g. storing GPS tiles on HDFS without blowing through the 1M-files-per-directory limit).

> It is not possible to update individual files inside the ZIP file. Therefore this should only be used for data that isn’t expected to change.

I've actually done file replaces on .zip files on HDFS. Because ZIP files are written with a directory in the footer, you can go to the end and append new data without having to "modify" existing files.

This doesn't conflict with block-level immutability, though the entire write has to be a single commit to avoid leaving the file in a bad state.
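A small illustration of the append behaviour with Python's stdlib zipfile (file names here are just examples):

  import zipfile

  # Mode "a" adds the new member's data after the existing entries and writes a
  # fresh central directory at the new end of the file; already-stored entries
  # are never rewritten (a name collision simply shadows the older entry).
  # Note: zipfile reuses the space of the old directory; on a strictly
  # append-only store you would instead leave the old directory as dead bytes
  # and write data + a new directory after it - readers only trust the
  # directory referenced by the end-of-central-directory record at the very end.
  with zipfile.ZipFile("logs.zip", "a", compression=zipfile.ZIP_DEFLATED) as z:
      z.write("app-2022-02-16.log")              # add a file from disk
      z.writestr("note.txt", "appended later")   # or add bytes directly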

I'd say the best use case for this is storing log files (e.g. if you had fluentd appending to a .zip rather than writing a different object for each 5-minute window).

When it comes to data that compresses well but consists of lots of small objects, ZIP is pretty bad, because each file in the zip is compressed independently with its own dictionary (look inside an .xlsx file to see how MSFT solved that, in a way that makes you hate it: a shared-strings part used across all files).
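A quick way to see the effect (illustrative sketch with the Python stdlib; exact numbers depend on the data, but the per-entry archive typically comes out several times larger):

  import io, zipfile, zlib

  records = [b'{"id": %d, "status": "ok", "region": "us-east-1"}' % i for i in range(1000)]

  # One zip entry per record: every entry is deflated on its own, so the
  # structure shared between records is never exploited.
  buf = io.BytesIO()
  with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
      for i, rec in enumerate(records):
          z.writestr("rec-%d.json" % i, rec)

  # Compressing the concatenation lets one dictionary cover all records.
  single = zlib.compress(b"".join(records))

  print(len(buf.getvalue()), len(single))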

[1] - https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html#H...


> When it comes to stuff which compresses well but full of small objects, ZIP is pretty bad

Correct, but it also allows you to independently access files, which is a win for this use-case. The goal isn't the compression itself, but reducing the number of objects, which in itself reduces the file system block overhead.

Double-checking, it seems files compressed with the zstandard method aren't supported. That would (when enabled) give both better compression and faster decompression. Support should be added shortly.


Backblaze, S3 and R2, listen up: this is a great feature and one which I would immediately use to store related data in one zip file. Not only does this cut down on the number of objects in your storage, it also makes deletion atomic, since you don't have to ensure that all types of objects for the same ID have been deleted.


I have wanted this forever from S3 (cherry-picking from stored object archives) without having to implement reading/caching the zip index and doing a range request. This is an awesome, solid feature.


(dev here) Thanks!

The initial idea came from a customer request: specify ranges as key=x->y;key2=z->w via a header when uploading a file, and then use the key as a substitute for Range on GET.

We chewed on that a bit and found that the zip approach was much more flexible and easier to use, since the file name effectively becomes the key. Either would require client-side changes, so we might as well make it as easy to use as possible.
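From the Python SDK it looks roughly like this (a sketch; the opt-in extract header and the bucket/object names here are illustrative):

  from minio import Minio

  client = Minio("play.min.io", access_key="...", secret_key="...", secure=True)

  # Opt in to zip extraction and address the member as <archive>/<path inside it>.
  resp = client.get_object(
      "mybucket",
      "backups/2022-02-16.zip/logs/app.log",
      request_headers={"x-minio-extract": "true"},
  )
  data = resp.read()
  resp.close()
  resp.release_conn()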


While not ZIP, S3 does now support SELECT over a number of data types. https://docs.aws.amazon.com/AmazonS3/latest/userguide/select...


Correct, but it pretty much forces a scan through the file. For AWS, you pay for scanning the file even if only a few results are returned. Currently this is $0.002 per scanned GB, plus request and transfer fees.

For MinIO you "pay" in CPU time and disk iops to scan the file.

So for retrieving single values it is a very inefficient/expensive method.
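To put a rough number on it: pulling a single row out of a 10 GB object still scans all 10 GB, i.e. 10 × $0.002 = $0.02 in scan fees per lookup, so a million point lookups would be on the order of $20,000 before request and transfer costs.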


Would this approach also work for Tar archives? Transparent support for sub-files from a .tar would be badass.


Tar doesn't use any sort of index like zip does, so to extract the specified file the server side would need to parse through possibly the entire file just to see if the requested file is there, and then start streaming it. Requests for files that aren't in the tar archive would be prohibitively expensive.

There are definitely ways to do it without those problems, though. They just wouldn't be quite as simple as the approach done for supporting zip.


You could pre-index them I suppose. Though even that would only work with a subset of compression methods or no compression.
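A sketch of that pre-indexing for an uncompressed tar (one linear pass; afterwards any member is a single range read):

  import tarfile

  def index_tar(path):
      index = {}
      with tarfile.open(path, "r:") as tar:        # "r:" = plain, uncompressed tar
          for member in tar:
              if member.isfile():
                  # offset_data marks where the member's bytes start in the archive
                  index[member.name] = (member.offset_data, member.size)
      return index

  # index = index_tar("tiles.tar")
  # offset, size = index["0/0/0.pbf"]    # then fetch bytes [offset, offset + size)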


We considered TAR, but indexing requires reading back and decompressing the entire archive.

This may be feasible for small TAR files, and for a single PutObject you could index while uploading. However, for multipart objects parts can arrive in any order, so you are forced to read the data back. This would lead to unpredictable response times.

Compare that to reading the directory of a zip, which even on big files is maybe a couple of megabytes at most.
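For reference, locating that directory only needs the tail of the object; a minimal sketch (ignoring zip64):

  import struct

  def central_directory_range(path, tail=65557):     # 22-byte EOCD + max 64KB comment
      with open(path, "rb") as f:
          f.seek(0, 2)
          size = f.tell()
          f.seek(max(0, size - tail))
          chunk = f.read()
      pos = chunk.rfind(b"PK\x05\x06")               # end-of-central-directory signature
      if pos < 0:
          raise ValueError("no end-of-central-directory record found")
      _, _, _, _, entries, cd_size, cd_offset, _ = struct.unpack("<IHHHHIIH", chunk[pos:pos + 22])
      return entries, cd_offset, cd_size             # directory = bytes [cd_offset, cd_offset + cd_size)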

On top of the indexing cost, tar.gz requires you to decompress from the start to reach any offset. You could recompress while indexing, but an object store mutating your data is IMO a no-no.


S3 is "eventually consistent", so I don't think indexing in the background would be such a big deal. But yeah, like I said, this would only work for no compression or those schemes that are seekable (not gzip).

In any case it is definitely a lot more work than ZIP.


No, S3, like MinIO, has read-after-write consistency.

So indexing would block on either writes or reads until it is done. We block when doing the zip indexing, but that is much more lightweight - and we limit to 100MB ZIP directory. That way we don't risk long-blocking index operations.


I see. Indeed that is a potentially long time to block.


IIRC gzip can't handle this, but bzip2 can; a guy I know wrote an offline Wikipedia app for the original iPhone and had to crunch things down a lot, and he used bzip2 because you can skip ahead to a chunk without having to process the previous or subsequent chunks.

Then he just had to write some code to index article names based on which chunk(s) they were in, and boom, random-access compressed archive.
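The same idea can be sketched without relying on bzip2's internal block layout: compress fixed-size groups of articles independently and index which group each title lives in (hypothetical layout, Python stdlib bz2):

  import bz2

  def build_chunked_archive(articles, path, chunk_size=64):
      # articles: list of (title, text). Each group is compressed on its own,
      # so a lookup only ever decompresses one group.
      group_offsets, title_index = [], {}
      with open(path, "wb") as out:
          for i in range(0, len(articles), chunk_size):
              group = articles[i:i + chunk_size]
              blob = bz2.compress("\x00".join(text for _, text in group).encode())
              group_offsets.append((out.tell(), len(blob)))
              for pos, (title, _) in enumerate(group):
                  title_index[title] = (len(group_offsets) - 1, pos)
              out.write(blob)
      return group_offsets, title_index

  def read_article(path, title, group_offsets, title_index):
      group_no, pos = title_index[title]
      offset, length = group_offsets[group_no]
      with open(path, "rb") as f:
          f.seek(offset)                             # only this group is decompressed
          return bz2.decompress(f.read(length)).decode().split("\x00")[pos]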


This is basically exactly what we do: we have created a cloud-optimised tar (cotar)[1] by building a hash index of the files inside the tar.

I work with serving tiled geospatial data [2] (Mapbox vector tiles) as slippy maps, where we serve millions of small (mostly <100KB) files to our users. Our data only changes weekly, so we precompute all the tiles and store them in a tar file in S3.

We compute an index for the tar file and then use S3 range requests to serve the tiles, which means we can generally fetch a tile with 2 requests to S3 (or 1 if the index is cached), typically in ~20-50ms.

Full coverage of the world with Mapbox vector tiles is around 270M tiles and a ~90GB tar file, which can be computed from OpenStreetMap data [3].

> Though even that would only work with a subset of compression methods or no compression.

We compress the individual files as a workaround. There are options for indexing a compressed (gzip) tar file, but the benefits of a compressed tar vs compressed files are small for our use case.
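For what it's worth, that workaround is simple to express: gzip each tile on its own and add it to a plain (uncompressed) tar, so members stay independently range-readable (illustrative names):

  import gzip, io, tarfile, time

  def add_compressed(tar, name, payload):
      blob = gzip.compress(payload)                  # compress this member only
      info = tarfile.TarInfo(name + ".gz")
      info.size = len(blob)
      info.mtime = int(time.time())
      tar.addfile(info, io.BytesIO(blob))

  with tarfile.open("tiles.tar", "w") as tar:        # "w" = no whole-archive compression
      add_compressed(tar, "0/0/0.pbf", b"...tile bytes...")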

[1] https://github.com/linz/cotar (or wip rust version https://github.com/blacha/cotar-rs) [2] https://github.com/linz/basemaps or https://basemaps.linz.govt.nz [3] https://github.com/onthegomap/planetiler


Why not upload those files separately, or in ZIP format?


> Why not upload those files separately,

Doing S3 PUT requests for 260M files every week would cost around $1,300 USD/week, which was too much for our budget.

> or in ZIP format?

We looked at ZIPs, but due to the way the header (well, the central file directory) is laid out, finding a specific file inside the zip would require the system to download most of the CFD.

The zip CFD is basically a list of header entries, each a fixed ~46-byte record plus the (variable-length) file name; to find a specific file you have to iterate the CFD until you find the entry you want.

Assuming a smallish archive (~1 million files), the CFD for the zip would be on the order of 50MB+ (depending on filename length).

Using a hash index, you know exactly where in the index to look for the entry, so you can use a range request to load just that slot:

  offset = hash(file_name) % slot_count
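For illustration, a lookup over HTTP range requests might look like this (a hypothetical fixed 24-byte slot layout with linear probing - not cotar's actual on-disk format):

  import hashlib, struct, urllib.request

  SLOT = 24   # hypothetical slot: 8-byte name hash, 8-byte offset, 8-byte length

  def fetch_range(url, start, length):
      req = urllib.request.Request(url, headers={"Range": "bytes=%d-%d" % (start, start + length - 1)})
      with urllib.request.urlopen(req) as resp:
          return resp.read()

  def lookup(url, index_offset, slot_count, file_name):
      name_hash = int.from_bytes(hashlib.sha256(file_name.encode()).digest()[:8], "little")
      slot = name_hash % slot_count
      while True:                                    # linear probing on collisions
          raw = fetch_range(url, index_offset + slot * SLOT, SLOT)
          h, offset, length = struct.unpack("<QQQ", raw)
          if h == name_hash:
              return fetch_range(url, offset, length)   # second request: the file itself
          if h == 0:                                 # empty slot -> not in the archive
              raise KeyError(file_name)
          slot = (slot + 1) % slot_count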
Another file format gaining popularity recently is PMTiles[1], which uses a tree index; however, it is specifically for tiled geospatial data.

[1] https://github.com/protomaps/PMTiles


Nice tools!

When it is server-side, reading a 50MB CFD is a small task. And once it is read, we can store the zipindex for even faster access.

We made 'zipindex' purposely to be a sparse, compact, but still reasonably fast representation of the CFD - just enough to be able to serve the file. Typically it is around an 8:1 reduction on the CFD, but it of course depends a lot on your file names, as you say (the index is zstandard compressed).

Access time from fully compressed data to a random file entry is around 100ms with 1M files. Obviously if you keep the index in memory, it is much less. This time is pretty much linear, which is why we recommend aiming for 10K files per archive; that keeps the impact pretty minimal.


You mean the cost of the PUT requests becomes significant. That makes sense since AWS doesn't charge for incoming bandwidth. Thanks!


As mentioned in the blog post we do not expect to support TAR archives.

There is no central directory in TAR, so indexing would be very expensive, and we would not be able to support tar.gz anyway.

ZIP was chosen mainly because it fits the feature set. The alternative was to develop a custom format or take another approach completely. This was the solution that, in our eyes, would be the easiest for clients to implement.


Would it work in gateway mode?


No, gateway mode is EOL'ed and is not taking any new features.


Oh I see. I checked the repository and it's being stopped very soon: https://github.com/minio/minio/issues/14331



