
Tar doesn't use any sort of index like zip does, so to extract the specified file the server side would need to parse through possibly the entire file just to see if the requested file is there, and then start streaming it. Requests for files that aren't in the tar archive would be prohibitively expensive.
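
A rough sketch of what that lookup has to do (illustrative Python, not anyone's actual implementation):

  # With no index, finding one member means walking tar headers from the
  # start, potentially all the way to the end of the archive.
  import tarfile

  def find_member(path, name):
      with tarfile.open(path, "r") as tar:   # streams header after header
          for member in tar:
              if member.name == name:
                  return tar.extractfile(member).read()
      return None                            # scanned everything, not there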

There are definitely ways to do it without those problems, though. They just wouldn't be quite as simple as the approach used for supporting zip.


You could pre-index them I suppose. Though even that would only work with a subset of compression methods or no compression.


We considered TAR, but indexing requires reading back and decompressing the entire archive.

This may be feasible on small TAR files, and for a single PutObject you could index while uploading. However, for multipart objects, parts can arrive in any order, so you are forced to read them back. This would lead to unpredictable response times.

Compare that to reading the directory of a zip, which even on big files is maybe a couple of megabytes at most.

Add to that that tar.gz will require you to decompress from the start to reach any offset. You can recompress while indexing, but an object-store mutating your data is IMO a no-no.


S3 is "eventually consistent", so I don't think indexing in the background would be such a big deal. But yeah, like I said this would only work for no-compression or those schemes that are seekable (not gzip).

In any case it is definitely a lot more work than ZIP.


No, S3, like MinIO, has read-after-write consistency.

So indexing would block on either writes or reads until it is done. We block when doing the zip indexing, but that is much more lightweight - and we limit to a 100MB ZIP directory. That way we don't risk long-blocking index operations.


I see. Indeed that is a potentially long time to block.


IIRC gzip can't handle this, but bzip2 can; a guy I know wrote an offline Wikipedia app for the original iPhone and had to crunch things down a lot, and he used bzip2 because you can skip ahead to a chunk without having to process the previous or subsequent chunks.

Then he just had to write some code to index article names based on which chunk(s) they were in, and boom, random-access compressed archive.
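
The shape of that approach, as a hedged sketch (illustrative Python, not his actual code):

  # Compress each chunk separately with bz2 and record where it landed, so a
  # single chunk can be decompressed without touching the ones before it.
  import bz2

  def build_archive(chunks, path):
      index = []                         # (offset, length) per chunk
      with open(path, "wb") as f:
          for chunk in chunks:
              blob = bz2.compress(chunk)
              index.append((f.tell(), len(blob)))
              f.write(blob)
      return index

  def read_chunk(path, index, i):
      offset, length = index[i]
      with open(path, "rb") as f:
          f.seek(offset)                 # jump straight to the chunk we need
          return bz2.decompress(f.read(length))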


This is basically exactly what we do: we have created a cloud optimised tar (cotar)[1] by building a hash index of the files inside the tar.

I work on serving tiled geospatial data [2] (Mapbox vector tiles) as slippy maps, where we serve millions of small (mostly <100KB) files to our users. Our data only changes weekly, so we precompute all the tiles and store them in a tar file in S3.

We compute an index for the tar file and then use S3 range requests to serve the tiles to our users. This means we can generally fetch a tile with 2 requests to S3 (or 1 if the index is cached), typically in ~20-50ms.
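
Roughly like this (illustrative Python/boto3; the names are made up, not cotar's actual API):

  import boto3

  s3 = boto3.client("s3")

  def fetch_tile(bucket, key, offset, size):
      # the precomputed index maps a tile to (offset, size) inside the tar;
      # a ranged GET then pulls just those bytes out of the big object
      resp = s3.get_object(
          Bucket=bucket,
          Key=key,
          Range=f"bytes={offset}-{offset + size - 1}",  # inclusive byte range
      )
      return resp["Body"].read()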

To get full coverage of the world with Mapbox vector tiles it is around 270M tiles and a ~90GB tar file, which can be computed from OpenStreetMap data [3].

> Though even that would only work with a subset of compression methods or no compression.

We compress the individual files as a workaround. There are options for indexing a compressed (gzip) tar file, but the benefits of a compressed tar vs compressed files are small for our use case.

[1] https://github.com/linz/cotar (or wip rust version https://github.com/blacha/cotar-rs) [2] https://github.com/linz/basemaps or https://basemaps.linz.govt.nz [3] https://github.com/onthegomap/planetiler


Why not upload those files separately, or in ZIP format?


> Why not upload those files separately,

Doing S3 PUT requests for 260M files every week would cost around $1300 USD/week, which was too much for our budget.
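
(That lines up with S3's standard ~$0.005 per 1,000 PUT requests:)

  260,000,000 PUTs / 1,000 * $0.005 ≈ $1,300 per weekly refresh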

> or in ZIP format?

We looked at zips, but due to the way the header (well, the central file directory) is laid out, finding a specific file inside the zip would require the system to download most of the CFD.

The zip CFD is basically a list of header entries that vary in size (30 bytes + file_name length); to find a specific file you have to iterate the CFD until you find the one you want.

Assuming you have a smallish archive (~1 million files), the CFD for the zip would be somewhere on the order of 50MB+ (depending on filename length).

Using a hash index, you know exactly where in the index to look for the header entry, so you can use a range request to load just that entry:

  offset = hash(file_name) % slot_count
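
A toy sketch of that slot lookup (illustrative Python, not cotar's actual on-disk format):

  import struct, zlib

  SLOT = struct.Struct("<QQQ")             # (name_hash, file_offset, file_size)

  def lookup(index_bytes, file_name, slot_count):
      h = zlib.crc32(file_name.encode())
      slot = h % slot_count
      for probe in range(slot_count):      # linear probing on collisions
          i = ((slot + probe) % slot_count) * SLOT.size
          name_hash, offset, size = SLOT.unpack_from(index_bytes, i)
          if name_hash == h:
              return offset, size          # range-request these bytes from the tar
          if name_hash == 0:               # empty slot: file isn't in the archive
              return None
      return None
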
Another file format which has been gaining popularity recently is PMTiles[1], which uses a tree index; however it is specifically for tiled geospatial data.

[1] https://github.com/protomaps/PMTiles


Nice tools!

When it is server-side, reading a 50MB CFD is a small task. And once it is read, we can store the zipindex for even faster access.

We made 'zipindex' to purposely be a sparse, compact, but still reasonably fast representation of the CFD - just enough to be able to serve the file. Typically it is around an 8:1 reduction on the CFD, but it of course depends a lot on your file names as you say (the index is zstandard compressed).

Access time from fully compressed data to a random file entry is around 100ms with 1M files. Obviously if you keep the index in memory, it is much less. This time is pretty much linear, which is why we recommend aiming for 10K files per archive, which keeps the impact pretty minimal.


You mean the cost of the PUT requests becomes significant. That makes sense since AWS doesn't charge for incoming bandwidth. Thanks!


I think the observation being made was that this is at the scary intersection of consumer IoT non-security and major infrastructure.


Or only run Linux? *

* (any source available OS)


GP says not to run anything you haven't looked at yourself for the most part, paraphrasing. I doubt anyone has self-audited all the software and drivers going into a desktop Linux distro.

The point is, at some point you stop digging.


> GP says not to run anything you haven't looked at yourself for the most part, paraphrasing.

That is a highly inaccurate paraphrase of what I said. You can tell it's inaccurate because I'm specifically suggesting that end users should be able to trust apps on the App Store without auditing the apps ourselves, as long as we trust the app authors, and that app authors should bear responsibility for what they redistribute.


Desktop Linux distros have package maintainers and companies behind them like Red Hat or Canonical.


GP says not to publish anything you haven't looked at (or OKed by appeal to authority). Publishing should be a higher standard than running.


This is a great resource. There were also some other useful links in the comments last time: https://news.ycombinator.com/item?id=17054419


I haven't used hg enough to have an opinion on it, despite several attempts... Problem is I learn bottom up, and I just haven't been able to "think in mercurial" the way I can "think in git".

I find it interesting that what git calls a commit is actually a revision (or checkpoint, snapshot, point-in-time) and what mercurial calls a revision is actually a commit (or patch, delta, changeset).

I think a lot of people think in terms of patches/changesets and I suspect (still haven't gotten far enough to confirm) hg is a toolbox for managing them in a similar way to how git is a toolbox for manipulating its snapshot based DAG.


Alternatively... how large would the whale population need to be to counter the carbon footprint of such tourism?


If you're using rebase there's the `--committer-date-is-author-date` and `--ignore-date` flags. One uses the author date for both and the other uses the commit date for both.

Without using either flag rebase should update the commit date and preserve the author date.

If by rebase you meant GitHub's rebase merge option I think you're out of luck :-/


They might be just what I am looking for! I will check them out soon. Thank you! :)


Are those really the only options? I'm trying to wrap my head around how using a fixed-size thread pool for I/O automatically implies deadlocks, but I just can't. Unless the threads block on completion until their results are consumed instead of just notifying and then taking the next task...

I can definitely imagine blocking happening while waiting for a worker to be available, though. Did you mean simply blocking instead of deadlock?


N threads, with N readers waiting for a message that will only come if the N+1 reader (still in the queue) gets a message first.
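
A minimal sketch of that shape (illustrative Python; a timeout stands in for the hang):

  from concurrent.futures import ThreadPoolExecutor
  import threading

  N = 4
  ready = threading.Event()

  def reader(i):
      # stands in for "blocked reading a socket until a peer responds"
      return ready.wait(timeout=2)     # without the timeout this never returns

  def writer():
      ready.set()                      # queued behind the N blocked readers

  with ThreadPoolExecutor(max_workers=N) as pool:
      readers = [pool.submit(reader, i) for i in range(N)]
      pool.submit(writer)              # can't run until a reader frees a thread
      print([f.result() for f in readers])   # [False, False, False, False]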


Thank you for humoring me. I had to sleep on it, but I can see it now. Seems like it would require a really bad design or more likely bad actors (remotes leaving dead sockets open), but it would definitely be possible.

The same scenarios would lead to resource exhaustion if the thread pool wasn't bounded.


But surely one must use an output queue, not synchronously wait for the consumer to consume a result?


The N + 1 readers are all reading different sockets, blocked.



I suspect "Western-style dates" was meant to refer to the Gregorian calendar rather than the format used to represent a date on that calendar.

ISO-8601 only really applies to the Gregorian calendar.

