
So is any of S3 powered by SSDs?

I honestly figured the standard tier must be powered by SSDs and the slower tiers were the ones using HDDs or other slower systems.



The storage itself is probably (mostly) on HDDs, but I'd imagine metadata, indices, etc are stored on much faster flash storage. At least, that's the common advice for small-ish Ceph cluster MDS servers. Obviously S3 is a few orders of magnitude bigger than that...


> So is any of S3 powered by SSDs?

S3’s KeyMap Index uses SSDs. I also wouldn’t be surprised if at this point SSDs are somewhere along the read path for caching hot objects or in the new one zone product.


Repeating a comment I made above - for the standard tier, requests are priced high enough that it's cost-effective to let space on the disks go unused if someone wants an IOPS/TB ratio higher than what disk drives can provide. But they're not priced much higher than that.

The latest generation of drives stores about 30TB each - I don't know how much AWS pays for them, but a wild-ass guess would be $300-$500 per drive. That's a lot cheaper than 30TB of SSD.

Also important - you can put those disks in high-density systems (e.g. 100 drives in 4U) that only add maybe 25% to the total cost, at least if you're AWS, a bit more for the rest of us. The per-slot cost of boxes that hold lots of SSDs seems to be a lot higher.
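
To put rough numbers on that (all figures below are my own guesses, not AWS's actual costs), a quick back-of-the-envelope comparison of $/TB and IOPS/TB, plus how much of an HDD you'd have to leave empty to hit a given IOPS/TB target:

    # Back-of-the-envelope HDD vs SSD comparison. All numbers are
    # assumptions for illustration, not actual AWS pricing.
    drives = {
        #            capacity_tb  cost_usd  random_iops
        "30TB HDD":  (30,          400,      150),
        "30TB SSD":  (30,          2400,     500_000),
    }

    for name, (tb, cost, iops) in drives.items():
        print(f"{name}: ${cost / tb:.0f}/TB, {iops / tb:.1f} IOPS/TB")

    # If a customer wants more IOPS/TB than an HDD delivers, you can sell
    # only part of each disk's capacity so the ratio works out.
    wanted_iops_per_tb = 50            # hypothetical customer requirement
    hdd_tb, hdd_iops = 30, 150
    sellable_tb = min(hdd_tb, hdd_iops / wanted_iops_per_tb)
    print(f"Sellable at {wanted_iops_per_tb} IOPS/TB: {sellable_tb:.0f} of {hdd_tb} TB")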


It's assumed that the new S3 Express One Zone is backed by SSDs, but I believe Amazon doesn't say so explicitly.


I've always felt it's probably a wrapper around Amazon EFS, given the similar pricing and the fact that S3 Express One Zone has "Directory" buckets, a very file-system-y idea.


Seems to indicate the storage underneath might be similar in cost and performance, and it might in fact really be similar hardware - not that the software on top is the same.


nope


I always assumed the really slow tiers were tape.


My own assumption was always that the cold tiers are managed by a tape robot, but managing offlined HDDs rather than actual tapes.


Yeah, I don't know about S3, but years back I talked a fair bit with someone who did storage work for HPC, and one thing he talked about was building huge JBOD arrays where only a handful of disks per rack would be spun up at any time, basically pushing what could be done with SCSI extenders and the like. It wouldn't surprise me if they're doing something like that, batch-scheduling the drive activations over a window of minutes to hours.
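
As a toy sketch of what that might look like (purely speculative - the rack sizes, limits, and names are made up, not anything AWS has described), you'd queue reads per drive and only let a handful of drives per rack spin at once:

    # Toy batch scheduler for reads against mostly-spun-down drives.
    # Entirely hypothetical; just illustrates "only a few disks per rack
    # spinning at any one time".
    from collections import defaultdict

    MAX_ACTIVE_DRIVES_PER_RACK = 4

    def schedule(requests):
        """requests: list of (rack_id, drive_id, blob_id) reads to serve."""
        by_rack = defaultdict(lambda: defaultdict(list))
        for rack, drive, blob in requests:
            by_rack[rack][drive].append(blob)

        batches = []
        for rack, drives in by_rack.items():
            # Busiest drives first so each spin-up drains as many reads as possible.
            pending = sorted(drives.items(), key=lambda kv: -len(kv[1]))
            while pending:
                active = pending[:MAX_ACTIVE_DRIVES_PER_RACK]
                pending = pending[MAX_ACTIVE_DRIVES_PER_RACK:]
                batches.append((rack, [drive for drive, _ in active]))
        # Each batch: spin these drives up, drain their reads, spin them down.
        return batches

    print(schedule([("rack1", f"drive{i}", f"blob{i}") for i in range(10)]))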


There was an article or interview with one of the lead AWS engineers, and he said they use CDs or DVDs for cold Glacier.


I think that's close to the truth. IIRC it's something like a massive cluster of machines that are effectively powered off 99% of the time with a careful sharding scheme where they're turned on and off in batches over a long period of time for periodic backup or restore of blobs.


It's amazing that Glacier is such a huge system, with so many people working on it, and yet it's still a public mystery how it works. I've not seen a single confirmation of the actual design.


Glacier could be doing something similar to what Azure does: https://www.microsoft.com/en-us/research/project/project-sil...

Also see this thread: https://news.ycombinator.com/item?id=13011396


I doubt it’s using WORM drives.


Not even the higher tiers of Glacier were tape afaict (at least when it was first created), just the observation that hard drives are much bigger than you can reasonably access in useful time.


In the early days when there were articles speculating on what Glacier was backed by, it was actually on crusty old S3 gear (and at the very beginning, it was just on S3 itself as a wrapper and a hand wavy price discount, eating the costs to get people to buy in to the idea!). Later on (2018 or so) they began moving to a home grown tape-based solution (at least for some tiers).


I'm not aware of AWS ever confirming tape for Glacier. My own speculation is they likely use HDDs for Glacier - especially so for the smaller regions - and eat the cost.

Someone recently came across planning documents filed in London for a small "datacenter" which wasn't attached to their usual London compute DCs and was built to house tape libraries (this was explicitly called out because there was concern about power - tape libraries don't use much). So I would be fairly confident they wait until the Glacier volumes grow enough on HDD before building out tape infrastructure.


Do you have any sources for that? I'm really curious about Glacier's infrastructure and AWS has been notoriously tight-lipped about it. I haven't found anything better than informed speculation.


My speculation: writes are to /dev/null, and the fact that reads are expensive and that you need to inventory your data before reading means Amazon is recreating your data from network transfer logs.


Maybe they ask the NSA for a copy.


Source is SWIM who worked there (doubt any of that stuff has been published)


That's surprising given how badly restoration worked (much more like tape than drives).


I'd be curious whether simulating a shitty restoration experience was part of the emulation when they first ran Glacier on plain S3 to test the market.


The “drain time” for a 30TB drive is probably between 36 and 48 hours. I don’t have one in my lab to test, or the patience to do so if I did.
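
That lines up with just dividing capacity by sustained sequential throughput (the throughput range here is my assumption for a current large drive, not a measurement):

    # Rough drain-time estimate: capacity / sustained sequential throughput.
    capacity_tb = 30
    for mb_per_s in (180, 230):        # assumed sustained throughput range
        hours = capacity_tb * 1e6 / mb_per_s / 3600
        print(f"{mb_per_s} MB/s -> {hours:.0f} hours")   # ~46 and ~36 hours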


Yep, I did 20TB drives in my Unraid box and it took about two days and some change to set up a clean sync between 'em all :)


There might be surprisingly little value in going to tape due to all the specialization required. As the other comment suggests, many of the lower tiers likely just represent IO bandwidth classes. A 16 TB disk with 100 IOPS can only offer 1 IOPS over 160 GB each for 100 customers, or 0.1 IOPS over 16 GB each for 1,000 customers, etc. Just scale that thinking up to a building full of disks; it still applies.
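
As a quick calculation (the disk figures are just illustrative):

    # Per-customer slice of one HDD's capacity and random IOPS.
    disk_tb, disk_iops = 16, 100       # illustrative figures
    for customers in (10, 100, 1000):
        print(f"{customers} customers: {disk_tb * 1000 / customers:.0f} GB "
              f"and {disk_iops / customers:.2f} IOPS each")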


I realize you're making a general point about space/IO ratios and the below is orthogonal, no contradiction.

In a large distributed storage system, the user-facing per-disk IO capacity you can actually "sell" is quite a bit lower than that. There's constant maintenance churn to keep data available:

- local hardware failure
- planned larger-scale maintenance
- transient, unplanned larger-scale failures
- etc.

In general, you can fall back to reconstruction from the erasure codes to keep serving during degradation. But a) that's enormously expensive in IO and CPU, and b) you carry higher availability and/or durability risk because you've lost redundancy.
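
To put a number on the IO cost (the coding parameters are hypothetical, e.g. a 10+4 Reed-Solomon-style stripe):

    # IO amplification of a degraded read under k-of-(k+m) erasure coding.
    k, m = 10, 4      # 10 data shards + 4 parity shards per stripe (assumed)

    healthy_reads = 1     # healthy read: fetch the one shard holding the range
    degraded_reads = k    # degraded read: reconstruct from any k survivors

    print(f"Read amplification when degraded: {degraded_reads // healthy_reads}x")
    # Plus the CPU cost of the decode, and the same penalty again for any
    # background re-replication the failure triggers.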

Additionally, it may make sense to rebalance where data lives for optimal read throughput (and other performance reasons).

So in practice, there's constant rebalancing going on in a sophisticated distributed storage system that takes a good chunk of your HDD IOPS.

This + garbage collection also makes tape really unattractive for all but very static archives.


See comments above about AWS per-request cost - if your customers want higher performance, they'll pay enough to let AWS waste some of that space and earn a profit on it.


I expect they are storing metadata on SSDs. They might have SSD caches for really hot objects that get read a lot.


Standard has the same performance as every other storage class. There are two asynchronous classes (the Glacier archive tiers) which you can't read from without restoring first, but that's not a 'performance' difference as such - GETs aren't slow, they fail.
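
For anyone who hasn't run into it: with boto3 the GET fails with InvalidObjectState until you've issued a restore and it finishes, roughly like this (bucket and key are placeholders):

    # Restore-then-read flow for the asynchronous (archive) storage classes.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    bucket, key = "my-bucket", "archived/object.bin"   # placeholders

    try:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    except ClientError as e:
        if e.response["Error"]["Code"] == "InvalidObjectState":
            # The object is archived - GETs fail until a restore completes.
            s3.restore_object(
                Bucket=bucket,
                Key=key,
                RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Bulk"}},
            )
            # Poll head_object until the Restore header reports
            # ongoing-request="false", then retry the GET.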



