
Would add that HDFS was a particular nightmare to manage.

You had to worry about file sizes, since too many small files would overload the NameNode. Being a Java app running on older JVMs, it would do a full GC under heavy load and cause failovers. And it was impossible to get data in/out from outside the cluster using third-party tools.

I remember many companies seeing S3 and just being in shock that it was so cheap and limitless, and that someone else was going to manage it all.



It's interesting, because I think HDFS (and NameNodes in particular) were impressively engineered for a use case which didn't quite materialize: very fast metadata queries (they are still much faster than S3 API calls). Turns out that cheap, simple, and massively scalable object storage is just far, far more important in practice.

I think there are still a couple of use cases where HDFS dominates S3 (some HBase workloads, I think?). But yeah, I scaled up and maintained a 2000+ node Hadoop cluster for years, and I would never choose it over object storage if given any plausible alternative.


This is actually a topic I love to talk about, because I spent a lot of my time on S3A and the cloud FileSystem implementations. Fast metadata queries were actually a huge deal for query planning, and of course there were a lot of potential performance surprises on S3. HBase was (unsurprisingly) heavily dependent on semantics that HDFS has but that are hard to get right on object storage, and it required a couple of extra layers to work properly on S3 (and even then, write-ahead logs were still on a small HDFS cluster last I heard). My biggest complaint about S3 was always eventual consistency (for which Hadoop developed a workaround - it originally employed a lot of worst practices on S3 and suffered from eventual consistency A LOT), but now that S3 has much better consistency guarantees, I agree: it's incredibly hard to beat something that cheap.
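To make the query-planning point concrete, here is a minimal sketch (hostnames, bucket and paths are hypothetical) of the Hadoop FileSystem API that planners lean on. The same listStatus() call is a single RPC against the NameNode's in-memory namespace on HDFS, but through S3A it turns into paginated HTTP LIST requests against the bucket:

    // Minimal sketch: the same metadata-only code runs against hdfs:// or s3a://,
    // only the cost of the metadata calls differs. Paths/hosts are hypothetical.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListPartition {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        URI uri = URI.create(args.length > 0 ? args[0] : "hdfs://namenode:8020/");
        try (FileSystem fs = FileSystem.get(uri, conf)) {
          Path partition = new Path("/warehouse/events/dt=2021-07-01");
          long bytes = 0;
          int files = 0;
          for (FileStatus st : fs.listStatus(partition)) { // metadata-only call
            bytes += st.getLen();
            files++;
          }
          System.out.printf("%d files, %d bytes in %s%n", files, bytes, partition);
        }
      }
    }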


For a job that needs to access hundreds of thousands of small files, the ability to read the metadata quickly is very important.

This is the wider issue with small files. On HDFS each file uses up some NameNode memory, but if there are jobs that need to touch 100k+ files (which I have seen plenty of), that puts a real strain on the NameNode too.
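To put rough numbers on that: the commonly cited rule of thumb is about 150 bytes of NameNode heap per namespace object (file, directory or block), though the exact figure varies by version. A back-of-envelope sketch:

    // Back-of-envelope NameNode heap estimate for a small-files workload.
    // The ~150 bytes per namespace object figure is the usual rule of thumb;
    // actual numbers vary by Hadoop version.
    public class NameNodeHeapEstimate {
      public static void main(String[] args) {
        long files = 100_000_000L;   // 100M small files
        long blocksPerFile = 1;      // small files fit in one block
        long bytesPerObject = 150;   // rough heap cost per inode/block

        long objects = files + files * blocksPerFile;
        double heapGiB = objects * bytesPerObject / (1024.0 * 1024 * 1024);
        System.out.printf("~%.1f GiB of NameNode heap just for namespace metadata%n", heapGiB);
      }
    }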

I have no experience with S3 to know how it would behave in terms of metadata queries for lots of small objects.


Small files on S3 are both slow and expensive too. But at least one bad query won't be able to kill your whole cluster like it can with HDFS.


Yeah, I would have loved to see HDFS get really scalable metadata management. I remember hearing about LinkedIn's intentions to do some significant work there at the last community event I attended, but from their blog post this week it doesn't sound like that's happened since the read-from-standby work [1].

Kerberos (quite popular on big enterprise clusters) is really what makes it hard to get data in/out, IMO. I see generic Hadoop connectors in A LOT of third-party tools.

[1] https://engineering.linkedin.com/blog/2021/the-exabyte-club-...
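On the Kerberos point, external tools can talk to a secured cluster, but they have to do the keytab login dance before any RPC will succeed, which is where a lot of them fall over. A minimal sketch (the principal, keytab path and NameNode address are hypothetical):

    // Sketch of what an external tool has to do before touching a Kerberized
    // cluster. Principal, keytab path and NameNode address are hypothetical.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberizedClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");

        // Tell the Hadoop security layer to use Kerberos, then log in from a keytab.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
            "etl-tool@EXAMPLE.COM", "/etc/security/keytabs/etl-tool.keytab");

        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf)) {
          System.out.println(fs.exists(new Path("/data")) ? "/data exists" : "/data missing");
        }
      }
    }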


Apache Ozone https://hadoop.apache.org/ozone/ is an attempt to make a more scalable (for small files / metadata) HDFS-compatible object store with an S3 interface. Solving the metadata problem in the HDFS NameNode will probably never happen now: too much of the NameNode code expects all the metadata to be in memory. Efforts to overcome NameNode scalability have instead centered around "read from standby", which offers impressive results.
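The nice part is that the existing S3A connector can simply be pointed at an S3-compatible gateway such as Ozone's. A sketch, where the endpoint, credentials and bucket name are hypothetical and deployment-specific:

    // Sketch of pointing S3A at a non-AWS, S3-compatible endpoint such as an
    // Ozone S3 gateway. Endpoint URL, credentials and bucket are hypothetical.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OzoneViaS3A {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.endpoint", "http://ozone-s3g.example.com:9878"); // hypothetical gateway
        conf.set("fs.s3a.path.style.access", "true");
        conf.set("fs.s3a.access.key", "accessKey");  // placeholder credentials
        conf.set("fs.s3a.secret.key", "secretKey");

        try (FileSystem fs = FileSystem.get(URI.create("s3a://mybucket/"), conf)) {
          for (FileStatus st : fs.listStatus(new Path("/"))) {
            System.out.println(st.getPath() + "\t" + st.getLen());
          }
        }
      }
    }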

Metadata is not the only problem with small files. Massively parallel jobs that need to read tiny files will always be slower than if the files were larger: the overhead of getting the metadata for a file and setting up a connection to do the read is quite large when you are only pulling a few hundred KB or a few MB.
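A rough sketch of why that hurts: every tiny file pays a fixed metadata lookup plus connection/stream setup cost before any bytes flow (the paths below are hypothetical):

    // Sketch of per-file read overhead: each open() costs NameNode RPCs and a
    // DataNode connection, all to pull a few hundred KB. Paths are hypothetical.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallFileReadCost {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf)) {
          long bytes = 0, files = 0, start = System.nanoTime();
          for (FileStatus st : fs.listStatus(new Path("/landing/tiny-files"))) {
            // One open() per file: fixed setup cost dominates the tiny read.
            try (FSDataInputStream in = fs.open(st.getPath())) {
              byte[] buf = new byte[64 * 1024];
              int n;
              while ((n = in.read(buf)) > 0) {
                bytes += n;
              }
            }
            files++;
          }
          System.out.printf("%d files, %d bytes in %.1f ms%n",
              files, bytes, (System.nanoTime() - start) / 1e6);
        }
      }
    }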

The other issue with the HDFS NameNode is that it has a single read/write lock protecting all the in-memory data. Breaking that lock into a more fine-grained set of locks would be a big win, but it is quite tricky at this stage.
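A simplified illustration of that pattern (not the actual FSNamesystem code): one ReentrantReadWriteLock guards the whole in-memory namespace, so a single long-running mutation stalls every reader, no matter which part of the tree it touches.

    // Simplified illustration of a single global read/write lock guarding an
    // in-memory namespace. Not the real FSNamesystem code, just the pattern.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class GlobalNamespaceLock {
      private final Map<String, Long> namespace = new HashMap<>();        // path -> file length
      private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair mode

      Long getFileLength(String path) {
        lock.readLock().lock();      // many readers may hold this concurrently
        try {
          return namespace.get(path);
        } finally {
          lock.readLock().unlock();
        }
      }

      void createFile(String path, long length) {
        lock.writeLock().lock();     // excludes ALL readers and writers,
        try {                        // no matter which subtree is touched
          namespace.put(path, length);
        } finally {
          lock.writeLock().unlock();
        }
      }
    }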



