What every programmer should know about solid-state drives (2014) (codecapsule.com)
170 points by fagnerbrack on May 24, 2022 | 110 comments


I feel like maybe this is "what all filesystem developers should know about solid-state drives"; it's not very obvious how most other developers would interact with a device at the level of abstraction where they have the kind of necessary control.


If some typical write pattern from a typical app is wearing out the SSD really fast, I'd say that's the SSD firmware engineer's problem? And I think they've actually done a great job in general, judging by the typical lifespan of SSDs and the typically great performance. I'd argue that if the drive is designed correctly, most programmers shouldn't have to care about low level details. (I did say MOST).


I think you misspelled "it's the user's problem". I don't think most companies care until it becomes something that materially affects them. Until then, users are reliant on the developers of the applications they use to make up for the deficiencies in lower layers.


A reputation for drives that fail faster than their competitors will definitely affect them!


> A reputation for drives that fail faster than their competitors

How can they get that if they stuff in enough fake reviews, plus there's the legion of consumers who would have no idea that the drive was the issue and not "viruses"?


Neither of those apply to enterprise customers.


That's a completely different market, to which are marketed different drives.


True mostly! OEMs are the largest non-enterprise market, and they do (roughly) the same types of testing and validation an enterprise customer would though.

No one is going to be selling a million laptops with a drive from RandoDriveManuGoodBrand off Amazon with no track record and no validation.

Anyone buying the non-name brand types of drives knows they're getting (at best) something that might only work a little while before exploding.

The name brands like Samsung et al. work hard to make their firmware not grenade something (and the drives overall to be AT LEAST as reliable as their competitors) BECAUSE they want the name to mean something. It is what drives customers their way, most of the time.

If they get a reputation as a company selling junk (cough Deskstar/Deathstar) that costs them billions over many years.


The firmware's job is wear leveling - making sure all the sectors wear out at about the same time, which they do a great job at. But SSDs can write so fast that you can burn out the drive in months (maybe even weeks?) if you wanted to. There's nothing the firmware can do to fix the limitations of flash itself. The important thing to keep in mind is that for write heavy workloads, you need to keep write amplification in mind.

I remember an adjacent team to mine that had to store several gigs of data which changed often, but only a small percent changed at any one time. They needed to recover quickly from a crash so they wrote it to disk. But they wrote the entire data set out to disk after every update, instead of keeping it in e.g. rocksdb or even sqlite. Their entire fleet burnt through their SSDs at about the same rate, so machines were dying in rapid succession, ouch. Write amplification is a real problem, but SSDs' great performance often masks it being an issue until down the road.
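
A rough way to put numbers on that story is to treat application-level write amplification as bytes written divided by bytes that actually changed. The sketch below is only illustrative: the 4 GiB data set and the 1% change rate are made-up stand-ins for "several gigs" and "a small percent", and it ignores whatever extra amplification the drive's FTL or a real storage engine adds on top.

```python
def write_amplification(bytes_written: float, bytes_changed: float) -> float:
    """Ratio of bytes pushed to disk to bytes that logically changed."""
    if bytes_changed <= 0:
        raise ValueError("bytes_changed must be positive")
    return bytes_written / bytes_changed


# Hypothetical numbers, not taken from the team in the story.
DATASET_BYTES = 4 * 1024**3      # assumed 4 GiB data set
CHANGED_FRACTION = 0.01          # assumed 1% changes per update
changed = DATASET_BYTES * CHANGED_FRACTION

# Strategy A: dump the whole data set after every update.
full_rewrite = write_amplification(DATASET_BYTES, changed)
# Strategy B: persist only the changed bytes (idealised, no engine overhead).
incremental = write_amplification(changed, changed)

print(f"rewrite everything: about {full_rewrite:.0f}x")
print(f"write only the delta: about {incremental:.0f}x")
```

Under these assumptions the full dump costs roughly a hundred times the flash wear per update, which is the kind of gap that stays invisible while the drive is still fast.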


> But SSDs can write so fast that you can burn out the drive in months (maybe even weeks?)

You can burn out a modern consumer drive in 2 days if you want to. Write perf ~6 GB/s, rated endurance ~700 TB written on a 1 TB drive. The TLC/QLC cells have very poor endurance imo.
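
The back-of-the-envelope arithmetic behind that claim, as a small sketch. It reads the figures above as 6 GB/s of write throughput and 700 TB of write endurance, and it assumes the drive could sustain its peak write speed the whole time, which is an optimistic simplification.

```python
def hours_to_exhaust(endurance_tb: float, write_gb_per_s: float) -> float:
    """Hours of continuous writing needed to use up a write-endurance budget."""
    if write_gb_per_s <= 0:
        raise ValueError("write_gb_per_s must be positive")
    seconds = endurance_tb * 1000 / write_gb_per_s  # 1 TB = 1000 GB
    return seconds / 3600


hours = hours_to_exhaust(endurance_tb=700, write_gb_per_s=6)
print(f"{hours:.1f} hours, or {hours / 24:.2f} days")
```

With those inputs the budget is gone in under two days, so the estimate above is in the right range.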


I am surprised the article doesn't mention monitoring TBW (terabytes written). I found it a good indicator for how much data is actually written to the SSD. In my case, I decided to buy cheap consumer SSDs (from WD), because I calculated I have only about 50TBW per year on my VM-drive, a ZFS-Mirror. In reality, it is even less - so far 14TBW in 2022, with 16 VMs. See a blog post here for how I monitor the stats in InfluxDB [1]. WD says the drive has an average lifetime of about 300 to 400 TBW, so I can expect at least another 5 years.

[1]: https://du.nkel.dev/blog/2021-05-05_proxmox_influxdb/#config...
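
The projection itself is simple enough to sketch. The rated figure (300 TBW, the low end of the range) and the yearly rate (50 TBW) come from the numbers above; the 50 TB already written is an assumption made up for the example, since the total so far isn't stated.

```python
def years_remaining(rated_tbw: float, written_tbw: float, tbw_per_year: float) -> float:
    """Years until the rated write budget is used up at a constant write rate."""
    if tbw_per_year <= 0:
        raise ValueError("tbw_per_year must be positive")
    return max(rated_tbw - written_tbw, 0) / tbw_per_year


# written_tbw=50 is a hypothetical starting point, not a measured value.
estimate = years_remaining(rated_tbw=300, written_tbw=50, tbw_per_year=50)
print(f"about {estimate:.1f} years left at this rate")
```

Feeding in the real counter from the drive's SMART data instead of the assumed value would make this a usable check.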


"Storage study finds SSDs might not be much more reliable than HDDs after all"

https://www.pcgamer.com/storage-study-finds-ssds-might-not-b...

"Are SSDs Really More Reliable Than Hard Drives?"

https://www.backblaze.com/blog/are-ssds-really-more-reliable...


> Cells are grouped into a grid, called a block, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually, only whole blocks can be erased. The size of a NAND-flash page size can vary, and most drive have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. For example, the Samsung SSD 840 EVO has blocks of size 2048 KB, and each block contains 256 pages of 8 KB each.

Very confusing and might be incorrect. What are planes? And are pages made out of blocks or vice versa? If blocks are grouped into pages, erasing sounds very different. "Only whole blocks" sounds like blocks are bigger than pages.


It's correct.

Planes reflect the physical structure of the storage chips: there's multiple layers that share a common vertical bus.

Plane > Block > Page, that is to say Blocks are always made up of multiple pages (commonly 128 or 256 as the quote mentions). Pages are the unit of read and write, while blocks are the unit of erasure. The FTL tries to hide this page write vs block erase mismatch as best it can, but as the original article points out you may need to be aware of what it's doing in very high performance systems.
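
A toy model may make the page-write versus block-erase mismatch concrete. This is a sketch, not how any real controller is implemented; the class and method names are invented for illustration, and the default geometry (256 pages of 8 KB) is the 840 EVO example from the quoted passage.

```python
class Block:
    """Toy NAND block: pages are programmed individually, erased only together."""

    def __init__(self, pages_per_block: int = 256, page_size_kb: int = 8):
        self.page_size_kb = page_size_kb
        self.pages = [None] * pages_per_block  # None means "erased"

    @property
    def size_kb(self) -> int:
        return len(self.pages) * self.page_size_kb

    def program(self, page_index: int, data: bytes) -> None:
        # A programmed page cannot be overwritten in place.
        if self.pages[page_index] is not None:
            raise RuntimeError("page already programmed; erase the whole block first")
        self.pages[page_index] = data

    def read(self, page_index: int):
        return self.pages[page_index]

    def erase(self) -> None:
        # Erase is all-or-nothing: every page in the block is cleared.
        self.pages = [None] * len(self.pages)


block = Block()
print(block.size_kb, "KB per block")
block.program(0, b"v1")
block.erase()            # the only way to make page 0 writable again
block.program(0, b"v2")
```

Updating one page in place therefore means either erasing its 255 neighbours too or writing the new version somewhere else, and the FTL does the latter and cleans up afterwards.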


A single NAND die is only divided into two or four planes. It's a function of how many copies of the peripheral circuitry for accessing the array are included, not how many layers are in the 3D NAND array. More planes means the die can do more things in parallel (subject to constraints).

A drive with 8 dies each having 512Gbit capacity divided into four planes per die will perform almost as well as one with 16 dies of 256Gbit divided into two planes, other things being equal (eg. number and speed of the channels between the SSD controller and the NAND, page and block sizes and access times, all of which are subject to change at the same time a generational change increases die capacity and number of planes).
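
To spell out why those two layouts are comparable, here is the arithmetic as a small sketch. The `DriveLayout` name is made up for the example, and it only counts planes and capacity; it says nothing about channels or timings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveLayout:
    dies: int
    gbit_per_die: int
    planes_per_die: int

    @property
    def total_planes(self) -> int:
        return self.dies * self.planes_per_die

    @property
    def capacity_gb(self) -> int:
        return self.dies * self.gbit_per_die // 8  # 8 bits per byte


newer = DriveLayout(dies=8, gbit_per_die=512, planes_per_die=4)
older = DriveLayout(dies=16, gbit_per_die=256, planes_per_die=2)

for name, layout in (("8 x 512Gbit x 4 planes", newer), ("16 x 256Gbit x 2 planes", older)):
    print(name, "->", layout.capacity_gb, "GB,", layout.total_planes, "planes")
```

Both come out at the same capacity and the same total number of planes, so the available parallelism is similar even though the die count halved.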


Ok that makes more sense. A small diagram outlining the 3d structure would be helpful.


Page = minimum read/write unit, block = minimum erase unit. Blocks are composed of some integer number of pages. Planes don't matter (probably).


Planes allow you to perform multiple parallel operations on a single die (assuming it's the same as the raw SLC I work with).


> Splitting cold and hot data as much as possible into separate pages will make the job of the garbage collector easier.

How do I tell my SSD to write stuff to specific pages? You can't really tell the SSD to do anything except read, write, or trim LBAs.

Does NVMe support this with its queues?

> 27. Over-provisioning is useful for wear leveling and performance

I thought most if not all SSDs were already overprovisioned. Does additional overprovisioning help?

> To ensure that logical writes are truly aligned to the physical memory, you must align the partition to the NAND-flash page size of the drive.

I think this is false. This assumes there is a one-to-one mapping of LBA to SSD PBA which you don't know. LBA 2048 could go to any PBA on any page/block/flash line in the unit and as things are written and rewritten, any correspondence that might happen due to sequential assignment of PBAs->LBAs would gradually diminish, IF you knew for sure that was happening in the first place. Because you wouldn't really know what the SSD is doing without reverse engineering or seeing the source code of firmware, unless there's things going on in NVMe land that are new and I don't yet know.


I wrote a series of articles that covered the new features defined for NVMe drives. The general pattern is that there are now lots of optional hints that drives and host systems can exchange about data placement, alignment and lifetime. But there are also alternative paradigms available like Zoned Storage that break compatibility to offer explicit control. These features are mostly only implemented in enterprise SSDs, and often only if a big customer specifically asks for them.

https://www.anandtech.com/show/11436/nvme-13-specification-p...

https://www.anandtech.com/show/14543/nvme-14-specification-p...

https://www.anandtech.com/show/16702/nvme-20-specification-r...

https://www.anandtech.com/show/15959/nvme-zoned-namespaces-e...


>I thought most if not all SSDs were already overprovisioned. Does additional overprovisioning help?

I think a big extra helping of overprovisioning is one of the major differences between consumer and enterprise SSDs.


I've been thinking about the possibility of "dumb" SSD devices.

All of the current HW-level performance hacks could actually get in the way if your software already enforces things like single writer, chunky writes and/or append-only log structures.

Give me a drive that only writes in 1 linear direction (until it's full) and has a big red button to clean the entire thing all at once (which would clearly require some offline processing time & multiple disks for a realistic system).
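
Roughly the interface I have in mind, as an in-memory sketch. Everything here is hypothetical (the class name, the methods, the byte-level granularity); it isn't modelled on any real device API, just on the "append until full, then wipe everything" idea.

```python
class AppendOnlyDevice:
    """Toy 'dumb' drive: sequential appends only, plus one whole-device reset."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = bytearray()

    def append(self, payload: bytes) -> int:
        """Write at the current write pointer and return the offset used."""
        if len(self._data) + len(payload) > self.capacity:
            raise OSError("device full; reset required")
        offset = len(self._data)
        self._data += payload
        return offset

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._data[offset:offset + length])

    def reset(self) -> None:
        """The 'big red button': discard everything at once."""
        self._data.clear()


dev = AppendOnlyDevice(capacity=16)
first = dev.append(b"hello")
second = dev.append(b"world")
print(first, second, dev.read(second, 5))
```

There is no overwrite and no per-range delete, so all the placement and cleanup decisions move up into the software that owns the log.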


Does the ZNS (Zoned Namespaces) spec come close enough?

https://nvmexpress.org/new-nvmetm-specification-defines-zone...


Yes, actually. This looks like a realistic/practical path. Had no idea this was a thing.


There is more technical information at zonedstorage.io which also offers drives for academia and open-source projects.

https://zonedstorage.io/docs/community/devices


I think that's roughly what the flash storage modules on Apple's new Mac Studio are.


Have you seen SMR spinning disks? You can get them today in host-managed flavors.


Sure! Go ahead and order some memory cells.


From a low level programmatic standpoint, managing size and alignment with (potentially unknown) page sizes throws the same challenges as for AV buffers and network packet MTU/sizes - either side of "just right" is suboptimal.
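
A small sketch of what "either side is suboptimal" looks like in numbers. The 16 KiB page size is an assumption picked for the example, and the whole thing assumes logical offsets line up with physical pages, which the drive does not promise.

```python
PAGE_SIZE = 16 * 1024  # assumed page size; real drives vary and rarely advertise it


def align_up(n: int, alignment: int) -> int:
    """Round n up to the next multiple of alignment."""
    return -(-n // alignment) * alignment


def pages_touched(offset: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages a write of `length` bytes at `offset` overlaps."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


# One page worth of data, aligned versus shifted by 4 KiB.
print(pages_touched(0, PAGE_SIZE))      # aligned write
print(pages_touched(4096, PAGE_SIZE))   # same size, misaligned start
# Padding a 20000-byte buffer to a page boundary wastes the remainder.
print(align_up(20000, PAGE_SIZE) - 20000, "bytes of padding")
```

The misaligned write touches twice as many pages for the same payload, while the padded buffer spends space to stay on a boundary.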


From Wikipedia:

> In December 2012, Taiwanese engineers from Macronix revealed their intention to announce at the 2012 IEEE International Electron Devices Meeting that they had figured out how to improve NAND flash storage read/write cycles from 10,000 to 100 million cycles using a "self-healing" process that used a flash chip with "onboard heaters that could anneal small groups of memory cells."

So can I apply this myself by placing an SSD drive in an oven?


Yes, if you have manufacturer software to factory format blank drives. Heating up heals cells being written (probably filled as writing empties cells while erasing stores max charge value), but also speeds up data degradation in all the other cells not being written to.


Related:

What every programmer should know about solid-state drives - https://news.ycombinator.com/item?id=9049630 - Feb 2015 (31 comments)


A question: do you have a tool that searches the history for previous links, or do you just have a really good memory?


Here's a pointer to past explanations: https://news.ycombinator.com/item?id=29370676.


There's a "past" link on every HN story that shows you previous submissions of the same story.


What sorts of programmers should be concerned about these matters? Page cache doesn't seem too important or interesting in my day to day app and distributed systems development.

Maybe it's useful if you want to make something like a more performant version of grep? (aka ripgrep?)


> Page cache doesn't seem too important or interesting in my day to day app and distributed systems development

This is why we can't have nice things.


How so? Isn't the only point of developing these systems and abstractions so that other people don't have to worry about them?


IMHO, today too many people think "don't have to worry about them" equals "don't need to know anything about it".


I would argue that in most cases you "don't need to know anything about it" either. It's reasonable to deliberately treat abstractions as if they are not leaky, as long as you're aware that all abstractions in fact are leaky and you're equipped to investigate and learn about them if the leaks cause problems.


“don't need to know anything about it” is acceptable, but should not be encouraged.

It’s not like reading 10 bullet points on the subject is “diving deep” and making huge time investment.

It’s just getting the minimal context, so later on at least some keywords are known.


> It’s not like reading 10 bullet points on the subject is “diving deep” and making huge time investment.

True, but you're using so many abstractions that the rule can't feasibly be "read a short summary of every abstraction you're using." There are just too many. At some point you have to choose a threshold where the likelihood of an abstraction leakage is sufficiently low. When you're debugging a CSS selector you will almost certainly never need to know about even the existence of, say, Fermi–Dirac statistics.


> True, but you're using so many abstractions that the rule can't feasibly be "read a short summary of every abstraction you're using."

Rule - no. Goal - yes.

Some topics are more stable and valuable than others, so prioritisation helps.

“How utf8 generally works” vs “implementation details of js-node-utf-related-library-X.”


10 bullet points on every conceivable computer-related topic is, well, a lot more than 10.


One topic at high level (like in the article, 10-20 minutes?) per week, results in ~50 topics per year.

Not sure how many computer related topics you know/want (“The more you know, the more you know you don't know”), but for me, 50 topics on programming seems sufficiently high at frankly a very low effort/commitment.


Fair enough, I’m just not sure how many years would go by before I even thought about SSD performance. Irrelevant to most of my career.


I love how people say this, when the reality is, all the software from the oh-so-coveted is the biggest shit show I’ve seen.

But it’s rarely because some developer didn’t understand page caches, and usually because it obviously didn’t receive enough QA or UX input.


People who read from disks and people who write to them. How SSDs organize data definitely has read and write performance implications and if you're writing to disk, some write habits that are perfectly reasonable on regular disks can cause catastrophically fast wear on SSDs.


Yes, but the number of people who need to be worried about aligning their writes and such is pretty small; certainly not "every" programmer. The author gets into the weeds about certain things application level programmers almost never need to know or concern themselves about. He really doesn't understand what's useful information and what isn't.

If you're programming at enterprise scale, this sort of stuff is the responsibility of architect-level programmers and senior systems engineers.

Even most Linux sysadmins know all about block alignment (well, if they predate most of the various tools figuring out block size/alignment stuff for you). It's nothing new - RAID arrays work best when properly aligned, for example.


> doesn't seem too important or interesting in my day to day app and distributed systems development.

Makes sense to me. At Google we were told to stop thinking about all this stuff, that the storage hardware and software people were responsible for hiding things like wearout from application developers. This article is really "things you should know if you plan to directly access an NVMe device" but there is a huge class of programmers who are better off not knowing.


>At Google we were told to stop thinking about all this

and as a result Chrome slams SSD by writing cached Youtube videos to disk .... except Youtube never reuses cached video data (not even when rewinding more than couple minutes to already watched spot in same video), it explicitly generates hashed requests with custom URL parameters googlevideo.com/videoplayback?expire (~6hour shelf life) &range &sig &lsig. Heavy YT viewing results in wearing out your SSD by tens of gigabytes per day for no particular reason. This is just one small example of side effects from such brilliant decisions.


Yikes, is that really the case? No wonder the 128GB drive on my 2013 MacBook air wore out to 60% of its original performance...


There was an article by Varnish talking about how you should leave the caching and memory management to the OS - even if you can beat the virtual memory manager today you’ll stop improving your home grown solution while RAM and the kernel keep marching on.



My take:

1-13) General background info that informs the rest.

14-25) Important for any programmer that does enough file IO that they need to optimize it.

26-29) Important for any system admin to ensure they aren't inadvertently limiting the performance of their hardware.


Not just programmers. Anyone using ZFS with SSD, whether as the pool itself or in various caches like slog(zil) is going to find this information of use when tuning for better SSD citizenship. Programmers treating SSD like faster spinning rust is like programmers treating S3 like another POSIX filesystem; you can do it, but you're trading away compounding future advantages for that one moment of expedience.


In my career I have found that file system tuning for the devices is an anti-pattern that almost always ends up causing more problems than it's worth.


Are you writing low-level software, such as filesystems, or raw block backed database storage engines? If not, then that's definitely a decent maxim to live by.


Don't your distributed systems use databases of some sort?


And why does a DB user need to know those details? Isn't it the whole point of DB systems to provide an optimized solution that allows users to focus on other things?


Databases always try to flush something to disk after a transaction, just in case an unexpected reboot happens. So your writes to db have direct correlation to disk writes.

Choice of db schema impacts physical layout on ssd. E.g. Different tables are more likely to be on different ssd pages resulting in random writes.

Databases are insanely complex, but not magic.
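
As a minimal illustration of that flush-per-transaction pattern, here is a sketch using only standard library calls. The `durable_append` helper and the record format are made up for the example; real databases use far more elaborate write-ahead logs.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> None:
    """Append a record and ask the OS to push it to stable storage before returning."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()              # drain Python's userspace buffer
        os.fsync(f.fileno())   # ask the kernel to flush its cache to the device


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "commit.log")
    # Each "transaction" costs at least one device write, however small it is.
    for txn in (b"txn-1\n", b"txn-2\n", b"txn-3\n"):
        durable_append(log_path, txn)
    with open(log_path, "rb") as f:
        contents = f.read()
    print(contents)
```

Three tiny commits mean three separate syncs, which is why commit frequency shows up directly in the drive's write counters.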


By the looks of the article? People writing SSD firmware, or SSD drivers.

There is probably a small but non-zero number of these on here.


The author appears to be an EM at Booking.com. It seems unlikely that anyone at Booking would be working on SSD firmware or drivers, but a CDN seems like a reasonable assumption and also a useful place to plumb the depths of SSD implementations.


This guy used to hammer a good point about databases:

"In a time of SSD, multi-core/processor, two terabyte memory and Optane App Direct Mode machines, there is no reason not to build from BCNF data. Time to do what Dr. Codd demonstrated. Technology has finally caught up with the maths."

https://drcoddwasright.blogspot.com (skip the distractions)


I treat ssds like faster hard drives and I have never been disappointed.


Well, there's still flashbench:

https://github.com/bradfa/flashbench

Plus, alternatively, there's FlashBench:

https://github.com/JonghyeokPark/FlashBench

These might be found useful for determining the underlying structure.


Personally I feel like files are an abstraction that are too low-level for your typical new programmer. I find it odd that a typical script use case that you learn in Python 101 is reading a bunch of junk from a file and then writing into another file. Files are finicky and we have much better abstractions than them, like databases.


Since code is typically stored in files, I hope that new programmers would be expected to learn what a file is.


Look at the file system as at a key-value store, only tree-shaped.


A key value store has simple CRUD primitives, does a file system?


I'd say pretty much yes. If the file name is the "key" then most languages have straightforward ways to create and delete files, read data from them, and write data to them.
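
Something like this, as a sketch. `FileKV` is a name invented here; it assumes keys are plain file names in one directory, and it makes no attempt at atomicity or path sanitising.

```python
import tempfile
from pathlib import Path


class FileKV:
    """Treat a directory as a key-value store: file name = key, contents = value."""

    def __init__(self, root: str):
        self.root = Path(root)

    def put(self, key: str, value: bytes) -> None:   # create / update
        (self.root / key).write_bytes(value)

    def get(self, key: str) -> bytes:                # read
        return (self.root / key).read_bytes()

    def delete(self, key: str) -> None:              # delete
        (self.root / key).unlink()

    def keys(self) -> list:
        return sorted(p.name for p in self.root.iterdir() if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    store = FileKV(tmp)
    store.put("greeting", b"hello")
    store.put("greeting", b"hello again")   # update replaces the whole value
    value = store.get("greeting")
    store.put("other", b"x")
    store.delete("other")
    remaining = store.keys()
    print(value, remaining)
```

Note that `put` rewrites the entire value every time, which is fine for small entries and less fine for large ones.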


Not really.. try editing a giant file or reading a file with an odd encoding


How relevant is this in 2022? What's changed and what still applies?


A serious question: What has changed?


Geometry got smaller, thus wear endurance got a LOT worse.


No, 3D NAND helped a lot for durability.


3D QLC NAND, which is what all cheap consumer SSDs are transitioning to, is pretty bad, like 1/10th the durability of common 3D TLC NAND from 3 years ago. And 1/100th the durability of even non 3d MLC NAND from 2014.

Enterprise class 3D TLC NAND is relatively close to enterprise class non 3D MLC NAND, the gap is bigger for consumer drives.

But I think as of 2022 only Apple still sells consumer desktops/laptops with entirely TLC NAND. Everyone else is racing to the bottom for their consumer stuff.


Is anyone aware of a book-length equivalent of this?


Doubt it, books on niche technical subjects don’t seem to be much of a thing anymore unless you’re willing to pay extortionist prices for university textbooks.


There is a book on DRAM, caches and hard drives by Bruce Jacob.

Basically I want what every programmer should know about storage but in the style of Drepper's original article.


> books on niche technical subjects don’t seem to be much of a thing anymore

Why not? Blog posts aren't nearly as valuable.


I assume someone would be writing that book in the hope they'd make money back and that's hard to do with a super niche subject few will be interested in and even fewer would be willing to pay for.


How does that differ from 10 years ago?


I would assume smaller and smaller niches and the information is now easier to find online.


> Blog posts aren't nearly as valuable.

Also since we’re talking about hardware, I imagine a lot of people with necessary domain knowledge can’t share what they’ve learned or done because of IP restrictions.


That was true in the past, when books were, if the GGP is accurate, more common.


If you want to know about some internals of SSD, the only book I know is "Inside Solid State Drives (SSDs)" by G. Wong. It's an old book though.


are speeds of bleeding edge mem devices getting close to RAM?


Not really.

Max throughput is around 6gbps with a fairly high latency. DDR5 has speeds of 52gbps, lower latency, AND your CPU will almost undoubtedly have a cache on it to increase that speed further.

This is all assuming you are putting your mem device on a pci-express bus.


> Max throughput is around 6gbps with a fairly high latency.

In the consumer market, a number of performance NVMe drives will hit over 5GB/sec, which would be 40 Gbps.

The latency isn't anywhere near as good as even quite-old RAM, but modern SSDs are considerably less than an order of magnitude off in transfer speed from even current, common RAM (DDR4) and "only" about a hundred times higher in latency than RAM.

That's pretty stunning from mass storage. So is well over 500,000 IOPS.
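
The unit arithmetic, sketched out. The 5 GB/s figure is the one above; the DDR4-3200 single-channel number (3200 MT/s over a 64-bit bus) is a reference point chosen here for comparison, and real systems usually run two or more channels.

```python
def gbytes_to_gbits(gb_per_s: float) -> float:
    """Convert gigabytes per second to gigabits per second."""
    return gb_per_s * 8


def ddr_channel_gb_per_s(mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth of one memory channel in GB/s (64-bit bus = 8 bytes)."""
    return mega_transfers_per_s * bus_bytes / 1000


ssd_gb_per_s = 5.0
ram_gb_per_s = ddr_channel_gb_per_s(3200)   # assumed DDR4-3200, single channel

print(gbytes_to_gbits(ssd_gb_per_s), "Gbps for the SSD")
print(ram_gb_per_s, "GB/s for one DDR4-3200 channel")
print(f"ratio: about {ram_gb_per_s / ssd_gb_per_s:.1f}x")
```

Even doubling the RAM figure for dual channel keeps the gap close to a single order of magnitude in throughput; latency is where the two remain far apart.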


GP uses the wrong unit for both; both should be in GiB/s.


You're forgetting Optane DIMMs for enterprise.


In terms of bandwidth or latency? All conditions, worst case, best case?


What you should know is that I had an Apple OEM 1TB SSD in my late-2013 MBP and one day it failed so catastrophically under normal conditions that 2 of the best data recovery teams in the world told me there was nothing they could do.

Back up your stuff.


From my experience, SSDs tend to just disappear from the bus when they're done. If there's JTAG pins, maybe it's OEM recoverable, but good luck. At least with spinning disks, they usually have a media failure which often has warning signs. Bearing failures are usually seized at startup and there are ways to get them moving and then do a full dump. If the electronics fail, often you can pull a board from a working unit and attach it to the media and get good results. I don't think it's reasonable to swap flash chips onto another board (but maybe, I dunno?).


Get an 8TB backup drive (Costco has them really cheap), and run Macrium Reflect to clone your HDD onto the backup drive. Macrium Reflect makes use of Volume Shadow Copy, so you can continue using your computer while it's backing things up.

Those big backup HDDs use shingled storage, so they're not any good as general purpose hard drives, but they're excellent for strictly sequential writes, such as a full disk backup to a single file.


Pair that with an online/remote backup and you're all set. I like Backblaze because the software client is very good but you could just as well push your own encrypted backup to S3 or a VPS.


You can also use BackBlaze B2 to push your own backups with whatever software will support it, similarly to how you'd use S3.


I'll admit my memories of 2013 are hazy, but I do recall TRIM being an issue early in the MacBook's history†.

Back up your stuff! I happen to also back up to an SSD these days, because the difference between minutes and hours is hard to argue with.

†edit: history of shipping with an SSD standard, that is.


> because the difference between minutes and hours is hard to argue with.

If the backups are incremental it shouldn’t take hours.


For a given backup an SSD will be much faster, less susceptible to drop and vibration damage, and pocketable where a portable hard drive is pouchable at best.


Incremental backups are slightly higher risk.


Wow, you didn’t have a backup routine. That’s so basic. Why not?

-

Oh, what my routine is? Uh. I `cp -a ~ /mnt/backup/date` a couple of times a month.

... Testing backups?


Speaking about backing up...if one were interested in long term archiving, do magnetic platters offer longer lasting data integrity than SSDs in cold storage?


>..do magnetic platters offer longer lasting data integrity than SSDs in cold storage?

Yes. With an SSD the enemy is electron leakage. Minute quantities of electrons trying to escape an unnatural state and return to equilibrium. (yes, I just anthropomorphized electrons.) Magnets however are more stable by nature. (yes there is nothing natural about hard-drive storage. SMR doubly so!)

Anecdote/anecdata: I have been able to retrieve full drives worth of data off of drives that have sat in a cardboard box for 10 years. I also have trouble accessing data on 1-year old USB flash drives.


The JEDEC standard specifies client SSDs have to retain data powered off for a year under worst case temperature. Enterprise drives have a relaxed requirement of three months. This is because lower programming voltages are used to achieve higher total bytes written endurance.

Even hard disks should be powered on occasionally to test backups.


In general I trust the older tech more than newer for long-term archiving. So that would mean HDD (the oldest tech thereof you can find still sold, probably) or tape or DVD over SSD.

But multiple copies in multiple formats cannot hurt, and the most important stuff should have multiple live copies.


it really depends on the format. pressed DVDs will outlast your VHS tapes


I've been coming around to the POV that "cold storage" is a bad idea and it's best to keep everything hot. It's been discussed on 2.5admins.com a lot.


Not sure about that but I do know that the new sealed helium filled drives are much harder to take apart and do backup recovery on



