I feel like maybe this is "what all filesystem developers should know about solid-state drives"; it's not very obvious how most other developers would interact with a device at the level of abstraction where they'd have the necessary kind of control.
If some typical write pattern from a typical app is wearing out the SSD really fast, I'd say that's the SSD firmware engineer's problem? And I think they've actually done a great job in general, judging by the typical lifespan of SSDs and the typically great performance. I'd argue that if the drive is designed correctly, most programmers shouldn't have to care about low level details. (I did say MOST).
I think you misspelled "it's the user's problem". I don't think most companies care until it becomes something that materially affects them. Until then, users are reliant on the developers of the applications they use to make up for the deficiencies in lower layers.
> A reputation for drives that fail faster than their competitors
How would they get that reputation if they stuff in enough fake reviews, on top of the legion of consumers who would have no idea that the drive was the issue and not "viruses"?
True, mostly! OEMs are the largest non-enterprise market, though, and they do (roughly) the same types of testing and validation an enterprise customer would.
No one is going to be selling a million laptops with a drive from RandoDriveManuGoodBrand off Amazon with no track record and no validation.
Anyone buying the non-name brand types of drives knows they're getting (at best) something that might only work a little while before exploding.
The name brands like Samsung et al. work hard to make their firmware not grenade something (and the drives overall to be AT LEAST as reliable as their competitors) BECAUSE they want the name to mean something. It is what drives customers their way, most of the time.
If they get a reputation as a company selling junk (cough Deskstar/Deathstar) that costs them billions over many years.
The firmware's job is wear leveling - making sure all the sectors wear out at about the same time, which it does a great job at. But SSDs can write so fast that you can burn out the drive in months (maybe even weeks?) if you wanted to. There's nothing the firmware can do to fix the limitations of flash itself. The important thing for write-heavy workloads is to keep write amplification in mind.
I remember an adjacent team to mine that had to store several gigs of data which changed often, but only a small percent changed at any one time. They needed to recover quickly from a crash so they wrote it to disk. But they wrote the entire data set out to disk after every update, instead of keeping it in e.g. rocksdb or even sqlite. Their entire fleet burnt through their SSDs at about the same rate, so machines were dying in rapid succession, ouch. Write amplification is a real problem, but SSDs' great performance often masks it until well down the road.
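To put rough numbers on that story, here is a minimal sketch of the application-level write amplification. The 4 GiB data set and the 1% change rate are made-up figures (the comment only says "several gigs" and "a small percent"), and the function name is hypothetical. It ignores any extra amplification added by the storage engine or by the drive's own FTL.

```python
def write_amplification(bytes_written_to_disk: int, bytes_changed: int) -> float:
    """Ratio of bytes sent to the disk to bytes that logically changed."""
    if bytes_changed <= 0:
        raise ValueError("bytes_changed must be positive")
    return bytes_written_to_disk / bytes_changed


GIB = 1024 ** 3

# Hypothetical workload: a 4 GiB data set where 1% changes per update.
dataset_bytes = 4 * GIB
changed_bytes = dataset_bytes // 100

# Strategy A: dump the whole data set after every update.
full_rewrite = write_amplification(dataset_bytes, changed_bytes)

# Strategy B (idealized): write only the bytes that changed.
incremental = write_amplification(changed_bytes, changed_bytes)

print(f"full rewrite per update: about {full_rewrite:.0f}x")
print(f"writing only the changes: {incremental:.0f}x")
```

Under these assumptions the full dump writes about 100 times more data than actually changed. A real embedded database would land somewhere above the idealized 1x, but most likely far below the full rewrite.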
> But SSDs can write so fast that you can burn out the drive in months (maybe even weeks?)
You can burn out a modern consumer drive in 2 days if you want to. Write perf is ~6 GB/s, and rated endurance is ~700 TB written on a 1 TB drive. The TLC/QLC cells have very poor endurance imo.
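As a sanity check on the "2 days" figure, the arithmetic can be sketched like this. It takes the 6 GB/s and 700 TB numbers from the comment at face value and assumes the drive could sustain that write rate indefinitely, which consumer drives generally cannot, so treat it as a lower bound on the time. The function name is made up for illustration.

```python
def hours_to_exhaust(endurance_tb: float, write_gb_per_s: float) -> float:
    """Hours of sustained writing needed to reach the rated endurance."""
    if write_gb_per_s <= 0:
        raise ValueError("write_gb_per_s must be positive")
    seconds = (endurance_tb * 1000) / write_gb_per_s  # decimal units: 1 TB = 1000 GB
    return seconds / 3600


hours = hours_to_exhaust(endurance_tb=700, write_gb_per_s=6)
print(f"{hours:.1f} hours, or {hours / 24:.2f} days")
```

That works out to roughly 32 hours of flat-out writing, so "2 days" is the right order of magnitude.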
I am surprised the article doesn't mention monitoring TBW (terabytes written). I found it a good indicator of how much data is actually written to the SSD. In my case, I decided to buy cheap consumer SSDs (from WD), because I calculated I have only about 50 TBW per year on my VM drive, a ZFS mirror. In reality, it is even less - so far 14 TBW in 2022, with 16 VMs. See a blog post here for how I monitor the stats in InfluxDB [1]. WD says the drive has an average lifetime of about 300 to 400 TBW, so I can expect at least another 5 years.
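The lifetime estimate above is simple enough to script once you are collecting TBW anyway. This is only a sketch: the function name is hypothetical, it uses the lower 300 TBW rating and the 50 TBW/year estimate from the comment, and it assumes the 14 TBW written in 2022 is everything the drive has seen so far, which the comment does not actually say.

```python
def years_remaining(rated_tbw: float, written_tbw: float, tbw_per_year: float) -> float:
    """Years until the rated TBW is reached at a constant yearly write volume."""
    if tbw_per_year <= 0:
        raise ValueError("tbw_per_year must be positive")
    return max(rated_tbw - written_tbw, 0.0) / tbw_per_year


# Pessimistic inputs: lower rating, upper write-rate estimate.
# written_tbw=14 is an assumption (drive treated as new at the start of 2022).
years = years_remaining(rated_tbw=300, written_tbw=14, tbw_per_year=50)
print(f"about {years:.1f} years left at 50 TBW/year")
```

With those inputs the result is a bit under six years, which lines up with "at least another 5 years". The rated TBW is a warranty-style figure rather than a hard failure point, so the real drive may last longer or shorter.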
> Cells are grouped into a grid, called a block, and blocks are grouped into planes. The smallest unit through which a block can be read or written is a page. Pages cannot be erased individually, only whole blocks can be erased. The size of a NAND-flash page size can vary, and most drive have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256 pages, which means that the size of a block can vary between 256 KB and 4 MB. For example, the Samsung SSD 840 EVO has blocks of size 2048 KB, and each block contains 256 pages of 8 KB each.
Very confusing and might be incorrect. What are planes? And are pages made out of blocks or vice versa? If blocks were grouped into pages, erasing would sound very different. "Only whole blocks can be erased" makes it sound like blocks are bigger than pages.
Planes reflect the physical structure of the storage chips: there's multiple layers that share a common vertical bus.
Plane > Block > Page, that is to say Blocks are always made up of multiple pages (commonly 128 or 256 as the quote mentions). Pages are the unit of read and write, while blocks are the unit of erasure. The FTL tries to hide this page write vs block erase mismatch as best it can, but as the original article points out you may need to be aware of what it's doing in very high performance systems.
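If it helps, the page-write versus block-erase mismatch can be shown with a toy model. This is a deliberately simplified sketch, not how any real FTL is implemented: the `Block` class is invented for illustration, and the 256 pages of 8 KB come from the Samsung 840 EVO example in the quote.

```python
class Block:
    """One erase block: a page can be programmed once, then only erased with its whole block."""

    def __init__(self, pages_per_block: int = 256):
        self.pages = [None] * pages_per_block  # None means "erased"
        self.erase_count = 0

    def program(self, page_index: int, data: bytes) -> None:
        if self.pages[page_index] is not None:
            raise RuntimeError("page already programmed; erase the whole block first")
        self.pages[page_index] = data

    def erase(self) -> None:
        # Erasure always covers every page in the block.
        self.pages = [None] * len(self.pages)
        self.erase_count += 1


block = Block()
block.program(0, b"v1")

try:
    block.program(0, b"v2")  # an in-place overwrite is not possible
    overwrite_failed = False
except RuntimeError:
    overwrite_failed = True

block.erase()            # wipes all 256 pages, not just page 0
block.program(0, b"v2")

block_size_kb = len(block.pages) * 8  # 256 pages of 8 KB each
print(overwrite_failed, block.erase_count, block_size_kb)
```

A real FTL avoids the erase on the hot path by writing the new version to a fresh page somewhere else and marking the old page stale, then reclaiming stale pages later during garbage collection.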
A single NAND die is only divided into two or four planes. It's a function of how many copies of the peripheral circuitry for accessing the array are included, not how many layers are in the 3D NAND array. More planes means the die can do more things in parallel (subject to constraints).
A drive with 8 dies, each having 512 Gbit capacity divided into four planes per die, will perform almost as well as one with 16 dies of 256 Gbit divided into two planes, other things being equal (e.g. number and speed of the channels between the SSD controller and the NAND, page and block sizes and access times, all of which are subject to change at the same time a generational change increases die capacity and number of planes).
> Splitting cold and hot data as much as possible into separate pages will make the job of the garbage collector easier.
How do I tell my SSD to write stuff to specific pages? You can't really tell the SSD to do anything except read, write, or trim LBAs.
Does NVMe support this with its queues?
> 27. Over-provisioning is useful for wear leveling and performance
I thought most if not all SSDs were already overprovisioned. Does additional overprovisioning help?
> To ensure that logical writes are truly aligned to the physical memory, you must align the partition to the NAND-flash page size of the drive.
I think this is false. This assumes there is a one-to-one mapping of LBA to SSD PBA which you don't know. LBA 2048 could go to any PBA on any page/block/flash line in the unit and as things are written and rewritten, any correspondence that might happen due to sequential assignment of PBAs->LBAs would gradually diminish, IF you knew for sure that was happening in the first place. Because you wouldn't really know what the SSD is doing without reverse engineering or seeing the source code of firmware, unless there's things going on in NVMe land that are new and I don't yet know.
I wrote a series of articles that covered the new features defined for NVMe drives. The general pattern is that there are now lots of optional hints that drives and host systems can exchange about data placement, alignment and lifetime. But there are also alternative paradigms available like Zoned Storage that break compatibility to offer explicit control. These features are mostly only implemented in enterprise SSDs, and often only if a big customer specifically asks for them.
I've been thinking about the possibility of "dumb" SSD devices.
All of the current HW-level performance hacks could actually get in the way if your software already enforces things like single writer, chunky writes and/or append-only log structures.
Give me a drive that only writes in 1 linear direction (until it's full) and has a big red button to clean the entire thing all at once (which would clearly require some offline processing time & multiple disks for a realistic system).
From a low-level programmatic standpoint, managing size and alignment with (potentially unknown) page sizes throws up the same challenges as for AV buffers and network packet MTU/sizes - either side of "just right" is suboptimal.
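A small sketch of why either side of "just right" costs you: it counts how many fixed-size pages a write touches. The 4096-byte page size is an assumption for illustration (the real NAND page size is usually not visible to the host), the helper names are made up, and the FTL may remap things so that logical pages do not line up with physical ones at all.

```python
def align_down(offset: int, unit: int) -> int:
    """Largest multiple of unit that is <= offset."""
    return offset - (offset % unit)


def align_up(offset: int, unit: int) -> int:
    """Smallest multiple of unit that is >= offset."""
    return -(-offset // unit) * unit


def pages_touched(offset: int, length: int, page_size: int) -> int:
    """Number of page_size-sized pages overlapped by a write of length bytes at offset."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


PAGE = 4096  # assumed page size, for illustration only

aligned = pages_touched(offset=0, length=PAGE, page_size=PAGE)        # exactly one page
misaligned = pages_touched(offset=512, length=PAGE, page_size=PAGE)   # straddles two pages
undersized = pages_touched(offset=0, length=100, page_size=PAGE)      # still costs a full page

print(aligned, misaligned, undersized)
print(align_down(5000, PAGE), align_up(5000, PAGE))
```

The same write size costs one page when aligned and two when shifted by 512 bytes, and a 100-byte write still occupies a whole page.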
> In December 2012, Taiwanese engineers from Macronix revealed their intention to announce at the 2012 IEEE International Electron Devices Meeting that they had figured out how to improve NAND flash storage read/write cycles from 10,000 to 100 million cycles using a "self-healing" process that used a flash chip with "onboard heaters that could anneal small groups of memory cells."
So can I apply this myself by placing an SSD drive in an oven?
Yes, if you have manufacturer software to factory format blank drives. Heating up heals cells being written (probably filled as writing empties cells while erasing stores max charge value), but also speeds up data degradation in all the other cells not being written to.
What sorts of programmers should be concerned about these matters? Page cache doesn't seem too important or interesting in my day to day app and distributed systems development.
Maybe it's useful if you want to make something like a more performant version of grep? (aka ripgrep?)
I would argue that in most cases you "don't need to know anything about it" either. It's reasonable to deliberately treat abstractions as if they are not leaky, as long as you're aware that all abstractions in fact are leaky and you're equipped to investigate and learn about them if the leaks cause problems.
> It’s not like reading 10 bullet points on the subject is “diving deep” and making huge time investment.
True, but you're using so many abstractions that the rule can't feasibly be "read a short summary of every abstraction you're using." There are just too many. At some point you have to choose a threshold where the likelihood of an abstraction leakage is sufficiently low. When you're debugging a CSS selector you will almost certainly never need to know about even the existence of, say, Fermi–Dirac statistics.
One topic at high level (like in the article, 10-20 minutes?) per week, results in ~50 topics per year.
Not sure how many computer related topics you know/want (“The more you know, the more you know you don't know”), but for me, 50 topics on programming seems sufficiently high at frankly a very low effort/commitment.
People who read from disks and people who write to them. How SSDs organize data definitely has read and write performance implications, and if you're writing to disk, some write habits that are perfectly reasonable on regular disks can cause catastrophically fast wear on SSDs.
Yes, but the number of people who need to be worried about aligning their writes and such is pretty small; certainly not "every" programmer. The author gets into the weeds about certain things application level programmers almost never need to know or concern themselves about. He really doesn't understand what's useful information and what isn't.
If you're programming at enterprise scale, this sort of stuff is the responsibility of architect-level programmers and senior systems engineers.
Even most Linux sysadmins know all about block alignment (well, if they predate most of the various tools figuring out block size/alignment stuff for you). It's nothing new - RAID arrays work best when properly aligned, for example.
> doesn't seem too important or interesting in my day to day app and distributed systems development.
Makes sense to me. At Google we were told to stop thinking about all this stuff, that the storage hardware and software people were responsible for hiding things like wearout from application developers. This article is really "things you should know if you plan to directly access an NVMe device" but there is a huge class of programmers who are better off not knowing.
>At Google we were told to stop thinking about all this
and as a result Chrome slams SSDs by writing cached YouTube videos to disk... except YouTube never reuses cached video data (not even when rewinding more than a couple of minutes to an already-watched spot in the same video); it explicitly generates hashed requests with custom URL parameters: googlevideo.com/videoplayback?expire (~6 hour shelf life) &range &sig &lsig. Heavy YT viewing results in wearing out your SSD by tens of gigabytes per day for no particular reason. This is just one small example of side effects from such brilliant decisions.
There was an article by Varnish talking about how you should leave the caching and memory management to the OS - even if you can beat the virtual memory manager today you’ll stop improving your home grown solution while RAM and the kernel keep marching on.
Not just programmers. Anyone using ZFS with SSD, whether as the pool itself or in various caches like slog(zil) is going to find this information of use when tuning for better SSD citizenship. Programmers treating SSD like faster spinning rust is like programmers treating S3 like another POSIX filesystem; you can do it, but you're trading away compounding future advantages for that one moment of expedience.
Are you writing low-level software, such as filesystems, or raw block backed database storage engines? If not, then that's definitely a decent maxim to live by.
And why does a DB user need to know those details? Isn't it the whole point of DB systems to provide an optimized solution that allows users to focus on other things?
Databases always try to flush something to disk after a transaction, just in case an unexpected reboot happens. So your writes to the db have a direct correlation to disk writes.
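For anyone who hasn't seen what "flush after a transaction" looks like from the application side, here is a minimal sketch. It is not how any particular database does it (real engines use write-ahead logs and batch several commits into one flush); `durable_append` and the `commit.log` file name are made up for illustration.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> None:
    """Append a record and ask the OS to push it to stable storage before returning."""
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # drain Python's userspace buffer into the OS page cache
        os.fsync(f.fileno())  # ask the kernel to flush the file's data to the device


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "commit.log")  # hypothetical file name
    for i in range(3):
        # One fsync per "transaction": every commit becomes at least one device write.
        durable_append(log_path, f"txn {i}\n".encode())
    with open(log_path, "rb") as f:
        lines = f.read().splitlines()

print(lines)
```

Each tiny commit here forces its own flush, so many small transactions turn into many small device writes. That is the correlation the comment is pointing at.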
Choice of db schema impacts physical layout on the SSD. E.g. different tables are more likely to be on different SSD pages, resulting in random writes.
The author appears to be an EM at Booking.com. It seems unlikely that anyone at Booking would be working on SSD firmware or drivers, but a CDN seems like a reasonable assumption and also a useful place to plumb the depths of SSD implementations.
This guy used to hammer a good point about databases:
"In a time of SSD, multi-core/processor, two terabyte memory and Optane App Direct Mode machines, there is no reason not to build from BCNF data. Time to do what Dr. Codd demonstrated. Technology has finally caught up with the maths."
Personally I feel like files are an abstraction that is too low-level for your typical new programmer. I find it odd that a typical script use case that you learn in Python 101 is reading a bunch of junk from a file and then writing it into another file. Files are finicky and we have much better abstractions than them, like databases.
I'd say pretty much yes. If the file name is the "key" then most languages have straightforward ways to create and delete files, read data from them, and write data to them.
3D QLC NAND, which is what all cheap consumer SSDs are transitioning to, is pretty bad, like 1/10th the durability of common 3D TLC NAND from 3 years ago. And 1/100th the durability of even non-3D MLC NAND from 2014.
Enterprise-class 3D TLC NAND is relatively close to enterprise-class non-3D MLC NAND; the gap is bigger for consumer drives.
But I think as of 2022 only Apple still sells consumer desktops/laptops with entirely TLC NAND. Everyone else is racing to the bottom for their consumer stuff.
Doubt it, books on niche technical subjects don’t seem to be much of a thing anymore unless you’re willing to pay extortionist prices for university textbooks.
I assume someone would be writing that book in the hope they'd make money back and that's hard to do with a super niche subject few will be interested in and even fewer would be willing to pay for.
Also, since we’re talking about hardware, I imagine a lot of people with the necessary domain knowledge can’t share what they’ve learned because of IP restrictions.
Max throughput is around 6gbps with a fairly high latency. DDR5 has speeds of 52gbps, lower latency, AND your CPU will almost undoubtedly have a cache on it to increase that speed further.
This is all assuming you are putting your mem device on a pci-express bus.
> Max throughput is around 6gbps with a fairly high latency.
In the consumer market, a number of performance NVMe drives will hit over 5GB/sec, which would be 40 Gbps.
The latency isn't anywhere near as good as even quite-old RAM, but modern SSDs are considerably less than an order of magnitude off in transfer speed from even current, common RAM (DDR4) and "only" about a hundred times higher in latency than RAM.
That's pretty stunning from mass storage. So is well over 500,000 IOPS.
What you should know is that I had an Apple OEM 1TB SSD in my late-2013 MBP and one day it failed so catastrophically under normal conditions that 2 of the best data recovery teams in the world told me there was nothing they could do.
From my experience, SSDs tend to just disappear from the bus when they're done. If there's JTAG pins, maybe it's OEM recoverable, but good luck. At least with spinning disks, they usually have a media failure which often has warning signs. Bearing failures are usually seized at startup and there are ways to get them moving and then do a full dump. If the electronics fail, often you can pull a board from a working unit and attach it to the media and get good results. I don't think it's reasonable to swap flash chips onto another board (but maybe, I dunno?).
Get an 8TB backup drive (Costco has them really cheap), and run Macrium Reflect to clone your HDD onto the backup drive. Macrium Reflect makes use of Volume Shadow Copy, so you can continue using your computer while it's backing things up.
Those big backup HDDs use shingled storage, so they're not any good as general purpose hard drives, but they're excellent for strictly sequential writes, such as a full disk backup to a single file.
Pair that with an online/remote backup and you're all set. I like Backblaze because the software client is very good but you could just as well push your own encrypted backup to S3 or a VPS.
For a given backup an SSD will be much faster, less susceptible to drop and vibration damage, and pocketable where a portable hard drive is pouchable at best.
Speaking about backing up...if one were interested in long term archiving, do magnetic platters offer longer lasting data integrity than SSDs in cold storage?
>..do magnetic platters offer longer lasting data integrity than SSDs in cold storage?
Yes. With an SSD the enemy is electron leakage. Minute quantities of electrons trying to escape an unnatural state and return to equilibrium. (yes, I just anthropomorphized electrons.) Magnets however are more stable by nature. (yes there is nothing natural about hard-drive storage. SMR doubly so!)
Anecdote/anecdata: I have been able to retrieve full drives worth of data off of drives that have sat in a cardboard box for 10 years. I also have trouble accessing data on 1-year old USB flash drives.
The JEDEC standard specifies that client SSDs have to retain data powered off for a year under worst-case temperature. Enterprise drives have a relaxed requirement of three months. This is because lower programming voltages are used to achieve higher total-bytes-written endurance.
Even hard disks should be powered on occasionally to test backups.
In general I trust the older tech more than newer for long-term archiving. So that would mean HDD (the oldest tech thereof you can find still sold, probably) or tape or DVD over SSD.
But multiple copies in multiple formats cannot hurt, and the most important stuff should have multiple live copies.