Hacker News
NAND Error-correction Code (kernel.org)
93 points by signa11 on Nov 22, 2023 | hide | past | favorite | 39 comments


A small remark for the casual readers: This is not about SSDs, this is for raw NAND flash access.

On an embedded device, you might have a piece of NAND flash directly attached to the CPU e.g. via a parallel bus or SPI. Usually with a controller somewhere in between, or integrated in the SoCs.

In Linux, there is a subsystem called MTD (memory technology devices) for dealing with not-quite-block storage that provides a thin hardware abstraction layer with an internal driver framework for NAND flash and backends for all sorts of controllers.

Some particularly cheap flash chips/controllers do not support hardware bit-error correction. There are IIRC 2 software engines that a driver can use as a fallback: Hamming- and BCH[1]-based (the former is documented here).

The MTD subsystem is really just a thin abstraction layer though, and provides a uniform API for e.g. page read/write & block erase for the flash chip it talks to. There is another subsystem called UBI (Unsorted Block Images) that can be stacked on top of MTD. It takes care of wear-leveling and bad block management and implements LVM-style logical volume partitioning. It also does not emulate a block device, but acts more like a flash device with idealized properties.

There are 2 filesystems in Linux which are designed to deal with the weirdness of those devices: the older JFFS2, which stacks directly on top of MTD, and UBIFS on top of UBI.

[1] https://en.wikipedia.org/wiki/BCH_code


> There are 2 filesystems in Linux which are designed to deal with the weirdness of those devices

You forgot yaffs2 which was commonly used in the early days of android and runs on top of MTD


yaffs2 was never upstream.


This was used in the PocketCHIP, for instance. They changed SLC to MLC and the error margin got much smaller. In other words: if you let the device shut down by hardware due to a low battery, you would lose lots of files.


It’s interesting to me that Linux doesn’t have support for putting ext/xfs/… on top of MTD/UBI devices (i.e. making wear leveling more independent). I imagine many people installing on normal SSDs with built-in controllers use XFS. Not sure why we don’t see more raw NAND with a basic PCIe controller, with Linux doing all the wear leveling & whatnot, given the problems vendors have with flushing. You could even have an attached M3 that Linux could load companion software onto for accelerating performance, to make it even more on par with traditional SSDs.


> It’s interesting to me that Linux doesn’t have support for putting ext/xfs/… on top of MTD/UBI devices

Raw NAND has a number of characteristics which are incredibly unpleasant for a traditional filesystem to deal with, including:

1) No small writes. You usually have to program a whole page at a time, which is often larger than a sector on a hard disk (~16 KiB). Filesystems which are accustomed to being able to flush single-byte writes to disk will be very disappointed by this.

2) No overwriting. Once a page has been written to, you can't write to it again until it's been erased. This makes even emulating small writes difficult.

3) Erase is a big hammer. It operates on a large group of pages (hundreds of KiB up to tens of MiB, depending on the chip) and erases them all at once.

4) Bad sectors. Sometimes an erase block is just bad on chip and you have to work around it. And, as flash wears out, blocks can go bad, and you'll just have to hope there's a spare available.

Bottom line is -- if you want to write a filesystem that works with flash directly, you have to write it to work that way from the start. Modifying a traditional filesystem like ext4 to run on flash would effectively be a complete rewrite.
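The constraints above can be sketched as a toy model. The names and sizes here are made up for illustration (real chips use e.g. 16 KiB pages and much larger blocks); the point is only how program and erase interact:

```c
#include <stdint.h>
#include <string.h>

/* Toy model of constraints 1-3: pages are programmed whole, programming can
 * only clear bits (1 -> 0), and only a whole-block erase sets them back to 1.
 * Sizes are illustrative, not any real chip's geometry. */
enum { PAGE_SIZE = 4, PAGES_PER_BLOCK = 4 };

typedef struct {
    uint8_t page[PAGES_PER_BLOCK][PAGE_SIZE];
} nand_block;

/* Erase is the big hammer: it wipes the whole block at once. */
static void nand_erase(nand_block *b) {
    memset(b->page, 0xFF, sizeof b->page);
}

/* Programming ANDs the new data in: bits can go 1 -> 0 but never back. */
static void nand_program(nand_block *b, int p, const uint8_t data[PAGE_SIZE]) {
    for (int i = 0; i < PAGE_SIZE; i++)
        b->page[p][i] &= data[i];
}
```

Programming the same page twice just ANDs the two payloads together, which is why "overwriting" a page without an erase silently corrupts it rather than replacing it.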


> Modifying a traditional filesystem like ext4 to run on flash would effectively be a complete rewrite.

I mean, that’s patently incorrect though… All modern PCs are running on raw NAND with a flash controller mediating things to present a legacy block device, and ext/xfs etc. run just fine this way. Heck, even Apple OSes, which AFAIK do have their own in-house flash controller, had HFS running on top of the translation layer rather than baking those requirements into the FS (I think this is also true for APFS but not 100% certain).

So while everything you wrote about the complexities of raw NAND is true, it doesn’t follow that the filesystem MUST handle all that complexity itself. Something designed to take full advantage might perform better, but that’s always true, and cheaper storage solutions with reliable fsync might be a worthwhile trade-off (+ you can always run a filesystem that’s better designed to accommodate the underlying HW details).


This is true; Linux developers could write their own fully-featured FTL. However, it's difficult. To quote the MTD docs:

"Unfortunately it is a rather difficult task to create a good FTL layer and nobody still managed to implement one for Linux."


Out of curiosity, are you sure about #1? I ask because a long, long time ago I worked with some raw NAND on embedded devices and I remember doing small writes; in fact I could do single-byte writes (write only, the erase was in big blocks as you said).


On NAND there is the NOP property (Number of programs). For SLC it is usually 4. So, you can program a single page in up to 4 steps on SLC NAND. If you write less than that, you're wasting space because you need to jump to the next (sub)page. Re-programming is not allowed on NAND.


Partial page programming was supported on some older SLC NAND, but is usually unsupported on modern MLC(/TLC/QLC) flash.


You can use a normal file system on top of UBI. There's some rather large caveats, however: http://www.linux-mtd.infradead.org/doc/ubi.html#L_ubiblock


To support a block-based filesystem on top of MTD, an advanced FTL (Flash Translation Layer) is needed. Sadly, most algorithms for decent FTLs are patented; SSD vendors try hard to keep their magic sauce their own. In Linux we have some trivial FTLs implemented, but even these are patent-toxic.


While vendors would like you to believe it’s all patented secret sauce, I’m more skeptical. There are too many vendors, and they all perform roughly similarly, with the controller chip speed and bus speed being more impactful. The FTL might matter more for disk longevity, but I wouldn’t put much stock in consumer device numbers here. Even if Linux starts with an initially worse implementation, over time it should succeed because of cheaper costs multiplied by a greater aggregate investment from stakeholders than any single private stakeholder could match. There’s a reason Apple and Amazon have written their own controllers, and it’s largely because vendor solutions actually suck quite a bit when you start digging in.


And neither Apple nor Google published their FTLs...


I think it is possible to create an emulated block device stacked on UBI.


That said, it should be noted that there are still USB flash drives in the wild that do no error detection. It's a pity that `badblocks` is so poorly supported these days ...


I thought cheap flash devices are only usable as products because of error correction. USB flash drives and memory cards use reject-tier NAND and compensate by pre-marking bad sectors and heavy ECC. Your files exist as a probabilistic approximation.


I suppose it's possible that they simply do not do enough error correction, but given that I don't get I/O errors, only wrong data, I doubt it.

Typical algorithms work like:

* 1 bad bit is corrected

* 2 bad bits produce an error. Since there's nothing else meaningful to do, this can only propagate to the higher layer.

* 3 or more bad bits may produce wrong data or an error

so we wouldn't expect to hit case 3 without hitting case 2.
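The three cases above can be demonstrated with a minimal Hamming(8,4) SEC-DED code. This is a toy, NOT the kernel's NAND ECC (which protects 256-byte blocks with three ECC bytes), but the correct/detect/miscorrect behavior is the same:

```c
/* Hamming(8,4) SEC-DED toy: bit 0 holds an overall parity bit, bits 1,2,4
 * hold Hamming parity, bits 3,5,6,7 hold the four data bits. */
static unsigned bit(unsigned v, int i) { return (v >> i) & 1u; }

static unsigned encode(unsigned nib) {
    unsigned c = (bit(nib,0) << 3) | (bit(nib,1) << 5)
               | (bit(nib,2) << 6) | (bit(nib,3) << 7);
    c |= (bit(c,3) ^ bit(c,5) ^ bit(c,7)) << 1;   /* p1 */
    c |= (bit(c,3) ^ bit(c,6) ^ bit(c,7)) << 2;   /* p2 */
    c |= (bit(c,5) ^ bit(c,6) ^ bit(c,7)) << 4;   /* p4 */
    unsigned p0 = 0;
    for (int i = 1; i < 8; i++) p0 ^= bit(c, i);
    return c | p0;                                /* overall parity at bit 0 */
}

/* Returns 0: clean, 1: single error corrected, -1: uncorrectable.
 * *out gets the (possibly corrected) nibble when the result is >= 0. */
static int decode(unsigned c, unsigned *out) {
    unsigned s = (bit(c,1) ^ bit(c,3) ^ bit(c,5) ^ bit(c,7))
               | (bit(c,2) ^ bit(c,3) ^ bit(c,6) ^ bit(c,7)) << 1
               | (bit(c,4) ^ bit(c,5) ^ bit(c,6) ^ bit(c,7)) << 2;
    unsigned overall = 0;
    for (int i = 0; i < 8; i++) overall ^= bit(c, i);

    int status = 0;
    if (s && overall)       { c ^= 1u << s; status = 1; } /* case 1: fix it  */
    else if (s && !overall) return -1;                    /* case 2: report  */
    else if (!s && overall) status = 1;         /* parity bit itself flipped */
    *out = bit(c,3) | bit(c,5) << 1 | bit(c,6) << 2 | bit(c,7) << 3;
    return status;
}
```

Flipping any one bit is corrected, any two bits return -1, and three well-chosen flips (case 3) can cancel the syndrome and decode to wrong data without any error being raised.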


> On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor 5 (big endian mode, gcc 4.1.2, -O3)

Now that's a name I hadn't heard in a long long time.

The post is from 2008 anyway, I wonder what would happen if one were to put this code (or the earlier, non-unrolled-loop versions) to the test with modern compilers that can do all sorts of black magic for performance, or if it could be extended to take advantage of modern CPU features.


https://godbolt.org/z/szMvnTT63

Looking at the naive code from Analysis 1 and comparing GCC 4.1.2 to GCC 13.2, there's a huge difference, even when forcing both to generate code for the same old architecture (k8) so it's more fair.

GCC 4.1 generates code which is a pretty direct translation of the source: branchy, operates on bytes.

GCC 13.2 does a ton of optimization. There's only a single jump/loop. It uses prefetch. It uses SSE instructions to operate on 128-bits at a time. It looks like all the instructions are duplicated as well to operate on a pair of registers, too. I'm not exactly sure.


GCC is just vectorizing the code literally. It's doing multiple bytes in each loop iteration, but the vectorized code is still bit-testing the index to conditionally XOR into each accumulator. This is still much less efficient than taking advantage of the regular patterns as in the latter parts of the article. It also runs with far fewer lanes than it could, because it uses 32-bit lanes instead of 8-bit for some reason.
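For readers without the article open, the branchy "Analysis 1" shape being discussed looks roughly like this. This is an illustrative sketch, not the kernel's exact accumulator layout: walk 256 bytes, and for each of the 8 index bits XOR the byte into one of two accumulators, chosen by a branch:

```c
#include <stdint.h>
#include <string.h>

/* Naive line-parity sketch: rp[2k] accumulates bytes whose index has bit k
 * clear, rp[2k+1] those with bit k set. The inner if/else is the pattern
 * that the vectorizer turns into conditional-XOR lanes. */
static void naive_line_parity(const uint8_t buf[256], uint8_t rp[16]) {
    memset(rp, 0, 16);
    for (int i = 0; i < 256; i++) {
        for (int k = 0; k < 8; k++) {
            if (i & (1 << k))
                rp[2 * k + 1] ^= buf[i];   /* index bit k set   */
            else
                rp[2 * k] ^= buf[i];       /* index bit k clear */
        }
    }
}
```

The article's later versions exploit the regularity of which indices feed which accumulator instead of re-testing every bit of every index.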


I encountered NAND ECC about 10 years ago. We wanted to increase the size of the NAND flash on our embedded device (a data logger and gateway used to monitor and control solar power systems). The larger NAND parts didn't have built in error correction. Enabling software error correction meant we had to move the kernel onto a small SPI-flash part.

This was all pretty straightforward, and probably increased reliability, but it introduced new complexity into our build, install and upgrade process, because now I had to take into account the different flash layout in otherwise almost identical devices.

Since then I learned it's much easier to use eMMC. You don't need software block management and wear leveling, you just read the specs and let the manufacturer handle everything for you.


(2008) (Note the reference to gcc 4.2.)


Is there an NVMe manufacturer that sells their product as just naked flash? The first rule of hardware is that hardware providers don't know how to write software. It would be nice if we could get rid of the proprietary and buggy embedded operating systems running our storage as an opaque abstraction layer. Things like lying about flush that was recently on HN would stop being a problem.


While not going as far as you want, one movement in that direction is NVMe Zoned Namespaces (https://www.anandtech.com/show/15959/nvme-zoned-namespaces-e...).


Flexible Data Placement is another interesting development in the NVMe space.


These are called "open-channel SSDs", which seem to currently mainly be consigned to research purposes AFAICT. But some do exist.

https://en.wikipedia.org/wiki/Open-Channel_SSD


The NAND flash manufacturers usually have no software involvement unless they are in the business of also selling SSD controller SoCs. They will typically sell FDKs/SDKs that are heaping piles of garbage written overseas for their SoCs that provide minimum functionality for interfacing with all of the components.

I wouldn't really expect controller-less NAND flash to ever meaningfully compete with what we have now. There is a lot more specialized hardware in a typical SoC than just a microprocessor and NAND flash chips


Well, by definition, naked flash wouldn't be NVMe... so no.


It would just be flash


I expected discussion of __builtin_parity and cmov at least.


This article references gcc 4.1, __builtin_parity is likely newer than that.
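For context, __builtin_parity(x) computes the XOR of all bits of x (1 if an odd number of bits are set). The shift-fold below is a portable equivalent sketch (not code from the article):

```c
/* Portable parity fold: repeatedly XOR the halves together until bit 0
 * holds the XOR of all 32 input bits. */
static int parity_fold(unsigned x) {
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}
```

On GCC/Clang this matches __builtin_parity(x), which the compiler may lower to a popcount or a parity-flag trick depending on the target.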


Me too. Reading the article, there seems to be no attempt to replace the two remaining "if" statements and so avoid branches. It doesn't even need to be CMOV: possibly introduce tmppar1 and tmppar2 = tmppar & (an operation that is either 0 or -1), and then always do rp12 ^= tmppar1; rp14 ^= tmppar2; -- it might not be faster, but that's what I would try at least.


One more... Any benefit would be CPU specific:

if ((i & 0x1) == 0) rp12 ^= tmppar;

Seems like it's ripe for something like:

rp12 ^= tmppar & ~(0 - (i & 0x1));

If my logic is functioning ;-) this eliminates the branch, but many modern CPUs running Linux probably have a conditional move, which should be faster with the straightforward version.

Edit: except on RISC-V, where there are no conditionals other than branches. Maybe this will help on that arch.
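A side-by-side sketch of the two variants (variable names taken from the thread's snippet), showing they compute the same thing:

```c
#include <stdint.h>

/* Branchy original: XOR only when i is even. */
static uint32_t branchy(uint32_t rp12, uint32_t tmppar, unsigned i) {
    if ((i & 0x1) == 0)
        rp12 ^= tmppar;
    return rp12;
}

/* Branchless: 0 - (i & 1) is all-ones when i is odd, so its complement
 * masks tmppar to zero on odd i and passes it through on even i. */
static uint32_t branchless(uint32_t rp12, uint32_t tmppar, unsigned i) {
    return rp12 ^ (tmppar & ~(0u - (i & 0x1u)));
}
```

Whether the mask beats a branch (or a compiler-generated cmov) is CPU-specific, as noted above.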


Not directly related, but I wish there was a 'standard' Linux documentation that looked more like the Python docs.


The kernel docs are generated with Sphinx, just like the Python docs are. Are you saying you wish we used the Python theme?


No, I was referring to the structure of the Python docs. More specifically, I want there to be something like the 'Library Reference'. It starts from the basics and moves to more abstract/advanced topics. I guess the current docs are optimized for many different use cases, but as a beginner, it is very hard for me to get an 'Overview' of Linux from the official docs. There should be better categorization IMO!


There is the Linux Programming Environment book, which covers most of the library.



