Hacker News
NAND Error-correction Code (kernel.org)
93 points by signa11 on Nov 22, 2023 | hide | past | favorite | 39 comments


A small remark for the casual readers: This is not about SSDs, this is for raw NAND flash access.

On an embedded device, you might have a piece of NAND flash directly attached to the CPU e.g. via a parallel bus or SPI. Usually with a controller somewhere in between, or integrated in the SoCs.

In Linux, there is a subsystem called MTD (memory technology devices) for dealing with not-quite-block storage that provides a thin hardware abstraction layer with an internal driver framework for NAND flash and backends for all sorts of controllers.

Some particularly cheap flash chips/controllers do not support hardware bit-error correction. There are IIRC 2 software engines that a driver can use as a fallback: Hamming- and BCH[1]-based (the former is documented here).

The MTD subsystem is really just a thin abstraction layer though, and provides a uniform API for e.g. page read/write & block erase for the flash chip it talks to. There is another subsystem called UBI (Unsorted Block Images) that can be stacked on top of MTD. It takes care of wear-leveling and bad block management and implements LVM-style logical volume partitioning. It also does not emulate a block device, but acts more like a flash device with idealized properties.

There are 2 filesystems in Linux which are designed to deal with the weirdness of those devices: the older JFFS2, which stacks directly on top of MTD, and UBIFS on top of UBI.

[1] https://en.wikipedia.org/wiki/BCH_code


> There are 2 filesystems in Linux which are designed to deal with the weirdness of those devices

You forgot yaffs2 which was commonly used in the early days of android and runs on top of MTD


yaffs2 was never upstream.


This was used in the PocketCHIP, for instance. They changed SLC to MLC and the error margin got much smaller. In other words: if you let the device shut down by hardware due to a low battery, you would lose lots of files.


It’s interesting to me that Linux doesn’t have support for putting ext/xfs/… on top of MTD/UBI devices (i.e. making wear leveling more independent). I imagine many people installing on normal SSDs with built-in controllers use XFS. Not sure why we don’t see more raw NAND with a basic PCIe controller, with Linux doing all the wear leveling & whatnot, given the problems vendors have with flushing. You could even have an attached M3 that Linux could load companion software onto for accelerating performance, to make it even more on par with traditional SSDs.


> It’s interesting to me that Linux doesn’t have support for putting ext/xfs/… on top of MTD/UBI devices

Raw NAND has a number of characteristics which are incredibly unpleasant for a traditional filesystem to deal with, including:

1) No small writes. You usually have to program a whole page at a time, which is often larger than a sector on a hard disk (~16 KiB). Filesystems which are accustomed to being able to flush single-byte writes to disk will be very disappointed by this.

2) No overwriting. Once a page has been written to, you can't write to it again until it's been erased. This makes even emulating small writes difficult.

3) Erase is a big hammer. It operates on a large group of pages (hundreds of KiB up to tens of MiB, depending on the chip) and erases them all at once.

4) Bad sectors. Sometimes an erase block is just bad on chip and you have to work around it. And, as flash wears out, blocks can go bad, and you'll just have to hope there's a spare available.

Bottom line is -- if you want to write a filesystem that works with flash directly, you have to write it to work that way from the start. Modifying a traditional filesystem like ext4 to run on flash would effectively be a complete rewrite.
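The constraints above can be sketched as a toy model. The names and sizes here are made up for illustration (real chips use e.g. 16 KiB pages and much larger blocks); the point is only how program and erase interact:

```c
#include <stdint.h>
#include <string.h>

/* Toy model of constraints 1-3: pages are programmed whole, programming can
 * only clear bits (1 -> 0), and only a whole-block erase sets them back to 1.
 * Sizes are illustrative, not any real chip's geometry. */
enum { PAGE_SIZE = 4, PAGES_PER_BLOCK = 4 };

typedef struct {
    uint8_t page[PAGES_PER_BLOCK][PAGE_SIZE];
} nand_block;

/* Erase is the big hammer: it wipes the whole block at once. */
static void nand_erase(nand_block *b) {
    memset(b->page, 0xFF, sizeof b->page);
}

/* Programming ANDs the new data in: bits can go 1 -> 0 but never back. */
static void nand_program(nand_block *b, int p, const uint8_t data[PAGE_SIZE]) {
    for (int i = 0; i < PAGE_SIZE; i++)
        b->page[p][i] &= data[i];
}
```

Programming the same page twice just ANDs the two payloads together, which is why "overwriting" a page without an erase silently corrupts it rather than replacing it.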


> Modifying a traditional filesystem like ext4 to run on flash would effectively be a complete rewrite.

I mean, that’s patently incorrect though… All modern PCs are running on raw NAND with a flash controller mediating things to present a legacy block device, and ext/xfs etc. run just fine this way. Heck, even Apple OSes, which AFAIK do have their own in-house flash controller, had HFS running on top of the translation layer rather than baking those requirements into the FS (I think this is also true for APFS but not 100% certain).

So while everything you wrote about the complexities of raw NAND is true, it doesn’t follow that the filesystem MUST handle all that complexity itself. Something designed to take full advantage might perform better, but that’s always true, and cheaper storage solutions with reliable fsync might be a worthwhile trade-off (+ you can always run a filesystem that’s better designed to accommodate the underlying HW details).


This is true; Linux developers could write their own fully-featured FTL. However, it's difficult. To quote the MTD docs:

"Unfortunately it is a rather difficult task to create a good FTL layer and nobody still managed to implement one for Linux."


Out of curiosity, are you sure about #1? I ask because a long, long time ago I worked with some raw NAND on embedded devices and I remember doing small writes; in fact I could do single-byte writes (write only, the erase was in big blocks as you said).


On NAND there is the NOP property (Number of programs). For SLC it is usually 4. So, you can program a single page in up to 4 steps on SLC NAND. If you write less than that, you're wasting space because you need to jump to the next (sub)page. Re-programming is not allowed on NAND.


Partial page programming was supported on some older SLC NAND, but is usually unsupported on modern MLC(/TLC/QLC) flash.


You can use a normal file system on top of UBI. There's some rather large caveats, however: http://www.linux-mtd.infradead.org/doc/ubi.html#L_ubiblock


To support a block-based filesystem on top of MTD, an advanced FTL (Flash Translation Layer) is needed. Sadly, most algorithms for decent FTLs are patented; SSD vendors try hard to keep their magic sauce their own. In Linux we have some trivial FTLs implemented, but even these are patent-toxic.


While vendors would like you to believe it’s all patented secret sauce, I’m more skeptical. There are too many vendors, and they all perform roughly similarly, with the controller chip speed and bus speed being more impactful. The FTL might matter more for disk longevity, but I wouldn’t put much stock in consumer device numbers here. Even if Linux starts with an initially worse implementation, over time it should succeed because of cheaper costs multiplied by a greater aggregate investment from stakeholders than any single private stakeholder could match. There’s a reason Apple and Amazon have written their own controllers, and it’s largely because vendor solutions actually suck quite a bit when you start digging in.


And neither Apple nor Google published their FTLs...


I think it is possible to create an emulated block device stacked on UBI.


That said, it should be noted that there are still USB flash drives in the wild that do no error detection. It's a pity that `badblocks` is so poorly supported these days ...


I thought cheap flash devices are only usable as products because of error correction. USB flash drives and memory cards use reject-tier NAND and compensate by pre-marking bad sectors and heavy ECC. Your files exist as a probabilistic approximation.


I suppose it's possible that they simply do not do enough error correction, but given that I don't get I/O errors, only wrong data, I doubt it.

Typical algorithms work like:

* 1 bad bit is corrected

* 2 bad bits produce an error. Since there's nothing else meaningful to do, this can only propagate to the higher layer.

* 3 or more bad bits may produce wrong data or an error

so we wouldn't expect to hit case 3 without hitting case 2.
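The three cases above can be demonstrated with a minimal Hamming(8,4) SEC-DED code. This is a toy, NOT the kernel's NAND ECC (which protects 256-byte blocks with three ECC bytes), but the correct/detect/miscorrect behavior is the same:

```c
/* Hamming(8,4) SEC-DED toy: bit 0 holds an overall parity bit, bits 1,2,4
 * hold Hamming parity, bits 3,5,6,7 hold the four data bits. */
static unsigned bit(unsigned v, int i) { return (v >> i) & 1u; }

static unsigned encode(unsigned nib) {
    unsigned c = (bit(nib,0) << 3) | (bit(nib,1) << 5)
               | (bit(nib,2) << 6) | (bit(nib,3) << 7);
    c |= (bit(c,3) ^ bit(c,5) ^ bit(c,7)) << 1;   /* p1 */
    c |= (bit(c,3) ^ bit(c,6) ^ bit(c,7)) << 2;   /* p2 */
    c |= (bit(c,5) ^ bit(c,6) ^ bit(c,7)) << 4;   /* p4 */
    unsigned p0 = 0;
    for (int i = 1; i < 8; i++) p0 ^= bit(c, i);
    return c | p0;                                /* overall parity at bit 0 */
}

/* Returns 0: clean, 1: single error corrected, -1: uncorrectable.
 * *out gets the (possibly corrected) nibble when the result is >= 0. */
static int decode(unsigned c, unsigned *out) {
    unsigned s = (bit(c,1) ^ bit(c,3) ^ bit(c,5) ^ bit(c,7))
               | (bit(c,2) ^ bit(c,3) ^ bit(c,6) ^ bit(c,7)) << 1
               | (bit(c,4) ^ bit(c,5) ^ bit(c,6) ^ bit(c,7)) << 2;
    unsigned overall = 0;
    for (int i = 0; i < 8; i++) overall ^= bit(c, i);

    int status = 0;
    if (s && overall)       { c ^= 1u << s; status = 1; } /* case 1: fix it  */
    else if (s && !overall) return -1;                    /* case 2: report  */
    else if (!s && overall) status = 1;         /* parity bit itself flipped */
    *out = bit(c,3) | bit(c,5) << 1 | bit(c,6) << 2 | bit(c,7) << 3;
    return status;
}
```

Flipping any one bit is corrected, any two bits return -1, and three well-chosen flips (case 3) can cancel the syndrome and decode to wrong data without any error being raised.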


> On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor 5 (big endian mode, gcc 4.1.2, -O3)

Now that's a name I hadn't heard in a long long time.

The post is from 2008 anyway, I wonder what would happen if one were to put this code (or the earlier, non-unrolled-loop versions) to the test with modern compilers that can do all sorts of black magic for performance, or if it could be extended to take advantage of modern CPU features.


https://godbolt.org/z/szMvnTT63

Looking at the naive code from Analysis 1 and comparing GCC 4.1.2 to GCC 13.2, there's a huge difference, even when forcing both to generate code for the same old architecture (k8) so it's more fair.

GCC 4.1 generates code which is a pretty direct translation of the source: branchy, operates on bytes.

GCC 13.2 does a ton of optimization. There's only a single jump/loop. It uses prefetch. It uses SSE instructions to operate on 128-bits at a time. It looks like all the instructions are duplicated as well to operate on a pair of registers, too. I'm not exactly sure.


GCC is just vectorizing the code literally. It's doing multiple bytes in each loop iteration, but the vectorized code is still bit-testing the index to conditionally XOR into each accumulator. This is still much less efficient than taking advantage of the regular patterns as in the latter parts of the article. It also runs with far fewer lanes than it could, because it uses 32-bit lanes instead of 8-bit for some reason.
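For readers without the article open, the branchy "Analysis 1" shape being discussed looks roughly like this. This is an illustrative sketch, not the kernel's exact accumulator layout: walk 256 bytes, and for each of the 8 index bits XOR the byte into one of two accumulators, chosen by a branch:

```c
#include <stdint.h>
#include <string.h>

/* Naive line-parity sketch: rp[2k] accumulates bytes whose index has bit k
 * clear, rp[2k+1] those with bit k set. The inner if/else is the pattern
 * that the vectorizer turns into conditional-XOR lanes. */
static void naive_line_parity(const uint8_t buf[256], uint8_t rp[16]) {
    memset(rp, 0, 16);
    for (int i = 0; i < 256; i++) {
        for (int k = 0; k < 8; k++) {
            if (i & (1 << k))
                rp[2 * k + 1] ^= buf[i];   /* index bit k set   */
            else
                rp[2 * k] ^= buf[i];       /* index bit k clear */
        }
    }
}
```

The article's later versions exploit the regularity of which indices feed which accumulator instead of re-testing every bit of every index.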


I encountered NAND ECC about 10 years ago. We wanted to increase the size of the NAND flash on our embedded device (a data logger and gateway used to monitor and control solar power systems). The larger NAND parts didn't have built in error correction. Enabling software error correction meant we had to move the kernel onto a small SPI-flash part.

This was all pretty straightforward, and probably increased reliability, but it introduced new complexity into our build, install and upgrade process, because now I had to take into account the different flash layout in otherwise almost identical devices.

Since then I learned it's much easier to use eMMC. You don't need software block management and wear leveling, you just read the specs and let the manufacturer handle everything for you.


(2008) (Note the reference to gcc 4.2.)


Is there an NVMe manufacturer that sells their product as just naked flash? The first rule of hardware is that hardware providers don't know how to write software. It would be nice if we could get rid of the proprietary and buggy embedded operating systems running our storage as an opaque abstraction layer. Things like lying about flush that was recently on HN would stop being a problem.


While not going as far as you want, one movement in that direction is NVMe Zoned Namespaces (https://www.anandtech.com/show/15959/nvme-zoned-namespaces-e...).


Flexible Data Placement is another interesting development in the NVMe space.


These are called "open-channel SSDs", which seem to currently mainly be consigned to research purposes AFAICT. But some do exist.

https://en.wikipedia.org/wiki/Open-Channel_SSD


The NAND flash manufacturers usually have no software involvement unless they are in the business of also selling SSD controller SoCs. They will typically sell FDKs/SDKs that are heaping piles of garbage written overseas for their SoCs that provide minimum functionality for interfacing with all of the components.

I wouldn't really expect controller-less NAND flash to ever meaningfully compete with what we have now. There is a lot more specialized hardware in a typical SoC than just a microprocessor and NAND flash chips


Well, by definition, naked flash wouldn't be NVMe... so no.


It would just be flash


I expected discussion of __builtin_parity and cmov at least.


This article references gcc 4.1, __builtin_parity is likely newer than that.
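For context, __builtin_parity(x) computes the XOR of all bits of x (1 if an odd number of bits are set). The shift-fold below is a portable equivalent sketch (not code from the article):

```c
/* Portable parity fold: repeatedly XOR the halves together until bit 0
 * holds the XOR of all 32 input bits. */
static int parity_fold(unsigned x) {
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}
```

On GCC/Clang this matches __builtin_parity(x), which the compiler may lower to a popcount or a parity-flag trick depending on the target.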


Me too. Reading the article, there seems to be no attempt to replace the two remaining "if" statements and so avoid branches. It doesn't even need to be CMOV: possibly introduce tmppar1 and tmppar2 = tmppar & (an operation that is either 0 or -1), and then always do rp12 ^= tmppar1; rp14 ^= tmppar2; -- it might not be faster, but that's what I would try at least.


One more... Any benefit would be CPU specific:

if ((i & 0x1) == 0) rp12 ^= tmppar;

Seems like it's ripe for something like:

rp12 ^= tmppar & ~(0 - (i & 0x1));

If my logic is functioning ;-) this eliminates the branch, but many modern CPUs running Linux probably have a conditional move, which should be faster with the straightforward version.

Edit: except on RISC-V, where there are no conditionals other than branches. Maybe this will help on that arch.
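A side-by-side sketch of the two variants (variable names taken from the thread's snippet), showing they compute the same thing:

```c
#include <stdint.h>

/* Branchy original: XOR only when i is even. */
static uint32_t branchy(uint32_t rp12, uint32_t tmppar, unsigned i) {
    if ((i & 0x1) == 0)
        rp12 ^= tmppar;
    return rp12;
}

/* Branchless: 0 - (i & 1) is all-ones when i is odd, so its complement
 * masks tmppar to zero on odd i and passes it through on even i. */
static uint32_t branchless(uint32_t rp12, uint32_t tmppar, unsigned i) {
    return rp12 ^ (tmppar & ~(0u - (i & 0x1u)));
}
```

Whether the mask beats a branch (or a compiler-generated cmov) is CPU-specific, as noted above.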


Not directly related, but I wish there was a 'standard' Linux documentation that looked more like the Python docs.


The kernel docs are generated with Sphinx, just like the Python docs are. Are you saying you wish we used the Python theme?


No, I was referring to the structure of the Python docs. More specifically, I want there to be something like the 'Library Reference'. It starts from the basics and moves to more abstract/advanced topics. I guess the current docs are optimized for many different use cases, but as a beginner, it is very hard for me to get an 'Overview' of Linux from the official docs. There should be better categorization IMO!


There is the Linux Programming Environment book, which covers most of the library.



