I think you have a misunderstanding of how disk IO happens. The CPU core sends a command to the disk ("I want this and that data"), then the CPU core can go do something else while the disk services that request. From what I've read, the disk actually puts the data directly into memory using DMA, without needing to involve the CPU.
So far so good, but then the question is how to ensure that the CPU core has something more productive to do than just checking "did the data arrive yet?" over and over. Coordinating that is where good APIs come in.
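To make the "something more productive to do" part concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: Python's asyncio has no true asynchronous file reads, so the blocking read is handed to a worker thread with `asyncio.to_thread`, and `blocking_read`, `other_work`, and `demo` are names made up for this example.

```python
import asyncio
import os
import tempfile


def blocking_read(path: str) -> bytes:
    # An ordinary blocking read: the calling thread sleeps until the data arrives.
    with open(path, "rb") as f:
        return f.read()


async def other_work(counter: list) -> None:
    # Stand-in for useful work the program can do while the read is in flight.
    for _ in range(5):
        counter[0] += 1
        await asyncio.sleep(0)  # hand control back to the event loop


async def main(path: str):
    counter = [0]
    # The read runs on a worker thread, so the event loop keeps running
    # other_work instead of spinning on "did the data arrive yet?".
    data, _ = await asyncio.gather(
        asyncio.to_thread(blocking_read, path),
        other_work(counter),
    )
    return data, counter[0]


def demo():
    # Hypothetical driver: write a small file, then read it back concurrently.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello from disk")
        path = tmp.name
    try:
        return asyncio.run(main(path))
    finally:
        os.unlink(path)
```

Calling `demo()` returns the file contents together with the number of work steps that ran while the read was pending. Note that the wait itself has not gone away; it has just been moved onto a thread the rest of the program does not depend on.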
There is nothing in the sense of Python async or JS async that the OS thread or OS process in question could usefully do on the CPU until the memory is paged into physical RAM. DMA or no DMA.
The OS process scheduler can run another process or thread. But your program instance will have to wait. That’s the point. It doesn’t matter whether waiting is handled by a busy loop a.k.a. polling or by a second interrupt that wakes the OS thread up again.
That is why Linux calls it uninterruptible sleep.
EDIT: io_uring would of course change your thread from blocking syscalls to non-blocking syscalls. Page faults are not a syscall, as GP pointed out. They are, however, a context switch to an OS interrupt handler. That is why you have an OS. It provides the software drivers for your CPU, MMU, and disks/storage. In this case, that is the page fault handler.
What everyone forgets is just how expensive context switches are on modern x86 CPUs. Those 512-bit vector registers fill up a lot of cache lines. That's why async tends to win over processes / threads for many workloads.
It could work like this: "Hey OS, I would like to process these pages*. Are they good to go? If not, could you fetch and lock them for me?" Then, if they are ready, you process them knowing it won't fault, and if they are not, you do something else and try again later.
It's a sort of hybrid of the mmap and fread paradigms in that there are both explicit read requests but the kernel can also get you data on its own initiative if there are spare resources for it.
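Something close to this can already be approximated on Linux and other Unix-likes. The sketch below assumes the existing `mincore` syscall (called through ctypes) for the "are they good to go?" check and `madvise(MADV_WILLNEED)` for the "please fetch them" hint. The function names `resident_pages`, `try_process`, and `demo` are made up for illustration, and nothing here locks the pages (that would need `mlock`), so a page could still be evicted between the check and the access.

```python
import ctypes
import ctypes.util
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def resident_pages(mm: mmap.mmap, length: int):
    """Return one bool per page (True = in RAM), or None if mincore is unavailable."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        mincore = libc.mincore
    except (OSError, AttributeError, TypeError):
        return None  # e.g. Windows, or a libc without mincore
    npages = (length + PAGE - 1) // PAGE
    vec = (ctypes.c_ubyte * npages)()
    buf = ctypes.c_char.from_buffer(mm)  # needs a writable mapping
    try:
        addr = ctypes.addressof(buf)
        rc = mincore(ctypes.c_void_p(addr), ctypes.c_size_t(length), vec)
    finally:
        del buf  # release the buffer export so the mmap can be closed later
    if rc != 0:
        return None
    return [bool(b & 1) for b in vec]  # low bit set means the page is resident


def try_process(mm: mmap.mmap, length: int):
    """Process the bytes if they look resident; otherwise hint the kernel and return None."""
    pages = resident_pages(mm, length)
    if pages is not None and not all(pages):
        # Not everything is in RAM yet: ask the kernel to start reading,
        # and tell the caller to go do something else for now.
        if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
            mm.madvise(mmap.MADV_WILLNEED, 0, length)
        return None
    # Either the pages are resident or we could not tell; just touch them.
    return sum(mm[:length])


def demo():
    # Hypothetical driver: map a four-page file and retry until it is processed.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x01" * (4 * PAGE))
        path = tmp.name
    try:
        with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
            for _ in range(100):
                total = try_process(mm, len(mm))
                if total is not None:
                    return total
                # "Do something else and try again later" would go here.
            return sum(mm[:])  # give up waiting and accept a possible fault
    finally:
        os.unlink(path)
```

The missing piece compared to the idea above is the guarantee: without locking, "resident a moment ago" is only a hint, so the access can still fault and put the thread to sleep.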
What advantages does that provide over using more OS threads? Ultimately this model is based on the idea that we want our programming runtimes to become increasingly responsible for low level scheduling concerns that have traditionally been handled by the OS scheduler.
I can broadly understand why there may be a desire to go down that path. But I’m not convinced that it would produce meaningfully better performance than the current abstractions. Especially if you take a step back and ask the question: is mmap the right tool to be using in these situations, rather than other tools like io_uring?
To be clear, I don’t know the answer to this question. But the complexity of the solutions being suggested to potentially improve the mmap API really makes me question if they’re capable of producing meaningful improvements.
It's hard to say on one hand "I use mmap because I don't want fancy APIs for every read" and on the other "I want to do something useful on page fault", because you don't want to make every memory read a possible interruption point.
I think you have a misunderstanding of how the OS is signaled about disk I/O being necessary. Most of the post above was discussing that aspect of it, before the OS even sends the command to the disk.