But one of the goals of this is to be able to handle video and audio together. (This enables an easier API for keeping audio and video in sync with each other, which can be tricky in some scenarios when the two use totally separate APIs.)

The other main goal is to simultaneously support both pro-audio flows like JACK, and consumer flows like PulseAudio without all the headaches caused by trying to run both of those together.

Lastly, PipeWire is specifically designed to support the protocols of basically all existing audio daemons. So if the new APIs provide no benefit to your program, then you might as well just ignore them, and continue to use the PulseAudio APIs or JACK APIs or the ESD APIs or the ALSA APIs or ... (you get the idea).

Now you are not wrong that audio is a real time task, and that there are advantages to running part of it kernel side (especially if low latency is desired, since the main way to mitigate issues from scheduling uncertainties is to use large buffers, which is the opposite of low latency).
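To put rough numbers on the buffer-size point, here is a small sketch of the arithmetic. The function name and the sample rate and buffer sizes are my own illustrative picks, not anything taken from PipeWire, JACK or PulseAudio; it only computes how long a buffer of a given size takes to play out.

```python
def buffer_latency_ms(frames: int, sample_rate_hz: int, periods: int = 1) -> float:
    """Time needed to play out `periods` buffers of `frames` frames each, in ms.

    Hypothetical helper for illustration only: it covers the buffering delay
    and nothing else (no hardware, driver or resampling latency).
    """
    return 1000.0 * frames * periods / sample_rate_hz


if __name__ == "__main__":
    # Illustrative sizes: small buffers for low latency, large buffers for
    # riding out scheduling hiccups.
    for frames in (64, 256, 1024, 4096):
        print(f"{frames:5d} frames @ 48000 Hz -> {buffer_latency_ms(frames, 48000):6.2f} ms")
```

The larger the buffer, the longer the scheduler can be late without an audible dropout, but every frame sitting in that buffer is delay the listener hears.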

On the other hand, I'm not sure an API like the one you propose will work as needed. For example, there really are cases where sources A, B, C and D need to be output to devices W, X, Y, and Z, but with different mixes for each, some of which might need delays added or effects (like reverb, compression, application of frequency equalization curves, etc.) applied, and I have not even mentioned yet that device W is not a physical device, but actually the audio feed for a video stream to be encoded and transmitted live.
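Just to make the shape of that routing problem concrete, here is a toy sketch of the gain-only part of it: a per-output mix over named sources. Everything in it (the `mix` function, the dict layout, the sample values) is made up for illustration and is not how any real audio daemon does it. Delays and effects would be extra per-route or per-output stages on top of this.

```python
def mix(gains, sources):
    """Return one mixed block per output device.

    gains:   {output_name: {source_name: gain}}  (hypothetical layout)
    sources: {source_name: [samples]}, all blocks the same length
    """
    block_len = len(next(iter(sources.values())))
    outputs = {}
    for device, routes in gains.items():
        block = [0.0] * block_len
        for source, gain in routes.items():
            for i, sample in enumerate(sources[source]):
                block[i] += gain * sample  # sum the weighted sources
        outputs[device] = block
    return outputs


if __name__ == "__main__":
    sources = {"A": [1.0, 1.0], "B": [0.5, -0.5]}
    gains = {
        "W": {"A": 1.0, "B": 1.0},  # e.g. the feed for the video encoder
        "X": {"A": 0.5},            # a different mix for another device
    }
    print(mix(gains, sources))
```

Even this trivial version needs a different set of gains for every output, and it has to run once per buffer without ever missing a deadline.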

Try designing something that can handle all of that kernel side. Some of it you obviously have no chance of running in kernel mode. That typically implies that everything before it in the audio pipeline ought to get done in user mode too. Otherwise the kernel mode to user mode transition has most of the scheduling concerns that a full user-space audio pipeline implementation has. For things like per-output-device effects, that would imply basically the whole pipeline being in user mode.

The whole thing is a very thorny issue with no perfect solutions, just a whole load of different potential tradeoffs. Moving more into kernel mode may be a sensible tradeoff for some scenarios, yet for others that kernel side implementation may be unusable, and just contribute more complexity to the endless array of possible audio APIs.