On Windows, DWM's display compositing adds one frame of latency to every window on screen. It's not possible to render a dragged object in any window that sticks to the mouse cursor without at least one frame of latency.
But when you drag whole windows around they do stick to the mouse cursor with apparently zero frames of latency; how does DWM do it? Easy, they cheat by disabling the hardware mouse overlay during window dragging so that the mouse cursor gets that extra frame of latency too. You can prove this by enabling "Night light" in settings; watch the mouse cursor change colors as it transitions from hardware overlay to software rendering when you start dragging a window.
Could compositors be optimised to eliminate the extra frame of lag in the case where every window on-screen is being displayed “directly”, by invisibly switching to a mode that maps each scanline and pixel column to a window’s framebuffer - and non-client areas to the window manager’s UI buffer - which is read directly by the monitor’s signal generator? While this would mean transparency effects wouldn’t work, they could be supported with some special-casing. Basically a framebuffer-less hardware compositor. I think rendering windows onto deformable 3D meshes [makes for cool demos](https://youtu.be/USedxVrU2Ko), but in practice we just don’t use it for anything besides window open/close animations.
I had to use a monitor running at 30Hz for a while (4K over HDMI 1.4) and while that was bad enough, the compositor’s lag meant all window contents had an extra (unnecessary IMO) delay of 33ms. Add on to that normal monitor input lag.
We’ll probably all shift to 120Hz w/ variable-rate refreshing as a new baseline standard over the next 10 years, as Apple seems to be heading in that direction - at 120Hz the lag of the compositor would be acceptable - but I’m worried that lazy graphics devs are going to use that as an excuse to add another frame of latency...
> Could compositors be optimised to eliminate the extra frame of lag in the case where every window on-screen is being displayed “directly” by invisibly switching to a mode that maps each scanline and pixel column to a window’s framebuffer
Yes. This concept is called hardware overlays and there are varying levels of support for it in different GPUs and compositors.
There are tradeoffs. Using multiple hardware overlays may cost extra power and/or memory bandwidth, the number of supported overlays may be very limited, alpha blending may not be supported, and the transforms that can be applied to overlays may be very limited. The extremely hardware specific nature of the restrictions and the lack of good APIs exposing overlays means they get much less use than they should.
Technically, if your pixel operation were commutative and reversible (as alpha blending could be), you could keep a buffer of pixels at the current position, undo the previous contribution of the current window from that buffer, re-apply the operation with the new pixel value, and then send the result directly to the display?
Am I missing something, apart from the fact that alpha blending is not actually commutative and/or reversible, and the fact that nobody has implemented this yet?
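For what it's worth, the snag can be shown in a few lines. A minimal sketch (plain Python, illustrative values only) of why undo-then-reapply breaks down for "over" blending: the operation is invertible only while the window's alpha is strictly below 1, and a fully opaque pixel destroys the information needed to undo it:

```python
# Illustrative sketch: "over" blending out = a*src + (1 - a)*dst is only
# invertible while a < 1; at a == 1 the old dst contributes nothing.
def blend(src, dst, a):
    return a * src + (1 - a) * dst

def unblend(blended, src, a):
    # Attempt to recover the original dst by undoing the window's pixel.
    return (blended - a * src) / (1 - a)

dst, src, a = 0.25, 0.9, 0.5
b = blend(src, dst, a)
recovered = unblend(b, src, a)
print(abs(recovered - dst) < 1e-9)  # True: works while a < 1

try:
    unblend(blend(src, dst, 1.0), src, 1.0)
except ZeroDivisionError:
    print("a == 1: the blend destroyed the information needed to undo it")
```

And even for a < 1, real compositors blend in 8- or 10-bit buffers, so quantization makes the "undo" lossy in practice.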
>On Windows, DWM's display compositing adds one frame of latency to every window on screen. It's not possible to render a dragged object in any window that sticks to the mouse cursor without at least one frame of latency.
AFAIK you can bypass this by using the DXGI flip model, so no additional latency is incurred. There's still going to be 1 frame of latency from the vsync though.
>You can prove this by enabling "Night light" in settings; watch the mouse cursor change colors as it transitions from hardware overlay to software rendering when you start dragging a window.
Can't reproduce on my end. Maybe they updated the Night light implementation so the hardware cursor is tinted as well.
> AFAIK you can bypass this by using dxgi flip model so no additional latency is incurred.
Using the flip model only eliminates the latency if DWM promotes your window to a hardware overlay. On Nvidia systems this is simply not supported, so the latency is always there and it's impossible to get rid of it. Maybe DWM supports overlays on Intel or AMD, I'm not sure. It would be interesting for someone to test this.
> There's still is going to be 1 frame of latency from the vsync though.
Vsync does not inherently require any extra latency. You can render as close to vsync as you like to reduce the latency an arbitrary amount. That's what VR compositors do. All you need to do is ensure you can't flip during scanout and you can't get tearing.
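The scheduling idea can be sketched in a few lines. This is a hedged illustration (the numbers and the helper name are made up, not any real compositor's API): given the vsync period and a conservative estimate of render time, you start rendering as late as is safe:

```python
# Illustrative sketch (made-up numbers, not a real compositor API): schedule
# rendering as late as safely possible before the next vsync, VR-style.
def render_start_time(last_vsync, period, est_render_time, safety_margin):
    """Latest safe time (ms) to begin rendering for the upcoming vsync."""
    next_vsync = last_vsync + period
    return next_vsync - est_render_time - safety_margin

# 60 Hz: period ~16.67 ms. If rendering takes ~3 ms plus a 2 ms margin,
# input sampled at render start is ~5 ms old at scanout, not a whole frame.
start = render_start_time(last_vsync=0.0, period=16.67,
                          est_render_time=3.0, safety_margin=2.0)
print(f"start rendering at t = {start:.2f} ms after the last vsync")
```

The tradeoff is that if your render-time estimate is ever too optimistic, you miss the vsync entirely and stutter for a full frame, which is why the margin matters.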
My understanding is that promoting a window to hardware overlay is only supported on Kaby Lake and later Intel integrated graphics, and there it's a heuristic, so there's no way to guarantee getting it. You do have to be in flip mode, but in flip mode smooth resizing can't be done without artifacts. Currently druid downgrades to direct2d hwnd render targets during a live resize, but this feels hacky and is likely creating other problems.
I've spent a fair amount of time investigating this and have a mostly written blog post on it, but right now I'm kind of sick of the topic - it's a good illustration of how easily software evolves into stuff that's complex and broken.
Great info, thanks. I'd love to see that blog post.
I similarly got fed up with it after a bunch of investigation. I also have a speculative suspicion that the reason overlays aren't supported is that they were artificially omitted from the GeForce driver to support Nvidia's Quadro price discrimination. Ugh.
> It's not possible to render a dragged object in any window that sticks to the mouse cursor without at least one frame of latency.
I thought this was a fact of all window managers?
I'd noticed it when making games in SDL / SDL2 on Linux and just assumed it was because the X server couldn't possibly wait on me to paint a frame before updating its own cursor
Wait are you saying DWM literally hides the cursor and then draws its own cursor manually? I thought it would change to a different type of fallback rendering in the kernel or something. Interesting, okay thanks.
>This happens in a buffer and is normally one display update behind in time.
This assumes compositors perform their work right after each display refresh. Compositors can instead defer their work until some amount of time before the next display refresh (e.g. a few milliseconds). This reduces latency because new buffers submitted by clients (such as web browsers) can be displayed with less than one refresh period worth of latency. For instance, the browser can update its buffer at last display refresh + 8ms, then the compositor can composite at last refresh + 13ms, and the new frame can be displayed at last refresh + 16ms.
Here's for instance how Weston does it: [1]. Sway has a similar feature.
>However since pointing with a cursor is such a core experience in these OS'es, the "screen compositor" usually have special code to draw the cursor on screen as late as possible—as close in time to an actual display refresh as possible—to be able to use the most recent position data from the input device driver.
That's not entirely true. Nowadays all GPUs have a feature called "cursor plane". This allows the compositor to configure the cursor directly in the hardware and to avoid drawing it. So when the user just moves the mouse around the compositor doesn't need to redraw anything, all it needs to do is update the cursor position in the hardware registers.
Compositors don't have code to draw the cursor as late as possible. Instead, they program the cursor position when drawing a new frame. (On some hardware this allows the compositor to "fixup" the cursor position in case some input events happen after drawing and before the display refresh.)
But in the end, all of this doesn't really matter. What matters is that the app draws before the compositor draws, thus the compositor will have a more up-to-date cursor position.
The neglect for latency in current popular systems such as Linux sickens me.
I suggest experimenting with cyclictest from rt-tests. On all hardware I've tried, I get 30ms+ peaks after running it on the background for not even very long. I can't comprehend how anybody could find this acceptable.
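cyclictest itself comes from the rt-tests package and measures scheduling latency properly; as a rough, hedged stand-in, here's a Python loop that records how late periodic wakeups actually are. It measures the whole userspace stack, so its numbers will be far worse than cyclictest's, but the point about peaks vs. averages is the same:

```python
import time

# Rough userspace analogue of cyclictest: request a 1 ms sleep repeatedly
# and record how late each wakeup is. The peaks, not the average, are what
# cause perceptible hiccups.
def measure_wakeup_jitter(iterations=200, interval_s=0.001):
    lates = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        time.sleep(interval_s)
        late = time.perf_counter() - t0 - interval_s
        lates.append(max(late, 0.0))
    return lates

lates = measure_wakeup_jitter()
print(f"avg: {sum(lates) / len(lates) * 1e6:.0f} us, "
      f"worst: {max(lates) * 1e6:.0f} us")
```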
I do run linux-rt for this reason. Then again, while linux-rt provides the tools to make latency reasonable, the rest of the system hardly uses them.
As we move from the likes of Linux to better-architected systems, potentially based on seL4, I do hope responsiveness will return to sanity. Until then, I'll have to keep going back to my Amiga hardware as a coping mechanism.
The jump in RHEL from 6 to 7 basically made it incredibly hard to tune Linux for very low-latency performance requirements. It was fairly simple on 6, but 7 made it very difficult. There are lots of tools available, nohz etc., but they don't help much. The primary core on each NUMA node is also loaded with kernel threads, causing huge amounts of jitter.
Basically everything is tuned for running web apps with loads of procs for people who don't really care about latency of 100s of millis.
Why would a real-time OS help at all with latency? All RT means is that the latency can be reliably upper-bounded (but note that that upper bound might be very high/slow), it doesn't mean that the latency will be reduced. Real-time OSs aren't faster.
linux-rt is a patchset that changes the behavior of linux to increase the number of places where preemption can occur (among other things).
Doing this decreases certain types of latency in certain situations. As an example, it tries to have interrupts disabled less frequently and for shorter intervals, and uses mutexes instead of spinlocks.
As a result, using linux-rt can provide a lower-latency experience compared to plain Linux.
Ah, that's fair enough, but it isn't 'real time', which is the thing I was assuming from the 'RT' in the name. Perhaps linux-ll would be a better name, for 'low latency'. RT just confuses what it is trying to do.
It is trying to make the linux kernel more real time capable. Having periods of time where preemption isn't enabled (due to having interrupts disabled, etc) results in more variation in when tasks are scheduled, including real time tasks.
The reality is that "real time" as a definition covers many "features" and design choices because many ducks need to be in a row for real time tasks to run properly. Decreasing variation in the scheduling of (real time) tasks is one of those items.
As a result, it's entirely reasonable to call "linux-rt" "linux-rt".
It is extremely desirable. Those multi-ms latency peaks Linux has are the ones that cause audio dropouts and perceived hiccups.
Of course it doesn't matter perceptually whether the average is 1µs or 5µs. It's all about the peaks, and keeping them bounded enough that latency never crosses the perceptual threshold.
But none of the apps they would be running (their browser, in this specific case) are RT. So if the application isn't asking for a hard limit on latency, they aren't going to get anything different on a RT OS.
If you try to "predict the present" based on the past (and when you use previous points to calculate velocity and acceleration, that's what you're doing) it will overshoot when there's a change in direction, and how much depends on how aggressively you try to extrapolate. For the one-dimensional case in signal processing, doing this with a quickly-changing signal like a square wave will result in ringing.
It can smooth things a bit but it's not that good a substitute for actually improving latency.
(There are probably consequences for coronavirus charts as well, since they're based on lagging data.)
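The overshoot is easy to demonstrate. A minimal one-dimensional sketch (illustrative values) of velocity-based extrapolation applied to a square-wave-like motion - note the wild prediction at the direction reversal:

```python
# One-dimensional sketch of velocity-based extrapolation ("dead reckoning").
# Illustrative values: a pointer sweeping right, then snapping back left.
def predict(samples, lookahead=1):
    """Extrapolate one step ahead from the last two samples."""
    velocity = samples[-1] - samples[-2]
    return samples[-1] + velocity * lookahead

signal = [0, 1, 2, 3, 4, 4, 4, 0, 0, 0]
predictions = [predict(signal[:i + 1]) for i in range(1, len(signal))]
print(predictions)  # [2, 3, 4, 5, 4, 4, -4, 0, 0]
# The 5 overshoots the plateau; the -4 is the ringing at the reversal:
# the predictor flings the cursor far past where the hand actually went.
```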
Although I agree that there's no substitute for actually improving latency, I think it's possible to do significantly better at prediction. Mouse movements are not easily predictable but they are also not completely random; this is a good type of problem to apply machine learning to.
Ultimately you want the lowest possible latency and prediction, because you can never get the latency to zero. Once the latency is small enough, prediction becomes a net win. For example, all VR devices do prediction for head and hand positions after lowering latency as much as possible elsewhere.
I would reckon that overshoot (the mouse cursor moving "backwards" after over-prediction) is significantly worse than undershoot in terms of user experience. We can easily compensate for mouse acceleration not being constant (ask anyone with enhanced pointer speed). But the pointer doing qualitatively different movements than what you input is annoying.
In the limit I guess this boils down to "do no prediction" (which I also suppose is what the linked site's conclusion is).
I'm seeing <2ms in Edge Chromium and ~10ms in Firefox on a 144 Hz display. I'm curious how that compares to what other people are seeing.
I've been doing some WebGL work recently and I've noticed that while it reaches ~144 fps using requestAnimationFrame() in Firefox, there's a lot of stuttering. It's very smooth at 144 fps in Edge Chromium, while Edge Legacy is below 80 fps. As far as I can tell it's not CPU bound, and it's definitely not GPU bound. It would be nice if I could get it running smoothly in Firefox but I don't know what to investigate.
> If you move your pointer left and right (or up and down) in sweeping motions and follow it with your eyes, you'll notice that the rectangle is trailing behind the pointer by quite a long distance
that's definitely not what I'm observing (https://streamable.com/9u4cpx). Enabling the predictive tracking, however, is quite nauseating, especially in circular motions. Please don't play with your users' cursors!
I didn't read the article, but I did try the checkboxes. What I saw surprised me and I will go read the article to see if it addresses my experience, but in case it isn't:
1. The predictive checkbox improved tracking my cursor.
2. Disabling `requestAnimationFrame` improved it more.
This is not what I'd have expected, so I'll include details about my environment:
- macOS 10.15.4
- Safari 13.1
- 2019 16" MBP with maxed RAM and ~25GB swap
I have no idea whether the browser or the memory pressure made same-thread tracking more accurate, but something did.
I recently experimented with implementing certain pointer-controlled effects on a <canvas>, and was discouraged by the jerky feeling caused by latency.
But I noticed that if I rendered the effect with motion blur, it suddenly started to feel much smoother, and the perception of jerkiness was mostly gone. I felt that it completely restored my sense of control of the motion.
It’s surprising considering that motion blur actually adds one half frame of extra latency.
Since trying this, I’m bothered by how jerky fast mouse movement always feels in MacOS. 60 fps leaves these enormous, ugly gaps between the pointer at each frame, and makes it hard to perceive the motion correctly. I can’t unsee these gaps now! I’m convinced that system-wide motion blur just for the pointer would be a simple way to make the whole OS feel much smoother and more responsive.
I have had a similar experience when first using a 144hz display. I was amazed how responsive the mouse was. How "in control" I felt.
Then, going back to a 60hz display, I couldn't NOT see the gaps left by the cursor's movement. I had never before seen this as a problem, but seeing something better ruined 60hz for me.
The dead-reckoning algorithm seems to do well when moving in a straight line, but my impression is that it does worse on curves because it veers outside the path the mouse actually traced. For example, when moving the mouse in a circle, the predicted square appears to move in a circle with a larger radius.
What kind of algorithm could be used to improve the accuracy for curves?
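The larger-radius effect falls out of the geometry: extrapolating along the chord from the last two samples always lands outside the circle. A small sketch (unit circle, illustrative step size):

```python
import math

# Sketch on the unit circle: extrapolating along the chord, p[n] + (p[n] - p[n-1]),
# always lands outside the circle: |2a - b| = sqrt(5 - 4*cos(dtheta)) > 1.
dtheta = math.tau / 60  # illustrative: 60 samples per revolution
a = (math.cos(dtheta), math.sin(dtheta))   # newest sample
b = (1.0, 0.0)                             # previous sample
pred = (2 * a[0] - b[0], 2 * a[1] - b[1])  # dead-reckoned next position
radius = math.hypot(*pred)
print(f"predicted radius: {radius:.4f}")   # > 1.0: veers outside the path
```

A curvature-aware predictor - e.g. fitting a circle through the last three samples - would fix this particular case, at the cost of amplifying sensor noise.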
Ha; there's a little architecture grognard subthread on this unrelated topic.
The Pentium 4 L1 cache was a miracle for its time, and once the P4 was clocked to Peak Netburst levels the 2-cycle latency looks really good.
Tradeoffs on modern systems are different - a Skylake L1 cache may have 4-5 cycles of access latency, but it is four times bigger (32 KiB rather than 8), can execute twice as many loads per cycle, and is write-back rather than write-through (more complex to design, but more scalable with lots of cores).
You can still get your ass kicked by an ancient system if you pick just the right pointer-chasing microbenchmark. This had some real implications for regex implementation, given that straightforward DFA implementations (and many string-matching algorithms like Aho-Corasick) are really just pointer chasing.
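To make the pointer-chasing point concrete, here's a toy table-driven DFA (a hypothetical example, not from any real regex engine). Every input character becomes one table lookup whose address depends on the result of the previous lookup, so the loads form a serial dependency chain bounded by load latency:

```python
# Toy table-driven DFA (hypothetical, not a real regex engine) matching
# strings that end in "ab". Each character is one table lookup that depends
# on the previous one -- the dependent-load chain that L1 latency bounds.
ACCEPT = 2
table = {
    0: {"a": 1},               # seen nothing useful yet
    1: {"a": 1, "b": ACCEPT},  # just saw "a"
    2: {"a": 1},               # just matched "ab"
}

def matches(text):
    state = 0
    for ch in text:
        state = table[state].get(ch, 0)  # the dependent load
    return state == ACCEPT

print(matches("xxab"))  # True
print(matches("xxba"))  # False
```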