Pointermove event latency in web browsers (rsms.me)
98 points by secondo on April 25, 2020 | hide | past | favorite | 56 comments


I've done a lot of work testing this type of latency in web browsers: https://google.github.io/latency-benchmark/

On Windows, DWM's display compositing adds one frame of latency to every window on screen. It's not possible to render a dragged object in any window that sticks to the mouse cursor without at least one frame of latency.

But when you drag whole windows around they do stick to the mouse cursor with apparently zero frames of latency; how does DWM do it? Easy, they cheat by disabling the hardware mouse overlay during window dragging so that the mouse cursor gets that extra frame of latency too. You can prove this by enabling "Night light" in settings; watch the mouse cursor change colors as it transitions from hardware overlay to software rendering when you start dragging a window.


Could compositors be optimised to eliminate the extra frame of lag, in the case where every window on-screen is being displayed “directly”, by invisibly switching to a mode that maps each scanline and pixel column to a window’s framebuffer - and non-client areas to the window-manager’s UI buffer - which is directly read by the monitor signal generator? While this would mean transparency effects wouldn’t work, it could be supported with some special-casing. Basically a framebuffer-less hardware compositor. I think rendering windows to 3D deformable meshes [makes for cool demos](https://youtu.be/USedxVrU2Ko), but in practice we just don’t use it for anything besides window open/close animations.

I had to use a monitor running at 30Hz for a while (4K over HDMI 1.4) and while that was bad enough, the compositor’s lag meant all window contents had an extra (unnecessary IMO) delay of 33ms. Add on to that normal monitor input lag.

We’ll probably all shift to 120Hz w/ variable-rate refresh as a new baseline standard over the next 10 years, as Apple seems to be heading in that direction - at 120Hz the lag of the compositor would be acceptable - but I’m worried that lazy graphics devs are going to use that as an excuse to add another frame of latency...


> Could compositors be optimised to eliminate the extra frame of lag in the case where every window on-screen is being displayed “directly” by invisibly switching to a mode that maps each scanline and pixel column to a window’s framebuffer

Yes. This concept is called hardware overlays and there are varying levels of support for it in different GPUs and compositors.

There are tradeoffs. Using multiple hardware overlays may cost extra power and/or memory bandwidth, the number of supported overlays may be very limited, alpha blending may not be supported, and the transforms that can be applied to overlays may be very limited. The extremely hardware specific nature of the restrictions and the lack of good APIs exposing overlays means they get much less use than they should.


Technically, if your pixel operation is commutative and reversible (as could be the case with alpha blending), you could store a buffer of pixels at the current position, undo the previous pixel contributed by the current window at that buffer, re-apply the operation with the new pixel value, and then send the result directly to the display?

Am I missing something apart from the fact that alpha blending is not actually commutative and/or reversible; and the fact that nobody implemented this yet?
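For what it's worth, the standard "over" operator is invertible (though not commutative) whenever the top layer isn't fully opaque, so the undo/re-apply half of the idea can at least be sketched. A toy per-channel version in Python - purely illustrative, not something any real compositor does:

```python
def over(top, alpha, under):
    """Standard 'over' alpha blend for one channel (floats in [0, 1])."""
    return top * alpha + under * (1.0 - alpha)

def unblend(composited, top, alpha):
    """Invert `over` to recover the underlying pixel (requires alpha < 1)."""
    return (composited - top * alpha) / (1.0 - alpha)

# Composite a window pixel over the background, then update the window
# pixel without ever re-reading the background:
bg, old_px, new_px, a = 0.2, 0.8, 0.5, 0.4
screen = over(old_px, a, bg)                           # initial composite
screen = over(new_px, a, unblend(screen, old_px, a))   # undo + re-apply
assert abs(screen - over(new_px, a, bg)) < 1e-9        # same as recompositing
```

The catch is that with more than two layers the "under" value is itself a composite, and rounding in 8-bit fixed point makes the inversion lossy in practice.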


>On Windows, DWM's display compositing adds one frame of latency to every window on screen. It's not possible to render a dragged object in any window that sticks to the mouse cursor without at least one frame of latency.

AFAIK you can bypass this by using the DXGI flip model, so no additional latency is incurred. There's still going to be 1 frame of latency from the vsync though.

>You can prove this by enabling "Night light" in settings; watch the mouse cursor change colors as it transitions from hardware overlay to software rendering when you start dragging a window.

Can't reproduce on my end. Maybe they upped the night light implementation so the hardware cursor is tinted as well.


> AFAIK you can bypass this by using dxgi flip model so no additional latency is incurred.

Using the flip model only eliminates the latency if DWM promotes your window to a hardware overlay. On Nvidia systems this is simply not supported, so the latency is always there and it's impossible to get rid of it. Maybe DWM supports overlays on Intel or AMD, I'm not sure. It would be interesting for someone to test this.

> There's still is going to be 1 frame of latency from the vsync though.

Vsync does not inherently require any extra latency. You can render as close to vsync as you like to reduce the latency an arbitrary amount. That's what VR compositors do. All you need to do is ensure you can't flip during scanout and you can't get tearing.
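The scheduling arithmetic can be sketched like so (hypothetical numbers; a real compositor would get vblank timestamps from the display driver rather than assuming a fixed cadence):

```python
def next_vsync(now, period):
    """Time of the next vsync after `now`, assuming vsyncs at 0, period, 2*period, ..."""
    return (now // period + 1) * period

def latest_render_start(now, period, worst_case_render, margin):
    """Latest moment to sample input and start rendering while still
    finishing safely before the upcoming vsync (the VR-compositor trick)."""
    return next_vsync(now, period) - worst_case_render - margin

# 60 Hz display, 3 ms worst-case render time, 1 ms safety margin:
period = 1.0 / 60.0
start = latest_render_start(0.001, period, 0.003, 0.001)
input_to_scanout = next_vsync(0.001, period) - start   # ~4 ms, not ~16 ms
```

The shorter and more predictable your render time, the closer to zero you can push the vsync-imposed latency.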


My understanding is that promoting a window to hardware overlay is only supported on Kaby Lake and later Intel integrated graphics, and there it's a heuristic, so there's no way to guarantee getting it. You do have to be in flip mode, but in flip mode smooth resizing can't be done without artifacts. Currently druid downgrades to direct2d hwnd render targets during a live resize, but this feels hacky and is likely creating other problems.

I've spent a fair amount of time investigating this and have a mostly written blog post on it, but right now I'm kind of sick of the topic - it's a good illustration of how easily software evolves into stuff that's complex and broken.


Great info, thanks. I'd love to see that blog post.

I similarly got fed up with it after a bunch of investigation. I also have a speculative suspicion that the reason overlays aren't supported is that they were artificially omitted from the GeForce driver to support Nvidia's Quadro price discrimination. Ugh.


> It's not possible to render a dragged object in any window that sticks to the mouse cursor without at least one frame of latency.

I thought this was a fact of all window managers?

I'd noticed it when making games in SDL / SDL2 on Linux and just assumed it was because the X server couldn't possibly wait on me to paint a frame before updating its own cursor.


Do you know how DWM disables the hardware mouse overlay? Is it an IOCTL?


https://docs.microsoft.com/en-us/windows/win32/api/winuser/n...

Any application can do it, hence why you get a laggy cursor in some games that opt to draw their own cursor.


Wait are you saying DWM literally hides the cursor and then draws its own cursor manually? I thought it would change to a different type of fallback rendering in the kernel or something. Interesting, okay thanks.


>This happens in a buffer and is normally one display update behind in time.

This assumes compositors perform their work right after each display refresh. Compositors can instead decide to perform their work later, some amount of time before the next display refresh (e.g. a few milliseconds). This reduces latency because the new buffers submitted by clients (such as web browsers) can be displayed with less than 1 refresh period's worth of latency. For instance the browser can update its buffer at last display refresh + 8ms, then the compositor can composite at last display refresh + 13ms, and the new frame can be displayed at last display refresh + 16ms.
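As a toy sketch of that arithmetic (same 8/13/16 ms numbers as above, assuming a fixed ~60 Hz cadence):

```python
REFRESH = 16.0  # ms between display refreshes (~60 Hz)

def latency_ms(buffer_ready, composite_at):
    """Ms from a client buffer being ready (measured from the last
    refresh) until it reaches the screen, if the compositor samples
    client buffers `composite_at` ms after each refresh and the result
    scans out on the following refresh."""
    if buffer_ready <= composite_at:
        return REFRESH - buffer_ready     # makes the upcoming scanout
    return 2 * REFRESH - buffer_ready     # missed it; waits a full period

# Browser finishes its frame 8 ms after a refresh:
late = latency_ms(8.0, composite_at=13.0)   # delayed compositing -> 8 ms
eager = latency_ms(8.0, composite_at=0.0)   # composite at refresh -> 24 ms
```

The tradeoff is that the later the compositor repaints, the less headroom it has before the deadline, so it risks missing the refresh entirely on a slow frame.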

Here's for instance how Weston does it: [1]. Sway has a similar feature.

>However since pointing with a cursor is such a core experience in these OS'es, the "screen compositor" usually have special code to draw the cursor on screen as late as possible—as close in time to an actual display refresh as possible—to be able to use the most recent position data from the input device driver.

That's not entirely true. Nowadays all GPUs have a feature called "cursor plane". This allows the compositor to configure the cursor directly in the hardware and to avoid drawing it. So when the user just moves the mouse around the compositor doesn't need to redraw anything, all it needs to do is update the cursor position in the hardware registers.

Compositors don't have code to draw the cursor as late as possible. Instead, they program the cursor position when drawing a new frame. (On some hardware this allows the compositor to "fixup" the cursor position in case some input events happen after drawing and before the display refresh.)

But in the end, all of this doesn't really matter. What matters is that the app draws before the compositor draws, thus the compositor will have a more up-to-date cursor position.

[1]: https://ppaalanen.blogspot.com/2015/02/weston-repaint-schedu...


The neglect for latency in current popular systems such as Linux sickens me.

I suggest experimenting with cyclictest from rt-tests. On all hardware I've tried, I get 30ms+ peaks after running it in the background for not even very long. I can't comprehend how anybody could find this acceptable.
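If you want a rough feel for it without installing rt-tests, a much cruder userspace analog is to measure how late timed sleeps wake up - unlike cyclictest proper, this bundles scheduler and runtime jitter together:

```python
import time

def wakeup_latencies(interval_s=0.001, iterations=200):
    """Crude cyclictest analog: sleep a fixed interval repeatedly and
    record how far past the deadline each wakeup lands (in seconds)."""
    lat = []
    for _ in range(iterations):
        deadline = time.monotonic() + interval_s
        time.sleep(interval_s)
        lat.append(time.monotonic() - deadline)
    return lat

lat = wakeup_latencies()
print(f"max wakeup latency: {max(lat) * 1000:.2f} ms")
```

On a loaded desktop the max will occasionally spike to many milliseconds even when the average stays tiny, which is exactly the peak-vs-average problem.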

I do run linux-rt for this reason. Then again, while linux-rt provides the tools to make latency reasonable, the rest of the system hardly uses them.

As we move from the likes of Linux to better-architected systems, potentially based on seL4, I do hope responsiveness will return to sanity. Until then, I'll have to keep going back to my Amiga hardware as a coping mechanism.


> I can't comprehend how anybody could find this acceptable.

Because Linux is primarily funded by server companies, and servers are optimized almost exclusively for throughput?


The Apple 2e from 1983 was the quickest, it is said: https://danluu.com/input-lag/


Of the systems tested. They didn't try AmigaOS. They didn't try freedos. Or haiku. Or netbsd.

But yeah, the point is clear: Current, popular desktop systems are pretty bad at responsiveness.


The jump in RHEL from 6 to 7 basically made it incredibly hard to tune Linux for very low latency performance requirements. It was fairly simple on 6, but 7 made it very difficult. There are lots of tools available, nohz etc., but they don't help much. The primary core on each NUMA node is also loaded with kernel threads, causing huge amounts of jitter.

Basically everything is tuned for running web apps with loads of procs for people who don't really care about latency of 100s of millis.


Why would a real-time OS help at all with latency? All RT means is that the latency can be reliably upper-bounded (but note that that upper bound might be very high/slow), it doesn't mean that the latency will be reduced. Real-time OSs aren't faster.


linux-rt is a patchset that changes the behavior of linux to increase the number of places where preemption can occur (among other things).

Doing this decreases certain types of latency in certain situations. As an example, it tries to have interrupts disabled less frequently and for shorter intervals, and uses mutexes instead of spinlocks.

As a result, using linux-rt can provide a lower-latency experience compared to plain Linux.


Ah, that's fair enough, but it isn't 'real time', which is the thing I was assuming from the 'RT' in the name. Perhaps linux-ll would be a better name, for 'low latency'. RT just confuses what it is trying to do.


It is trying to make the Linux kernel more real-time capable. Having periods of time where preemption isn't enabled (due to having interrupts disabled, etc.) results in more variation in when tasks are scheduled, including real-time tasks.

The reality is that "real time" as a definition covers many "features" and design choices because many ducks need to be in a row for real time tasks to run properly. Decreasing variation in the scheduling of (real time) tasks is one of those items.

As a result, it's entirely reasonable to call "linux-rt" "linux-rt".


>reliably upper-bounded.

That is extremely desirable. Those multi-ms peaks of latency Linux has are the ones that cause audio cuts and perceived hiccups.

Of course it doesn't matter perceptually if the average is 1µs or 5µs. It's all about the peaks, and keeping them bounded enough that latency never crosses the perceptual threshold.


But none of the apps they would be running (their browser, in this specific case) are RT. So if the application isn't asking for a hard limit on latency, they aren't going to get anything different on a RT OS.


For something that normally takes milliseconds, it's metrics like 99% latency that matter, not average latency. It doesn't have to be "faster".
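A toy illustration with made-up numbers of why the tail dominates perception:

```python
import statistics

# 100 frame times in ms: one bad frame in a hundred.
samples = [1.0] * 99 + [30.0]
mean = statistics.fmean(samples)                  # 1.29 ms -- looks fine
p99 = sorted(samples)[len(samples) * 99 // 100]   # 30 ms -- the hiccup users feel
```

The average says everything is fast; the 99th percentile is what shows up as a visible stutter or audio dropout.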


Personally I'm staying on X11 + no compositor for this reason.


If you try to "predict the present" based on the past (and when you use previous points to calculate velocity and acceleration, that's what you're doing) it will overshoot when there's a change in direction, and how much depends on how aggressively you try to extrapolate. For the one-dimensional case in signal processing, doing this with a quickly-changing signal like a square wave will result in ringing.
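A minimal sketch of that overshoot with first-order dead reckoning (made-up positions):

```python
def predict(prev, curr, lookahead=1.0):
    """First-order (constant-velocity) dead reckoning from two samples."""
    return curr + (curr - prev) * lookahead

# A pointer moving steadily right, then abruptly reversing:
xs = [0, 10, 20, 30, 20, 10]                        # true positions per frame
preds = [predict(xs[i - 1], xs[i]) for i in range(1, len(xs))]
# Steady motion predicts exactly (10, 20 -> 30), but at the reversal the
# predictor says 40 while the pointer actually went back to 20.
```

The more aggressive the lookahead, the bigger that spike at every change of direction - the discrete analog of ringing on a square wave.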

It can smooth things a bit but it's not that good a substitute for actually improving latency.

(There are probably consequences for coronavirus charts as well, since they're based on lagging data.)


Although I agree that there's no substitute for actually improving latency, I think it's possible to do significantly better at prediction. Mouse movements are not easily predictable but they are also not completely random; this is a good type of problem to apply machine learning to.

Ultimately you want the lowest possible latency and prediction, because you can never get the latency to zero. Once the latency is small enough, prediction becomes a net win. For example, all VR devices do prediction for head and hand positions after lowering latency as much as possible elsewhere.


I totally agree, you want both. Negative latency!

https://rauchg.com/2014/7-principles-of-rich-web-application...


That demo looks like the sort of attempt to be helpful that really irritates me on web sites.


I would reckon that overshoot (the mouse cursor moving "backwards" after over-prediction) is significantly worse than undershoot in terms of user experience. We can easily compensate for mouse acceleration not being constant (ask anyone with enhanced pointer speed). But the pointer doing qualitatively different movements than what you input is annoying.

In the limit I guess this boils down to "do no prediction" (which I also suppose is what the linked site's conclusion is).


Comment almost sounds like it's about PID.


Reminds me of a Kalman filter


I'm seeing <2ms in Edge Chromium and ~10ms in Firefox on a 144 Hz display. I'm curious how that compares to what other people are seeing.

I've been doing some WebGL work recently and I've noticed that while it reaches ~144 fps using requestAnimationFrame() in Firefox, there's a lot of stuttering. It's very smooth at 144 fps in Edge Chromium, while Edge Legacy is below 80 fps. As far as I can tell it's not CPU bound, and it's definitely not GPU bound. It would be nice if I could get it running smoothly in Firefox but I don't know what to investigate.


> I'm curious how that compares to what other people are seeing.

~10ms on Firefox, Linux, 60hz display.


1-2ms(avg) in FF on Linux on a 60Hz Display. 2.5-3.5 in Brave in a similar setup.


> If you move your pointer left and right (or up and down) in sweeping motions and follow it with your eyes, you'll notice that the rectangle is trailing behind the pointer by quite a long distance

That's definitely not what I am observing (https://streamable.com/9u4cpx). Enabling the predictive tracking, however, is quite nauseating, especially in circular motions. Please don't play with your users' cursors!


The article does mention that the predictive tracking feels worse:

> predictive tracking will feel much worse than direct (technically lagging) tracking when there is no system cursors to match.

Additionally, we can see the lag between the red box and your cursor in the video of your screen that you've uploaded.

https://i.imgur.com/ZEBcGch.png


I didn't read the article, but I did try the checkboxes. What I saw surprised me and I will go read the article to see if it addresses my experience, but in case it isn't:

1. The predictive checkbox improved tracking my cursor.

2. Disabling `requestAnimationFrame` improved it more.

This is not what I'd have expected, so I'll include details about my environment:

- macOS 10.15.4

- Safari 13.1

- 2019 16" MBP with maxed RAM and ~25GB swap

I have no idea whether the browser or the memory pressure made same-thread tracking more accurate, but something did.


Also great bench for browsers and perf tester https://www.vsynctester.com/


I recently experimented with implementing certain pointer-controlled effects on a <canvas>, and was discouraged by the jerky feeling caused by latency.

But I noticed that if I rendered the effect with motion blur, it suddenly started to feel much smoother, and the perception of jerkiness was mostly gone. I felt that it completely restored my sense of control of the motion.

It’s surprising considering that motion blur actually adds one half frame of extra latency.

Since trying this, I’m bothered by how jerky fast mouse movement always feels in MacOS. 60 fps leaves these enormous, ugly gaps between the pointer at each frame, and makes it hard to perceive the motion correctly. I can’t unsee these gaps now! I’m convinced that system-wide motion blur just for the pointer would be a simple way to make the whole OS feel much smoother and more responsive.
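For the curious, the cheap version of this on a <canvas> is just stamping interpolated, fading copies of the cursor sprite along the last frame's motion vector. A hypothetical helper computing those ghost positions (names and the alpha ramp are my own invention):

```python
def blur_samples(prev, curr, n=8):
    """Positions and alphas for fading ghost copies of the cursor along
    its last frame of motion (newest copy fully opaque)."""
    (x0, y0), (x1, y1) = prev, curr
    ghosts = []
    for i in range(1, n + 1):
        t = i / n
        ghosts.append(((x0 + (x1 - x0) * t, y0 + (y1 - y0) * t), t))
    return ghosts

# Cursor jumped 80 px in one frame; draw 4 fading stamps instead of one:
ghosts = blur_samples((0, 0), (80, 0), n=4)
# -> positions (20,0) (40,0) (60,0) (80,0) with alphas 0.25 .. 1.0
```

Each ghost then gets drawn with its alpha (e.g. via globalAlpha in a canvas context), filling the perceptual gap between frames.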


I have had a similar experience when first using a 144hz display. I was amazed how responsive the mouse was. How "in control" I felt.

Then, going back to a 60hz display, I couldn't NOT see the gaps left by the cursor's movement. I had never before seen this as a problem, but seeing something better ruined 60hz for me.


The dead-reckoning algorithm seems to do well when moving in a straight line, but my impression is that it does worse on curves because it veers outside the path the mouse actually traced. For example, when moving the mouse in a circle, the predicted square's trace appears to move in a circle with a larger radius.

What kind of algorithm could be used to improve the accuracy for curves?


For just ellipses (which includes circles), a 2nd derivative prediction will work perfectly. Obviously there are paths that are not predictable though
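In discrete samples it's only approximately exact, but a second-difference (constant-acceleration) extrapolation does do dramatically better on a circle than constant velocity. A quick sketch:

```python
import math

def order1(p0, p1):
    """Constant-velocity extrapolation from the last two samples."""
    return p1 + (p1 - p0)

def order2(p0, p1, p2):
    """Constant-acceleration extrapolation from the last three samples."""
    return p2 + (p2 - p1) + ((p2 - p1) - (p1 - p0))

# A pointer tracing a circle of radius 100 px, sampled every 0.1 rad:
pts = [(100 * math.cos(0.1 * i), 100 * math.sin(0.1 * i)) for i in range(4)]

def err(pred, actual):
    return math.hypot(pred[0] - actual[0], pred[1] - actual[1])

e1 = err(tuple(order1(pts[1][k], pts[2][k]) for k in (0, 1)), pts[3])
e2 = err(tuple(order2(pts[0][k], pts[1][k], pts[2][k]) for k in (0, 1)), pts[3])
# On this curved path e2 is roughly 10x smaller than e1.
```

The first-order error is the second difference of the path (which is large on a curve), while the second-order error is the third difference, so each added derivative buys roughly another factor of the per-sample angular step.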


I got a bit excited thinking this may go into latency of dereferencing pointers in C.


Same! I wrote a blog post that _kind of_ talks about that.

https://www.forrestthewoods.com/blog/memory-bandwidth-napkin...


Interesting results. Any ideas why the L1 got slower?


Ha; there's a little architecture grognard subthread on this unrelated topic.

The Pentium 4 L1 cache was a miracle for its time, and once the P4 was clocked to Peak Netburst levels the 2-cycle latency looks really good.

Tradeoffs on modern systems are different - a Skylake cache may have 4-5 cycle latency on access, but is 4 times bigger (32 rather than 8KiB), can execute twice as many loads per cycle, and is write-back rather than write-through (more complex to design, but more scalable with lots of cores).

You can still get your ass kicked by an ancient system if you pick just the right pointer-chasing microbenchmark. This has some real implications for regex implementation, given that a straightforward DFA implementation (and many string matching algorithms like Aho-Corasick) is really just pointer-chasing.


Thanks for the info.


Ok, we've disambiguated the bejesus out of the title above.


Was the original title the article's title? I'm really curious now.


"Pointer latency". I'd have mentioned that, but it's the article's own title.


Thanks!


It would be great to have these numbers indeed!


Same! I would love to read that


Yeah me too!



