Set a hardware breakpoint and you'll know immediately. That's what he eventually...

alexvitkov · on Dec 31, 2024

He mentioned in the article that the corruption happens at a seemingly random spot in the middle of a large buffer, and you can only have a HW breakpoint on 4 addresses in x86-64.

mgaunard · on Jan 3, 2025

It's only random when using ASLR.

quotemstr · on Dec 31, 2024

Reproduce the corruption under rr. Replay the rr trace. Replay is totally deterministic, so you can just seek to the end of the trace, set a hardware breakpoint on the damaged stack location, and reverse-continue until you find the culprit.

zorgmonkey · on Dec 31, 2024

rr only works on Linux, and Windows TTD was released after this blog post was published. Also, the huge slowdown from time travel debuggers can sometimes make tricky bugs like this much harder to reproduce.

pm215 · on Dec 31, 2024

I would certainly try with a reverse debugger if I had one, but where the repro instructions are "run this big complex interactive program for 10 minutes" I wouldn't be super confident about successfully recording a repro. At least in my experience with rr the slowdown is enough to make that painful, especially if you need to do multiple "chaos mode" runs to get a timing sensitive bug to trigger. It might still be worth spending time trying to get a faster repro case to make reverse debug a bit more tractable.

IshKebab · on Dec 31, 2024

Sure let me just run `rr` on Windows...

saagarjha · on Dec 31, 2024

Hardware breakpoints don't work if the kernel is doing the writes, because the kernel won't let you enable them globally so they trigger outside of your program.

mgaunard · on Jan 4, 2025

If you use a decent kernel like Linux, there is an API to do that from userspace without requiring you to reboot your kernel under a debugger.

saagarjha · on Jan 4, 2025

I don't think I'm familiar with that API. What is it?

mgaunard · on Jan 4, 2025

It's part of perf_event, available since 2.6.33.
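
For anyone who hasn't used it, here is a minimal sketch of what that could look like: `perf_event_open` with `PERF_TYPE_BREAKPOINT` and a write watchpoint on one address. The helper names (`make_write_watchpoint`, `open_watchpoint`, `read_hit_count`) are made up for illustration. The sketch only counts hits; whether writes made by the kernel are reported is an assumption that depends on privileges and settings such as `perf_event_paranoid`, so treat it as a starting point rather than a recipe.

```cpp
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <cstring>

// Fill in the attributes for a hardware write watchpoint on `addr`.
// `len` is the watched width in bytes and must be 1, 2, 4 or 8.
perf_event_attr make_write_watchpoint(const void* addr, std::uint64_t len) {
    assert(len == 1 || len == 2 || len == 4 || len == 8);
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_BREAKPOINT;
    attr.size = sizeof(attr);
    attr.bp_type = HW_BREAKPOINT_W;  // trigger on writes only
    attr.bp_addr = reinterpret_cast<std::uintptr_t>(addr);
    attr.bp_len = len;
    // exclude_kernel stays 0 after the memset, i.e. kernel mode is not
    // filtered out. Assumption: the caller is permitted to ask for that.
    return attr;
}

// Open the watchpoint for the calling thread (pid 0) on any CPU (cpu -1).
// Returns a file descriptor, or -1 with errno set (for example EACCES).
int open_watchpoint(perf_event_attr& attr) {
    return static_cast<int>(
        syscall(SYS_perf_event_open, &attr, /*pid=*/0, /*cpu=*/-1,
                /*group_fd=*/-1, /*flags=*/0UL));
}

// Read how many times the watchpoint has fired so far, or -1 on failure.
long long read_hit_count(int fd) {
    std::uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != static_cast<ssize_t>(sizeof(count))) {
        return -1;
    }
    return static_cast<long long>(count);
}
```

Typical use would be to build the attributes for the address that gets clobbered, call `open_watchpoint`, run the suspect code, and then check `read_hit_count`. Finding the instruction responsible would need sampling or a signal on top of this, which the sketch leaves out.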

machine_coffee · on Dec 31, 2024

Also surprised an async completion was writing to the stack. You should normally pass a heap buffer to these functions and keep it alive, e.g. for the lifetime of the object being watched.

muststopmyths · on Dec 31, 2024

It's not an async completion. The call is synchronous.

Windows allows some synchronous calls to be interrupted by another thread to run an APC if the called thread is in an "alertable wait" state. The interrupted thread then returns to the blocking call, so the pointers in the call are expected to be valid.

Edit 2: I should clarify that the thread returns to the blocking call, which then exits with WAIT_IO_COMPLETION status. So you have to retry the call, but the stack context is expected to be safe.

APC is an "Asynchronous procedure call", which is asynchronous to the calling thread in that it may or may not get run. Edit: May or may not run at a future time.

(https://learn.microsoft.com/en-us/windows/win32/sync/asynchr...)

There are very limited things you are supposed to do in an APC, but these are poorly documented and need one to think carefully about what is happening when a thread is executing in a stack frame and you interrupt it with this horrorshow.

Win32 API is a plethora of footguns. For the uninitiated it can be like playing Minesweeper with code. Or like that scene in Galaxy Quest where the hammers are coming at you at random times as you try to cross a hallway.

A lot of it was designed by people who, I think, would call one stupid for holding it wrong.

I suppose it's a relic of the late 80s and 90s when you crawled on broken glass because there was no other way to get to the other side.

You learn a lot about the underlying systems this way, but these days people need to get shit done and move on with their lives.

Us olds are left behind staring nostalgically at our mangled feet while we yell at people to get off our lawns.

Fulgen · on Jan 1, 2025

> There are very limited things you are supposed to do in an APC, but these are poorly documented and need one to think carefully about what is happening when a thread is executing in a stack frame and you interrupt it with this horrorshow.

One must not throw a C++ exception across stack frames that don't participate in C++ stack unwinding, whether it's a Win32 APC, another Win32 callback, a POSIX signal or `qsort` (for the people that believe qsort still has a place in this decade). How the Win32 API is designed is absolutely irrelevant for the bug in this code.
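
To make the rule concrete, here is a sketch of one common way to stay on the right side of it with `qsort`: catch everything inside the callback, stash it in a `std::exception_ptr`, and rethrow only once the C code has returned. The names `checked_compare`, `compare_trampoline` and `sort_checked` are hypothetical, and this shows the general pattern, not the fix used in the article.

```cpp
#include <cassert>
#include <cstdlib>
#include <exception>
#include <stdexcept>
#include <vector>

// Holds an exception raised inside the comparator until qsort has returned.
static thread_local std::exception_ptr pending_error;

// Hypothetical C++ comparison that may throw: it rejects negative values.
int checked_compare(int a, int b) {
    if (a < 0 || b < 0) throw std::invalid_argument("negative value");
    return (a > b) - (a < b);
}

// Callback handed to qsort. It is noexcept and catches everything, so no
// exception ever travels through qsort's own stack frames.
extern "C" int compare_trampoline(const void* lhs, const void* rhs) noexcept {
    if (pending_error) return 0;  // already failed: just let qsort finish
    try {
        return checked_compare(*static_cast<const int*>(lhs),
                               *static_cast<const int*>(rhs));
    } catch (...) {
        pending_error = std::current_exception();
        return 0;
    }
}

// Sorts `values`; rethrows any comparator exception after qsort is done.
// If it throws, the order of `values` is unspecified.
void sort_checked(std::vector<int>& values) {
    assert(!pending_error);  // this sketch does not support nested use
    if (values.size() < 2) return;
    std::qsort(values.data(), values.size(), sizeof(int), compare_trampoline);
    if (pending_error) {
        std::exception_ptr error = pending_error;
        pending_error = nullptr;
        std::rethrow_exception(error);  // thrown from a pure C++ frame
    }
}
```

The same shape applies to an APC or any other callback invoked by C code: the callback records the failure, and the C++ caller decides what to do with it after the foreign frames are gone.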

muststopmyths · on Jan 5, 2025

I was talking about APCs and the Win32 API in general, not this bug.

loeg · on Dec 31, 2024

select() (written in C, a language without exceptions) is synchronous, its authors just (reasonably) did not expect an exception to be thrown in the middle of it invoking a blocking syscall. The algorithm was correct in the absence of a language feature C simply does not have and that is relatively surprising (you don't expect syscalls to throw in C++ either).

PhiSchle · on Dec 31, 2024

You state this like an obvious fact, but it is only obvious if you either heard of something like this, or you've been through it.

From that point on I am sure he knew to do that. What's obvious to you can also just be your experience.