Set a hardware breakpoint and you'll know immediately. That's what he eventually did, but he should have done so sooner.
Then obviously, cancelling an operation is always tricky business with lifetime due to asynchronicity. My approach is to always design my APIs with synchronous cancel semantics, which is sometimes tricky to implement. Many common libraries don't do it right.
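To make "synchronous cancel semantics" concrete, here is a minimal sketch of what I mean. `OneShotTimer` and everything in it is a made-up illustration, not an API from any real library. The point is only the contract: once `cancel()` returns, the callback is either finished or will never run, so the caller can safely free whatever the callback touches (including stack variables).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical one-shot timer with synchronous cancel semantics.
class OneShotTimer {
public:
    OneShotTimer(std::chrono::milliseconds delay, std::function<void()> callback)
        : callback_(std::move(callback)) {
        assert(callback_ && "a callback is required");
        // Start the worker last, after every other member is initialized.
        worker_ = std::thread([this, delay] { run(delay); });
    }

    OneShotTimer(const OneShotTimer&) = delete;
    OneShotTimer& operator=(const OneShotTimer&) = delete;

    ~OneShotTimer() { cancel(); }

    // Blocks until the worker thread has exited. After this returns, the
    // callback is not running and will never run again.
    // Assumptions of this sketch: cancel() is not called concurrently from
    // several threads, and never from inside the callback itself (joining
    // your own thread fails).
    void cancel() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            cancelled_ = true;
        }
        cv_.notify_all();
        if (worker_.joinable()) worker_.join();
    }

    // True once the callback has run to completion.
    bool fired() const { return fired_.load(); }

private:
    void run(std::chrono::milliseconds delay) {
        {
            std::unique_lock<std::mutex> lock(mutex_);
            // wait_for returns the predicate value: true means "cancelled".
            if (cv_.wait_for(lock, delay, [this] { return cancelled_; })) return;
        }
        // A cancel() arriving from here on simply waits in join() until the
        // callback below has finished.
        callback_();
        fired_.store(true);
    }

    std::function<void()> callback_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool cancelled_ = false;
    std::atomic<bool> fired_{false};
    std::thread worker_;
};
```

The cost is that `cancel()` can block for as long as the callback takes, which is the part that gets tricky in real code (callbacks that take locks the canceller holds, cancel called from the callback, and so on).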
He mentioned in the article that the corruption happens at a seemingly random spot in the middle of a large buffer, and you can only have a HW breakpoint on 4 addresses in x86-64.
Reproduce the corruption under rr. Replay the rr trace. Replay is totally deterministic, so you can just seek to the end of the trace, set a hardware breakpoint on the damaged stack location, and reverse-continue until you find the culprit.
rr only works on Linux, and Windows TTD was released after this blog post was published. Also, the huge slowdown from time travel debuggers can sometimes make tricky bugs like this much harder to reproduce.
I would certainly try with a reverse debugger if I had one, but where the repro instructions are "run this big complex interactive program for 10 minutes" I wouldn't be super confident about successfully recording a repro. At least in my experience with rr the slowdown is enough to make that painful, especially if you need to do multiple "chaos mode" runs to get a timing sensitive bug to trigger. It might still be worth spending time trying to get a faster repro case to make reverse debug a bit more tractable.
Hardware breakpoints don't work if the kernel is doing the writes, because the kernel won't let you enable them globally so that they would also trigger outside of your program.
Also surprised an async completion was writing to the stack. You should normally pass a heap buffer to these functions and keep it alive, e.g. for the lifetime of the object being watched.
It's not an async completion. The call is synchronous.
Windows allows some synchronous calls to be interrupted by another thread to run an APC if the called thread is in an "alertable wait" state. The interrupted thread then returns to the blocking call, so the pointers in the call are expected to be valid.
Edit 2: I should clarify that the thread returns to the blocking call, which then exits with WAIT_IO_COMPLETION status. So you have to retry the call, but the stack context is expected to be safe.
APC is an "Asynchronous procedure call", which is asynchronous to the calling thread in that it may or may not get run.
Edit: May or may not run at a future time.
There are very limited things you are supposed to do in an APC, but these are poorly documented and need one to think carefully about what is happening when a thread is executing in a stack frame and you interrupt it with this horrorshow.
Win32 API is a plethora of footguns. For the uninitiated it can be like playing Minesweeper with code. Or like that scene in Galaxy Quest where the hammers are coming at you at random times as you try to cross a hallway.
A lot of it was designed by people who, I think, would call one stupid for holding it wrong.
I suppose it's a relic of the late 80s and 90s when you crawled on broken glass because there was no other way to get to the other side.
You learn a lot about the underlying systems this way, but these days people need to get shit done and move on with their lives.
Us olds are left behind staring nostalgically at our mangled feet while we yell at people to get off our lawns.
> There are very limited things you are supposed to do in an APC, but these are poorly documented and need one to think carefully about what is happening when a thread is executing in a stack frame and you interrupt it with this horrorshow.
One must not throw a C++ exception across stack frames that don't participate in C++ stack unwinding, whether it's a Win32 APC, another Win32 callback, a POSIX signal or `qsort` (for the people that believe qsort still has a place in this decade). How the Win32 API is designed is absolutely irrelevant for the bug in this code.
select() (written in C, a language without exceptions) is synchronous, its authors just (reasonably) did not expect an exception to be thrown in the middle of it invoking a blocking syscall. The algorithm was correct in the absence of a language feature C simply does not have and that is relatively surprising (you don't expect syscalls to throw in C++ either).
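For what it's worth, the usual way to stay on the right side of that rule is to catch everything at the callback boundary, park the exception, and rethrow it only after the C code has returned. A rough sketch follows. `c_for_each` is a hypothetical stand-in for whatever C routine is calling you back (it is not a real library function), and the other names are made up for illustration too.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <vector>

extern "C" {
// Hypothetical C-style API: calls fn for each element, stops early if fn
// returns nonzero. Stands in for any C code that invokes a user callback.
typedef int (*visit_fn)(int value, void* user);

void c_for_each(const int* data, std::size_t count, visit_fn fn, void* user) {
    assert(fn != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        if (fn(data[i], user) != 0) return;
    }
}
}

// State shared between the C++ caller and the callback.
struct VisitState {
    long sum = 0;
    std::exception_ptr error;  // set if the C++ logic threw
};

// Ordinary C++ code that is allowed to throw.
void add_checked(VisitState& state, int value) {
    if (value < 0) throw std::invalid_argument("negative value");
    state.sum += value;
}

// The boundary: no exception may leave this function, because the frames
// above it belong to C code that knows nothing about unwinding.
extern "C" int visit_trampoline(int value, void* user) {
    auto* state = static_cast<VisitState*>(user);
    try {
        add_checked(*state, value);
        return 0;
    } catch (...) {
        state->error = std::current_exception();
        return 1;  // ask the C side to stop, using its own protocol
    }
}

long sum_checked(const std::vector<int>& values) {
    VisitState state;
    c_for_each(values.data(), values.size(), visit_trampoline, &state);
    // Back in C++ frames only: now it is safe to propagate the failure.
    if (state.error) std::rethrow_exception(state.error);
    return state.sum;
}
```

With this shape the C routine always gets to finish its own cleanup and return normally, which is exactly what did not happen when the exception was thrown out of the APC.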