This relies heavily on being able to detect bots. Everything you described could be bypassed with a small to moderate amount of effort on the part of the crawlers' creators. Distinguishing genuine traffic has always been hard, and it will not get easier in the age of AI.
Removing elements that match `[hidden], [aria-hidden]` is the most trivial cleanup transform a crawler can do and I'm sure most crawlers already do that.
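To illustrate how trivial that cleanup pass is, here is a short sketch. For self-containment it uses Python's stdlib XML parser (so the input must be well-formed; a real crawler would use a proper HTML parser), and the function name is my own:

```python
import xml.etree.ElementTree as ET

def strip_hidden(markup: str) -> str:
    """Drop every element carrying [hidden] or [aria-hidden] -- the
    trivial cleanup transform a crawler would run before extraction."""
    root = ET.fromstring(markup)
    # ElementTree has no parent pointers, so build a child->parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    doomed = [el for el in root.iter()
              if "hidden" in el.attrib or "aria-hidden" in el.attrib]
    for el in doomed:
        parents[el].remove(el)
    return ET.tostring(root, encoding="unicode")

print(strip_hidden('<body><p>visible</p><p hidden="">trap</p>'
                   '<span aria-hidden="true">decoy</span></body>'))
# → <body><p>visible</p></body>
```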
But the very comment you replied to explains how to do it: a page forbidden in robots.txt. Humans never request it and Google obeys the Disallow, so any client that fetches it anyway is almost certainly a malicious crawler. Does this method really need an explanation of why it's ideal for separating the two?
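A minimal sketch of the trap, assuming a hypothetical path name and an in-memory ban list (in practice this would live in your server or a shared store):

```python
# robots.txt served to everyone; compliant crawlers will never touch /trap/.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"

# Hypothetical ban list keyed by client IP.
BANNED: set[str] = set()

def handle_request(client_ip: str, path: str) -> int:
    """Toy dispatcher returning an HTTP status code."""
    if path.startswith("/trap/"):
        # Only clients ignoring robots.txt ever reach this path: flag them.
        BANNED.add(client_ip)
        return 403
    if client_ip in BANNED:
        return 403
    return 200
```

The trap URL just needs to be linked somewhere no human would click; every hit on it is then a high-precision signal.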
robots.txt is a somewhat useful tool for keeping search engines in line, because it's relatively easy to prove that a search engine ignores it: a noindexed page shows up in the SERPs. No such evidence trail exists for AI crawlers.
It was a while ago, and it was not deliberate: wget downloaded robots.txt along with the files I requested, and through it I found many other files. Some could not be accessed because they required a password, but some were interesting. (I did not use wget to copy those other files; I only wanted the files I originally requested.)
robots.txt is not a sitemap. If it worked that way, you could just serve a 5 TB file linking to a billion pages that look like static links but are generated dynamically.
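That dynamic-generation trick is itself trivial: URLs can be derived deterministically on the fly, so the link space looks static but never has to exist as a file. A sketch (path scheme and function name are hypothetical):

```python
import hashlib

def trap_links(seed: str, n: int = 3) -> list[str]:
    """Deterministically derive static-looking URLs from a seed. Each
    generated page can embed links made from its own URL as the next
    seed, so the crawlable space is effectively unbounded."""
    return [
        "/articles/" + hashlib.sha256(f"{seed}/{i}".encode()).hexdigest()[:12] + ".html"
        for i in range(n)
    ]
```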