
This relies a lot on being able to detect bots. Everything you said could be easily bypassed with a small to moderate amount of effort on the part of the crawlers' creators. Distinguishing genuine traffic has always been hard, and it will not get easier in the age of AI.


You can sprinkle your site with almost-invisible hyperlinks. Bots will follow, humans will not.
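For instance, a minimal trap link might look like this (the URL, class name, and CSS are all illustrative):

```html
<!-- Honeypot: present in the DOM, so a naive crawler following
     every <a href> will request /trap-page, but positioned
     off-screen so humans never see or click it. -->
<style>
  .trap { position: absolute; left: -9999px; }
</style>
<a href="/trap-page" class="trap" tabindex="-1">.</a>
```

Server-side you would then log or throttle any client that requests /trap-page.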


This would be terrible for accessibility for users using a screen reader.


So would the site shutting down because AI bots are too much traffic.


<a aria-hidden="true"> ... </a> will result in a link ignored by screen readers.


Removing elements that match `[hidden], [aria-hidden]` is the most trivial cleanup transform a crawler can do and I'm sure most crawlers already do that.
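A sketch of that cleanup transform using only Python's standard-library parser (the names are mine, and a real crawler would also handle void elements and `aria-hidden="false"` more carefully):

```python
from html.parser import HTMLParser

class HiddenStripper(HTMLParser):
    """Drops any element carrying a `hidden` attribute or
    aria-hidden="true", including its entire subtree."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a hidden subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            self.skip_depth += 1
            return
        a = dict(attrs)  # bare attributes like `hidden` map to None
        if "hidden" in a or a.get("aria-hidden") == "true":
            self.skip_depth = 1
            return
        attr_s = "".join(f' {k}="{v}"' for k, v in attrs)
        self.out.append(f"<{tag}{attr_s}>")

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def strip_hidden(html: str) -> str:
    p = HiddenStripper()
    p.feed(html)
    return "".join(p.out)
```

Any trap link marked up that way disappears before the crawler ever extracts hrefs.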


But the very comment you replied to explains how to do it: a page forbidden in robots.txt. Does it really need explaining why that is ideal for separating humans and Google from malicious crawlers?


robots.txt is a somewhat useful tool for keeping search engines in line, because it's rather easy to prove that a search engine ignores robots.txt: when a noindex page shows up in SERPs. This evidence trail does not exist for AI crawlers.


I'd say a bigger problem is that people disagree about the meaning of nofollow and noindex.


The detection and bypass is trivial: Access the site from two IPs, one disrespecting robots.txt. If the content changes, you know it's garbage.
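A sketch of the comparison step (the fetching itself is elided, and the normalisation patterns are illustrative, not exhaustive):

```python
import hashlib
import re

def normalise(body: str) -> str:
    # Strip volatile bits (timestamps, per-request tokens) before
    # comparing, so legitimate dynamic content doesn't trip the check.
    body = re.sub(r'\d{4}-\d{2}-\d{2}[T ][\d:.]+', '', body)
    body = re.sub(r'(csrf|nonce)="[^"]*"', '', body)
    return body

def served_decoy(body_clean_ip: str, body_flagged_ip: str) -> bool:
    """True if the two fetches -- one from an IP that respected
    robots.txt, one from an IP that ignored it -- differ after
    normalisation, suggesting the server is cloaking content."""
    h = lambda s: hashlib.sha256(normalise(s).encode()).hexdigest()
    return h(body_clean_ip) != h(body_flagged_ip)
```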


Yes, please explain. How does an entry in robots.txt distinguish humans from bots that ignore it?


When was the last time you looked at robots.txt to find a page that wasn't linked anywhere else?


It was a while ago, and it was not deliberate: wget downloaded robots.txt along with the files I requested, and through it I found many other files. Some required a password and could not be accessed, but some were interesting (although I did not use wget to copy those other files; I only wanted the files I originally requested).


Crawlers aren't interested in fake pages that aren't linked to anywhere, they're crawling the same pages your users are viewing.


Adding a disallowed url to your robots.txt is a quick way to get a ton of crawlers to hit it, without linking to it from anywhere. Try it sometime.
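A honeypot entry of that kind might look like this (the path is illustrative; an unguessable one works best, since nothing links to it):

```
User-agent: *
Disallow: /do-not-crawl-9f3a/
```

Server-side, any request hitting that path then identifies a client that read robots.txt and deliberately ignored the Disallow rule.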


Tuesday. But I have odd hobbies.


robots.txt is not a sitemap. If it worked that way, you could just make a 5 TB file linking to a billion pages that look like static links but are dynamically generated.


robots.txt has a maximum relevant size of 500 KiB.



