
This relies a lot on being able to detect bots. Everything you said could be easily bypassed with a small to moderate amount of effort on the part of the crawlers' creators. Distinguishing genuine traffic has always been hard, and it will not get easier in the age of AI.


You can sprinkle your site with almost-invisible hyperlinks. Bots will follow, humans will not.
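For instance, a minimal trap link might look like this (the URL, class name, and CSS are all illustrative):

```html
<!-- Honeypot: present in the DOM, so a naive crawler following
     every <a href> will request /trap-page, but positioned
     off-screen so humans never see or click it. -->
<style>
  .trap { position: absolute; left: -9999px; }
</style>
<a href="/trap-page" class="trap" tabindex="-1">.</a>
```

Server-side you would then log or throttle any client that requests /trap-page.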


This would be terrible for accessibility for users using a screen reader.


So would the site shutting down because AI bots are too much traffic.


<a aria-hidden="true"> ... </a> will result in a link ignored by screen readers.


Removing elements that match `[hidden], [aria-hidden]` is the most trivial cleanup transform a crawler can do and I'm sure most crawlers already do that.
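A sketch of that cleanup transform using only Python's standard-library parser (the names are mine, and a real crawler would also handle void elements and `aria-hidden="false"` more carefully):

```python
from html.parser import HTMLParser

class HiddenStripper(HTMLParser):
    """Drops any element carrying a `hidden` attribute or
    aria-hidden="true", including its entire subtree."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a hidden subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            self.skip_depth += 1
            return
        a = dict(attrs)  # bare attributes like `hidden` map to None
        if "hidden" in a or a.get("aria-hidden") == "true":
            self.skip_depth = 1
            return
        attr_s = "".join(f' {k}="{v}"' for k, v in attrs)
        self.out.append(f"<{tag}{attr_s}>")

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def strip_hidden(html: str) -> str:
    p = HiddenStripper()
    p.feed(html)
    return "".join(p.out)
```

Any trap link marked up that way disappears before the crawler ever extracts hrefs.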


But the very comment you replied to explains how to do it: a page forbidden in robots.txt. Does it really need explaining why that is ideal for separating humans and Google from malicious crawlers?


robots.txt is a somewhat useful tool for keeping search engines in line, because it's rather easy to prove that a search engine ignores robots.txt: when a noindex page shows up in SERPs. This evidence trail does not exist for AI crawlers.


I'd say a bigger problem is that people disagree about the meaning of nofollow and noindex.


The detection and bypass is trivial: Access the site from two IPs, one disrespecting robots.txt. If the content changes, you know it's garbage.
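A sketch of the comparison step (the fetching itself is elided, and the normalisation patterns are illustrative, not exhaustive):

```python
import hashlib
import re

def normalise(body: str) -> str:
    # Strip volatile bits (timestamps, per-request tokens) before
    # comparing, so legitimate dynamic content doesn't trip the check.
    body = re.sub(r'\d{4}-\d{2}-\d{2}[T ][\d:.]+', '', body)
    body = re.sub(r'(csrf|nonce)="[^"]*"', '', body)
    return body

def served_decoy(body_clean_ip: str, body_flagged_ip: str) -> bool:
    """True if the two fetches -- one from an IP that respected
    robots.txt, one from an IP that ignored it -- differ after
    normalisation, suggesting the server is cloaking content."""
    h = lambda s: hashlib.sha256(normalise(s).encode()).hexdigest()
    return h(body_clean_ip) != h(body_flagged_ip)
```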


Yes, please explain. How does an entry in robots.txt distinguish humans from bots that ignore it?


When was the last time you looked at robots.txt to find a page that wasn't linked anywhere else?


It was a while ago, and it was not deliberate: wget downloaded robots.txt along with the files I requested, and through it I found many other files. Some required a password and could not be accessed, but some were interesting (although I did not use wget to copy those other files; I only wanted the files I originally requested).


Crawlers aren't interested in fake pages that aren't linked to anywhere, they're crawling the same pages your users are viewing.


Adding a disallowed url to your robots.txt is a quick way to get a ton of crawlers to hit it, without linking to it from anywhere. Try it sometime.
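A honeypot entry of that kind might look like this (the path is illustrative; an unguessable one works best, since nothing links to it):

```
User-agent: *
Disallow: /do-not-crawl-9f3a/
```

Server-side, any request hitting that path then identifies a client that read robots.txt and deliberately ignored the Disallow rule.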


Tuesday. But I have odd hobbies.


robots.txt is not a sitemap. If it worked that way, you could just make a 5 TB file linking to a billion pages that look like static links but are dynamically generated.


robots.txt has a maximum relevant size of 500 KiB.



