Can you elaborate on how crawling is optional for indexing? Isn't crawling a pre...

nostrademons · on Jan 25, 2013

You can discover a URL through finding a link to it on a publicly-accessible web page, even if crawling that link itself is not possible.
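To make the distinction concrete, here is a rough sketch in Python of "discovered but not crawlable". It is not how any real search engine is implemented. The class `LinkCollector`, the function `classify_links`, the user agent `ExampleBot`, and the example.com URLs are all made up for illustration, and for simplicity it assumes every link points at the one host whose robots.txt is passed in.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags on a page the bot was allowed to fetch."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def classify_links(page_html, robots_txt, user_agent="ExampleBot"):
    """Map each discovered URL to True (may be crawled) or False (known, but off limits).

    Assumption: all links target the single host that robots_txt belongs to.
    """
    collector = LinkCollector()
    collector.feed(page_html)

    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    return {url: rules.can_fetch(user_agent, url) for url in collector.links}


if __name__ == "__main__":
    # Hypothetical public page that links to two URLs on example.com.
    page = (
        '<a href="http://example.com/public">public</a>'
        '<a href="http://example.com/private/page">private</a>'
    )
    robots = "User-agent: *\nDisallow: /private/\n"
    for url, crawlable in classify_links(page, robots).items():
        print(url, "crawlable" if crawlable else "known URL only")
```

The URL under /private/ still ends up in the result: the bot knows it exists from the link alone, it just never sees what is served there.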

MiguelHudnandez · on Jan 25, 2013

Ohh, got it, thank you. So Google is aware that the URL exists, even though they know nothing about the content served at that URL.

I am just surprised that a URL with no associated content would be included in the index.

But now that I think about it more, why not? It will not show up except in extremely specific searches, and in those cases it is useful to the searcher.

coderdude · on Jan 26, 2013

I find this behavior annoying. Here's why:

https://www.google.com/search?q=unicorn+admin

4th result down (wbpreview.com) is shown in search results despite blocking crawling/indexing with robots.txt. The result displays "A description for this result is not available because of this site's robots.txt – learn more" and the title seems to be auto-generated. The goal was to de-index the listing but apparently that's not an option.

gizmo686 · on Jan 26, 2013

As franze pointed out, you can specify not to index in robots.txt (I have not confirmed this). The intent of disallowing crawling is ambiguous: maybe the site owner does not want the content cached, or wants to avoid the extra load on the server, or has any number of other reasons. If you need to de-index a site, you should use the robots.txt directive. If it has already been indexed and you need it de-indexed quickly, Google offers tools to do so [1].

[1] http://support.google.com/webmasters/bin/answer.py?hl=en&...

coderdude · on Jan 26, 2013

Thank you for pointing that out to me.

nostrademons · on Jan 26, 2013

The way to prevent a site from being indexed at all is a <meta name="robots" content="noindex,nofollow"> tag on the page or an X-Robots-Tag HTTP header (both of which, ironically, require that you not block the page in robots.txt, because otherwise the page will never be crawled and the directive never seen), or a Noindex directive in robots.txt (which is not part of the robots.txt spec; Google supports it, but Yahoo and Bing don't).
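The interaction between the two mechanisms can be sketched as a toy decision function in Python. This is a simplified model of the logic described here, not Google's actual pipeline. The names `RobotsMetaParser`, `has_noindex`, `index_decision` and the three return labels are invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(headers, html):
    """True if either the X-Robots-Tag header or the robots meta tag says noindex."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def index_decision(crawl_allowed, headers, html):
    """Toy model: a noindex signal only counts if the page could be fetched."""
    if not crawl_allowed:
        # Blocked by robots.txt: the response is never fetched, so headers and
        # html are never looked at. The bare URL can still be listed.
        return "url-only"
    if has_noindex(headers, html):
        return "excluded"
    return "indexable"
```

Calling `index_decision(False, {}, page)` with a page that carries the noindex meta tag still returns "url-only", which is the irony: the tag is there, but a blocked crawler never gets to read it.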