
Can you elaborate on how crawling is optional for indexing? Isn't crawling a prerequisite to indexing?

The only exceptions I can think of are scary, like operating a caching proxy and scraping the cached data. Or scraping data from browsers that have loaded pages by user request.



You can discover a URL through finding a link to it on a publicly-accessible web page, even if crawling that link itself is not possible.
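
To make that concrete, here is a minimal sketch using only the Python standard library. It assumes you already have the HTML of a crawlable page and the robots.txt text of the site the links point to; the `LinkExtractor` and `discover` names and the example.com URLs are made up for illustration, and this is not a claim about how any real search engine implements discovery.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, href))


def discover(page_url, page_html, target_robots_txt):
    """Return (url, may_crawl) pairs for every link found on the page.

    Simplifying assumption: every link points at the one site whose
    robots.txt text is passed in as target_robots_txt.
    """
    rules = RobotFileParser()
    rules.parse(target_robots_txt.splitlines())

    extractor = LinkExtractor(page_url)
    extractor.feed(page_html)

    # A URL is "known" as soon as it is extracted, whether or not
    # robots.txt allows fetching it.
    return [(url, rules.can_fetch("*", url)) for url in extractor.links]


page = '<a href="/admin">admin</a> <a href="/about">about</a>'
robots = "User-agent: *\nDisallow: /admin"
found = discover("http://example.com/", page, robots)
```

Both URLs end up in `found`; only the crawl permission differs, which is why a blocked URL can still show up in an index without a description.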


Ohh, got it, thank you. So Google is aware that the URL exists, even though they know nothing about the content served at that URL.

I am just surprised that a URL with no associated content would be included in the index.

But now that I think about it more, why not? It will not show up except in extremely specific searches, and in those cases it is useful to the searcher.


I find this behavior annoying. Here's why:

https://www.google.com/search?q=unicorn+admin

4th result down (wbpreview.com) is shown in search results despite blocking crawling/indexing with robots.txt. The result displays "A description for this result is not available because of this site's robots.txt – learn more" and the title seems to be auto-generated. The goal was to de-index the listing but apparently that's not an option.


As franze pointed out, you can specify noindex in robots.txt (I have not confirmed this). The intent of disallowing crawling is ambiguous: maybe the site owner does not want the content cached, or does not want the extra load on the server, or has any number of other reasons. If you need to de-index a site, you should use the robots.txt noindex directive. If it has already been indexed and you need it de-indexed quickly, Google offers tools to do so [1].

[1] http://support.google.com/webmasters/bin/answer.py?hl=en&...


Thank you for pointing that out to me.


The way to prevent a site from being indexed at all is through a <meta name="robots" content="noindex,nofollow"> tag on the page or an X-Robots-Tag HTTP header (both of which, ironically, require that you not block the page in robots.txt, because otherwise the page will never be crawled and the directive never seen), or through a Noindex directive in robots.txt (which is not part of the robots.txt spec: Google supports it, but Yahoo and Bing don't).
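
For anyone who wants to check what a page is actually sending, here is a rough sketch of how the two page-level signals could be read, using only the Python standard library. The `RobotsMetaParser` and `is_noindex` names are invented for this example, and it is deliberately simplified: it ignores user-agent-specific forms such as `googlebot: noindex` and the `none` shorthand.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def is_noindex(headers, html):
    """Return True if the response asks not to be indexed.

    headers: dict of HTTP response headers (hypothetical input shape).
    html:    the response body as text.
    """
    # Header names are case-insensitive, so compare in lowercase.
    header_directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_directives.update(t.strip().lower() for t in value.split(","))

    parser = RobotsMetaParser()
    parser.feed(html)

    return "noindex" in header_directives or "noindex" in parser.directives
```

Note that a crawler can only run a check like this after fetching the page, which is exactly why these signals stop working once robots.txt disallows the URL.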



