
"...the Bing toolbar is effectively a search indexer which does not respect robots.txt..."

Maybe I don't understand, but your logic seems correct up until the above statement.

Suppose you had a system that used only clickstream data, so you store a big list of URL pairs (A, B), each with a probability that B will follow A. I believe your argument relies on it being possible for such a system to violate a robots.txt file, and I don't yet see how it could.
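
To make the hypothetical concrete, here is a minimal sketch of such a system. The class and method names are made up for illustration, and nothing here is a claim about how the Bing toolbar actually works; it only stores observed (A, B) transitions and estimates the probability that B follows A.

```python
from collections import Counter, defaultdict


class ClickstreamModel:
    """Hypothetical clickstream-only model: no page content is ever fetched."""

    def __init__(self):
        # Maps URL A -> Counter of URLs B that users visited right after A.
        self._counts = defaultdict(Counter)

    def observe(self, a, b):
        """Record one observed navigation from URL a to URL b."""
        self._counts[a][b] += 1

    def probability(self, a, b):
        """Estimate P(next = b | current = a) from the observed counts."""
        followers = self._counts.get(a)
        if not followers:
            return 0.0
        return followers[b] / sum(followers.values())


model = ClickstreamModel()
for _ in range(3):
    model.observe("A", "B")
model.observe("A", "C")
print(model.probability("A", "B"))
```

Note that the only inputs are the URLs themselves, which is the crux of the question: the model never requests or reads the pages behind them.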



The intention of robots.txt is to tell search systems specifically, "do not use information from the following pages in building a search index". The Bing toolbar's use of clickstream data from Google, and no doubt many other sites, clearly violates that spirit.

This could easily be fixed by checking the clickstream data against robots.txt files and discarding data that shouldn't be used. Microsoft has apparently decided not to take that step.
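
A rough sketch of what that check might look like, using Python's standard `urllib.robotparser`. The function names, the `examplebot` user agent, and the policy choices are my own assumptions: a pair is dropped if either URL is disallowed, and hosts with no known robots.txt are treated as allowed. The robots.txt text is passed in as a string so the example does not touch the network.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Build a parser from robots.txt text that was fetched elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_clickstream(pairs, robots_by_host, user_agent="examplebot"):
    """Keep only (A, B) pairs where neither URL is disallowed for user_agent.

    robots_by_host maps a host name to its RobotFileParser. Hosts missing
    from the map are treated as allowed, which is an assumption of this sketch.
    """
    kept = []
    for a, b in pairs:
        allowed = True
        for url in (a, b):
            parser = robots_by_host.get(urlsplit(url).netloc)
            if parser is not None and not parser.can_fetch(user_agent, url):
                allowed = False
                break
        if allowed:
            kept.append((a, b))
    return kept


# Hypothetical data: example.com disallows everything under /search.
robots = {"www.example.com": build_parser("User-agent: *\nDisallow: /search\n")}
pairs = [
    ("https://www.example.com/search?q=x", "https://other.example.org/page"),
    ("https://www.example.com/about", "https://other.example.org/page"),
]
filtered = filter_clickstream(pairs, robots)
print(filtered)
```

Whether a pair should be discarded when only the source URL is disallowed, rather than the destination, is exactly the kind of policy question robots.txt itself does not answer.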


Your assumptions:

  - the "intention" of the robots.txt standard is as you state
  - the URL is included in the information not allowed by that standard
  - if the URL is not included, it should be because of the "intention"
  - toolbars are subject to the same standards

I'm not disagreeing with you so much as pointing out that I don't think EVERYONE agrees on these standards.



