
jardah · on June 2, 2018

Good point. I haven't seen a single detection library do this, but at least now I know that I still need to work on an alternative solution. Thanks.

jardah · on June 1, 2018

As far as we know, the webdriver property is the only one that still differs if you use non-headless Chrome with Puppeteer. The rest can be handled by using non-headless Chrome, as mentioned in the article.

But you are right, after reading through it again, this section of the article should be improved.

jardah · on June 1, 2018

That is kinda sad to hear. The approach should always be to take the path of least resistance and the smallest effect on the website. So, for example, if a company has an API that can be used instead of scraping their website, then it's always preferable to use the API. The same would go for the XML you mentioned.

It's bad that not everyone works like this; there are quite a lot of people who would rather brute-force a solution than think about it.

wumpus · on June 1, 2018

The path of least resistance for the bots appears to be that they have a tool that scrapes search results, and nothing to talk to an API.

jardah · on June 1, 2018

Yea, it was a very general example, since there is at least one rule that is based on rate limiting too, and this 300/IP limit is what we have seen on average.

jardah · on June 1, 2018

Yes it is, and since it's IP-based, it's even easier to hit if you are, for example, working from an office where multiple people are using Google.

But that is why they only show a reCAPTCHA: you fill it in and you get an exemption cookie for 30 more requests :D
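To make the numbers in this exchange concrete, here is a toy model of a per-IP limit with a captcha exemption. It is only a sketch: the class and method names are made up, the 300 and 30 figures are the ones quoted in this thread, there is no time window, and the exemption is tracked per IP rather than per cookie. It does not claim to describe how Google actually implements this.

```python
from collections import defaultdict


class IpRateLimiter:
    """Toy per-IP request limit with a captcha exemption (hypothetical model)."""

    def __init__(self, limit=300, exemption=30):
        self.limit = limit              # requests served per IP before a captcha
        self.exemption = exemption      # extra requests granted by a solved captcha
        self.used = defaultdict(int)    # requests counted against each IP
        self.bonus = defaultdict(int)   # remaining exemption requests per IP

    def allow(self, ip):
        """Return True if the request is served, False if a captcha is shown."""
        if self.used[ip] < self.limit:
            self.used[ip] += 1
            return True
        if self.bonus[ip] > 0:
            self.bonus[ip] -= 1
            return True
        return False

    def solve_captcha(self, ip):
        """Grant the exemption allowance (simplified: per IP, not per cookie)."""
        self.bonus[ip] = self.exemption
```

Because the counter is keyed by IP, everyone behind one office address shares the same allowance, which is why a shared connection reaches the limit faster.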

buzer · on June 1, 2018

> But that is why they only show a reCAPTCHA: you fill it in and you get an exemption cookie for 30 more requests :D

Does that actually work? Whenever I have searched for obscure things and managed to trigger the captcha after 6-10 pages, it just goes into a loop where it keeps showing the captcha constantly. It does stop, though, if I change the search terms.

confounded · on June 1, 2018

With a VPN on Brave on iOS, Google will only show me infinite captchas.

jardah · on Feb 14, 2018

Amazon is unfortunately not using any metadata for reviews (probably to prevent easy scraping by competing companies). You can only get it from the HTML (at least from what I can see).

jardah · on Feb 14, 2018

Depends on whether we access the website from a proxy that is known to the WAF. But for most websites it's just a single normal request. If it becomes an issue in the future, we could make a browser extension that does the analysis on the page loaded by the user, so that we don't have to use a proxy to connect to it. If you are talking about actually scraping the websites, then that is usually handled case by case. Mostly it works, but sometimes it's a bit harder to get around.

jardah · on Feb 13, 2018

Just a quick update: Thank you for using it and playing around with it. Looking at the usage and results, I found quite a lot of things to improve, which is great, since it's hard to develop something like this without real usage data.

jardah · on Feb 13, 2018

Yes, that is probably the problem. When I looked for the text, it returned:

[ 0:{ "selector":".bloc-blanc > p:nth-child(1)" "text":" 0 école(s) correspondent à votre recherche " } ]

jardah · on Feb 13, 2018

Aha! I see, it shows data based on a POST request from the form on this page: http://www.dsden93.ac-creteil.fr/spip/spip.php?page=annu1d so if you provide just a link to the results page without the POST data, it will show you nothing. Sadly, the tool currently does not allow sending POST requests to websites.
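For anyone who wants to fetch such a results page themselves, a minimal sketch of building the form submission with Python's standard library could look like this. The field name and value are hypothetical, and the target URL here is simply the form page linked above; the real action URL and field names have to be read from the form's HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Form page from the comment above; the form's real action URL may differ.
FORM_URL = "http://www.dsden93.ac-creteil.fr/spip/spip.php?page=annu1d"


def build_form_post(url, fields):
    """Build a POST request carrying URL-encoded form fields."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# "commune" is a made-up field name used only for illustration.
request = build_form_post(FORM_URL, {"commune": "Bobigny"})
```

Passing the request to urllib.request.urlopen would actually send it; the sketch stops before that so nothing goes over the network.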

guilamu · on Feb 13, 2018

Thanks for your replies. I've successfully parsed this page with other parsers, though.

Edit: the page changed and it's not working anymore. Sorry for the false alarm, my bad.

jardah · on Feb 13, 2018

When I open the link in my browser, it shows "0 école(s) correspondent à votre recherche" ("0 schools match your search") and no table, which is probably what happens to the analyzer too.