As far as we know, the webdriver property is the only one that still differs when you run non-headless Chrome with Puppeteer. The rest can be handled by using non-headless Chrome, as mentioned in the article.
But you are right. Having read through it again, I agree that this section of the article should be improved.
That is kinda sad to hear. The approach should always be to take the path of least resistance and of smallest effect on the website. So, for example, if a company has an API that can be used instead of scraping their website, it's always preferable to use the API. The same goes for the XML you mentioned.
It's bad that not everyone works like this; there are quite a lot of people who would rather brute-force a solution than think about it.
Yeah, it was a very general example, since there is at least one rule that is based on rate limiting too, and this limit of 300 requests per IP is what I have seen on average.
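To make the per-IP limit a bit more concrete, here is a minimal sketch of how such a rule could be counted. It is not how any particular WAF does it; the class name, the sliding-window approach, and the one-hour window are all my assumptions (only the 300 figure comes from the example above).

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-IP request counter over a sliding time window."""

    def __init__(self, limit=300, window_seconds=3600.0):
        # limit=300 mirrors the example above; the window length is an assumption.
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` stays under the limit."""
        timestamps = self.hits[ip]
        # Forget requests that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False  # over the limit: this is where a captcha or block would kick in
        timestamps.append(now)
        return True
```

Each IP gets its own queue, so one noisy client does not affect the others, and old requests stop counting once they leave the window.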
> But that is why they only show reCAPTCHA, you fill it in and you will get an exemption cookie for 30 more requests :D
Does that actually work? Whenever I have been searching for some obscure things and managed to get the captcha after 6-10 pages, it just goes into a loop where it keeps showing it constantly. It does stop showing it, though, if I change the search terms.
Unfortunately, Amazon does not use any metadata for reviews (probably to prevent easy scraping by competing companies). You can only get the data from the HTML, at least from what I can see.
Depends on whether we access the website from a proxy that is known to the WAF. But for most websites it's just a single normal request. If that becomes an issue in the future, we could build a browser extension that runs the analysis on the page loaded by the user, so that we don't have to connect through a proxy. If you are talking about actually scraping the websites, then that is usually handled case by case. Mostly it works, but sometimes it's a bit harder to get around.
Just a quick update: thank you for using it and playing around with it. Looking at the usage and results, I found quite a lot of things to improve, which is great, since it's hard to develop something like this without real usage data.
Aha, I see! It shows data based on a POST request from the form on this page: http://www.dsden93.ac-creteil.fr/spip/spip.php?page=annu1d so if you provide just a link to the results page without the POST data, it will show you nothing. Sadly, the tool currently does not allow sending POST requests to websites.
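For anyone who wants to fetch such a results page themselves, here is a rough sketch of what the form submission looks like using only Python's standard library. The helper name, the target URL, and the field names are hypothetical; the real field names and form action have to be read from the form's HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, fields):
    """Build (but do not send) a POST request carrying URL-encoded form fields."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,  # supplying data makes urllib use the POST method
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical form action and field names, for illustration only.
request = build_form_post("https://example.org/search", {"nom": "ecole", "ville": "Paris"})
print(request.get_method(), request.data)
```

Passing the request to `urllib.request.urlopen` would actually send it. Without that form body the server only sees a plain GET, which is why the bare results link comes back empty.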