Have you tried including the reCAPTCHA v3 library and looking at the distribution of scores? -- https://developers.google.com/recaptcha/docs/v3 -- "reCAPTCHA v3 returns a score for each request without user friction"
It obviously depends on how motivated the scrapers are (e.g. whether their headless browsers are actually headless, and/or doing everything they can to not appear headless, whether Google has caught on to their latest tricks, etc.), but it would at least be interesting to look at the score distribution and then see whether you can cut off or slow down requests scoring below 0.3 (or redirect them to your API docs).
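For what it's worth, here is a minimal sketch of that server-side flow, assuming a Flask backend where the client sends its v3 token in an `X-Recaptcha-Token` header (the header name, route names, and the 0.3 cutoff are all my own placeholders; the `siteverify` endpoint and the `score` field are from Google's documented API):

```python
# Sketch only: log scores first to see the distribution, then enforce a cutoff.
import os
import requests
from flask import Flask, request, jsonify, redirect

app = Flask(__name__)
RECAPTCHA_SECRET = os.environ["RECAPTCHA_SECRET"]  # v3 secret key
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"  # documented verify endpoint
SCORE_CUTOFF = 0.3  # tune after looking at your own score distribution

def recaptcha_score(token: str, remote_ip: str) -> float:
    """Verify a v3 token with Google and return the 0.0-1.0 score."""
    resp = requests.post(
        VERIFY_URL,
        data={"secret": RECAPTCHA_SECRET, "response": token, "remoteip": remote_ip},
        timeout=5,
    )
    data = resp.json()
    if not data.get("success"):
        return 0.0  # invalid or expired token: treat as bot-like
    return float(data.get("score", 0.0))

@app.route("/search")  # hypothetical endpoint worth protecting
def search():
    token = request.headers.get("X-Recaptcha-Token", "")
    score = recaptcha_score(token, request.remote_addr)
    # Log every score so you can plot the distribution before enforcing anything.
    app.logger.info("recaptcha score=%.2f path=%s", score, request.path)
    if score < SCORE_CUTOFF:
        # Low-scoring traffic: redirect to the API docs rather than hard-blocking.
        return redirect("/api-docs", code=302)
    return jsonify(results=[])  # ... normal handler
```

You'd probably want to run it in log-only mode for a while and only turn on the redirect once the distribution shows a clear cluster of low-scoring requests.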
For web scraping specifically, I’ve developed key parts of commercial systems to automatically bypass reCAPTCHA, Arkose Labs (Fun Captcha), etc.
If someone dedicated themselves to it, there’s a lot more that these solutions could be doing to distinguish between humans and bots, but it requires true specialized talent and larger expenses.
Also, for a handful of the companies which make the most popular captcha solutions, I don’t think the incentives align properly to fully segregate human and bot traffic at this time.
I think we’re still very much picking at the lowest-hanging fruit, both for anti-bot countermeasures and anti-anti-bot counter-countermeasures.
Personally I believe this will finally accelerate once AIs can play computer games via a camera, keyboard, and mouse, and when successors to GPT-3 / PaLM can participate well in niche discussion forums like Hacker News or the Rust Discord server.
Until then it’s mainly a cost filter or confidence modification. As long as enough bots are blocked so that the ones which remain are technically competent enough to not stress the servers, most companies don’t care. And as long as the businesses deploying reCAPTCHA are reasonably confident that most of the views they get are humans (even if that belief is false), Google doesn’t have a strong incentive to improve the system.
Reddit doesn’t seem to care much either. As long as the bots which participate are “good enough”, it drives engagement metrics and increases revenue.
Scrapers can pay a commercial service to Mechanical Turk their way through reCAPTCHA. It adds meaningfully to scraping costs at scale, but scraping can still be profitable.
It sounds great, until you have Chinese customers. That’s when you’ll find out that reCAPTCHA just doesn’t really work in China, and have to begrudgingly ditch it altogether…