Even if you use a distributed in-memory cache to store the DNS lookup results? If the User-Agent contains "Googlebot", check the client IP against the cache: if it's cached as "Google", continue; if it's cached as "Not Google", display the error page. If the client IP isn't in the cache, do the DNS lookup and then update the cache – or you could even let the request continue on a cache miss and do the DNS lookup on another thread. That might let a Googlebot faker get a few requests through, but they'll be blocked very quickly unless they keep changing IPs (something your average paywall bypasser will find hard).
You don't need to consult the cache for every request, only for requests whose User-Agent contains "Googlebot" (and any other bots you're choosing to let through your paywall, e.g. Bingbot).
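For concreteness, here's a minimal Python sketch of that flow. It assumes a plain in-process dict standing in for the distributed cache, does the lookup synchronously on a miss (rather than on another thread as suggested above), and uses the reverse-then-forward DNS check Google documents for verifying Googlebot; the function names and return values are illustrative.

```python
import socket

# In-process stand-in for the distributed in-memory cache.
verdict_cache = {}  # client IP -> "Google" | "Not Google"

def is_verified_googlebot(client_ip):
    """Reverse (PTR) lookup, then forward-confirm the returned hostname."""
    cached = verdict_cache.get(client_ip)
    if cached is not None:
        return cached == "Google"
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)
        forward_ips = socket.gethostbyname_ex(hostname)[2]
        genuine = (hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")) \
            and client_ip in forward_ips
    except OSError:
        # Lookup failed or timed out -- treat as unverified.
        genuine = False
    verdict_cache[client_ip] = "Google" if genuine else "Not Google"
    return genuine

def handle_request(user_agent, client_ip):
    # Only requests claiming to be Googlebot ever touch the cache or DNS.
    if "Googlebot" in user_agent and not is_verified_googlebot(client_ip):
        return "error page"
    return "full article"
```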
> Cheaper to look for Google ASNs.
How does one do that as an application developer? Do you need to speak to a BGP server?
A local caching DNS server should offer high-speed responses, and you could build out a tier if necessary. High-capacity, high-speed DNS is not a greenfield problem.
Genuine Google requests will typically be cached quickly, and given HTTP status codes, you could return a non-permanent error whilst you're adding new entries, so there's that.
The problem is that your misses may come from all over, so they're both expensive and (potentially) numerous.
Even if Googlebot is only a fraction of requests, it's a large fraction, and I haven't investigated how many of those are suspect.
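As a sketch of the "non-permanent error while entries are being added" idea: one option is to answer a cache miss with a temporary 503 plus Retry-After while a background thread fills the cache, so a genuine crawler simply retries and then hits the populated cache. The names (verdict_cache, verify_and_cache) and the framework-free structure below are illustrative assumptions, not a specific implementation.

```python
import threading

verdict_cache = {}   # client IP -> "Google" | "Not Google"
pending = set()      # IPs with an in-flight verification
lock = threading.Lock()

def verify_and_cache(client_ip):
    """Placeholder for the reverse+forward DNS check; stores the verdict when done."""
    verdict_cache[client_ip] = "Not Google"   # result of the real lookup goes here
    with lock:
        pending.discard(client_ip)

def respond(client_ip):
    verdict = verdict_cache.get(client_ip)
    if verdict == "Google":
        return 200, {}                        # verified crawler: serve the page
    if verdict == "Not Google":
        return 403, {}                        # known faker: refuse
    with lock:
        if client_ip not in pending:
            pending.add(client_ip)
            threading.Thread(target=verify_and_cache,
                             args=(client_ip,), daemon=True).start()
    # Cache still warming: non-permanent error so a genuine crawler retries shortly.
    return 503, {"Retry-After": "5"}
```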
For ASN data, either the CIDR Report or Routeviews.org offers information. The latter has downloadable tables, in CIDR format, for IP-based lookups. I've long advocated that this be incorporated into routing hardware, with reputation data allowing for line-speed determination of traffic acceptability and rate-limiting. Shifts toward VPN and Tor are making this slightly less useful than previously, but it should still be a generally worthwhile option.
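To answer the application-developer question more concretely: you don't need to speak BGP yourself, you can work from a downloaded table. A minimal Python sketch, assuming the table has been normalised to one "prefix ASN" pair per line and that AS15169 is the Google origin ASN of interest (both the file format and the ASN set are assumptions to check against the actual download):

```python
import ipaddress

GOOGLE_ASNS = {"15169"}  # illustrative; verify which ASNs you actually want to allow

def load_prefixes(path, wanted_asns):
    """Keep only the prefixes originated by the ASNs we're interested in."""
    prefixes = []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) < 2:
                continue
            prefix, asn = parts[0], parts[1]
            if asn in wanted_asns:
                prefixes.append(ipaddress.ip_network(prefix, strict=False))
    return prefixes

def ip_in_prefixes(ip, prefixes):
    """True if the address falls inside any of the loaded prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in prefixes)

# Usage (path and data are illustrative):
# google_nets = load_prefixes("prefix2as.txt", GOOGLE_ASNS)
# allowed = ip_in_prefixes(client_ip, google_nets)
```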