Blocking fake Googlebot and bingbot spiders
It wasn't always like this. In the good ol' days of the Internet you could (mostly) trust that a user agent would properly identify itself. And search engines in particular would publish a list of the IP addresses they were using for indexing websites.
These days, however, hackers and spammers often try to pass themselves off as Googlebot or bingbot to get past other filters, and the only way to tell the difference is to perform a DNS lookup on the requesting IP address.
Blocking fake bots using Apache directives
Fortunately, there's a very simple approach we can use in Apache 2.4: applying an extra authorization check just for user agents that identify themselves as search engine spiders.
This relies on mod_authz_host, which should be enabled on most hosts running Apache:
<If "%{HTTP_USER_AGENT} =~ /bingbot/ && ! -n %{HTTP:X-FORWARDED-FOR}">
Require host .search.msn.com
</If>
<If "%{HTTP_USER_AGENT} =~ /Googlebot/ && ! -n %{HTTP:X-FORWARDED-FOR}">
Require host .google.com .googlebot.com
</If>
This can be added to your server, vhost, or local .htaccess configuration, before any rewriting or redirecting. In theory there will be no slowdown for regular users of the website as the lookup is only performed conditionally based on the user agent.
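Once the rules are in place you can test them with curl from any machine outside the allowed domains. A quick check, assuming your site is at www.example.com (a placeholder): the first request, spoofing the Googlebot user agent, should be rejected with a 403, while the second, with a regular browser user agent, goes through:

% curl -s -o /dev/null -w "%{http_code}\n" -A "Googlebot/2.1" https://www.example.com/
403
% curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" https://www.example.com/
200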
The HTTP:X-FORWARDED-FOR check is there to avoid blocking the IP addresses of a CDN or similar service that forwards traffic to the website. Blocking your CDN's IP addresses is not a good idea.
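You can see the effect of this check by supplying the header yourself; with X-Forwarded-For present the <If> expression no longer matches, so even a fake bot user agent is let through (203.0.113.10 is a documentation address used here as a stand-in):

% curl -s -o /dev/null -w "%{http_code}\n" -A "bingbot" -H "X-Forwarded-For: 203.0.113.10" https://www.example.com/
200

The flip side is that anyone can send this header, so the exemption only makes sense when genuine visitors can't reach the site other than through your CDN.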
How does it actually work?
The Apache 2.4 <If> directive uses a regular expression to test whether the supplied User-Agent string contains the search engine spider name, in this case either 'bingbot' or 'Googlebot'.
When there is a match, an Apache Access Control rule is applied, using the Require directive to restrict access to specific hostnames.
In our example above, we only want to allow user agents identifying as 'bingbot' if they come from a *.search.msn.com IP address, and 'Googlebot' can come from addresses matching either *.google.com or *.googlebot.com.
In doing this, Apache "performs a double reverse DNS lookup on the client IP address", meaning that it does a reverse DNS lookup to find the hostname, followed by a forward lookup to confirm that the hostname resolves back to the original IP address:
% host 40.77.167.122
122.167.77.40.IN-ADDR.ARPA domain name pointer msnbot-40-77-167-122.search.msn.com.
% host msnbot-40-77-167-122.search.msn.com
msnbot-40-77-167-122.search.msn.com has address 40.77.167.122
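The same double lookup is easy to script if you want to spot-check addresses from your logs. Below is a minimal sketch (verify-bot.sh is our own name, not a standard tool) using the host command and the allowed domains from the configuration above:

#!/bin/sh
# verify-bot.sh - double reverse DNS check for a client IP (sketch)
IP="$1"

# reverse lookup: IP -> hostname, stripping the trailing dot
NAME=$(host "$IP" | awk '/domain name pointer/ {print $NF}' | sed 's/\.$//')

case "$NAME" in
*.search.msn.com|*.google.com|*.googlebot.com)
    # forward lookup: the hostname must resolve back to the original IP
    if host "$NAME" | grep -q "has address ${IP}$"; then
        echo "genuine spider ($NAME)"
    else
        echo "FAKE - forward lookup does not match"
    fi
    ;;
*)
    echo "FAKE - reverse DNS is outside the allowed domains"
    ;;
esac

Running it against the address above should report a genuine spider:

% sh verify-bot.sh 40.77.167.122
genuine spider (msnbot-40-77-167-122.search.msn.com)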
If you want to be a bit less strict, and faster, you can replace Require host with Require forward-dns (available since 2.4.19). The forward-dns provider performs only a single forward DNS lookup, resolving the supplied hostname and comparing the results against the client IP address, with no reverse lookup involved. The catch is that it only works with exact hostnames, not partial domain names like .search.msn.com, so it can't verify a whole pool of crawler addresses in the same way.
If a reverse lookup of the IP address returns 3(NXDOMAIN), it may be a problem with your DNS resolvers, or due to a DNSSEC issue (see below).
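A failing reverse lookup will look something like this (the exact wording varies between versions of the host utility):

% host 40.77.167.122
Host 122.167.77.40.in-addr.arpa not found: 3(NXDOMAIN)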
Microsoft's DNSSEC problem
Starting in August 2020, some Microsoft IP addresses began disappearing from DNS resolvers due to a misconfiguration at Microsoft's end. This applies in particular to the 40.77.167.0/24 subnet.
If you are using Require host .search.msn.com this can result in bingbot spiders from the affected subnet being blocked. Because forward-dns does not rely on reverse DNS it is not affected by this problem, though as noted above it can only match exact hostnames.
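One way to confirm that DNSSEC is the culprit, assuming you have dig available, is to repeat the reverse lookup with validation disabled. The +cd (checking disabled) flag tells the resolver to skip DNSSEC validation, so a name that only comes back with +cd points to a broken DNSSEC setup rather than a missing PTR record:

% dig +short -x 40.77.167.122
% dig +short +cd -x 40.77.167.122
msnbot-40-77-167-122.search.msn.com.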
Support for other search spiders
These entries are correct at the time of writing, but do sometimes change:
Yahoo!
<If "%{HTTP_USER_AGENT} =~ /Yahoo! Slurp/">
Require host .crawl.yahoo.net
</If>
Yandex
<If "%{HTTP_USER_AGENT} =~ /Yandex(Bot|Image)/">
Require host .spider.yandex.com
</If>
Baidu
<If "%{HTTP_USER_AGENT} =~ /Baiduspider/">
Require host .crawl.baidu.com
</If>
References
- Apache 2.4 Access Control
- Apache Module mod_authz_host
- Verifying Googlebot and other Google crawlers
- Verifying Bingbot