
System: Blocking Unwanted Spiders and Scrapers

If you're concerned about bandwidth, server resources, or simply want to protect your content from automated scrapers, you should realise that this isn't a fight that can be won. Having said that, here's a case study on how to recognise and block unwanted user agents from accessing your website, using mod_rewrite and an .htaccess file.

Recognising the Enemy

Let's start by scanning the logs to find out which IP addresses are making the most requests.

awk '{print $1}' combined_log | sort | uniq -c | sort -n | tail -40

Note: Replace combined_log with the location of your actual combined-log file, or use *combined_log to process several logfiles at once.

This returns a list of the top 40 IP addresses in terms of the number of requests. We'll stick to the top 5 for now:

   824 XX.173.68.90
   956 XX.184.192.199
   983 XXX.9.3.10
  1068 XXX.231.187.166
  1098 XXX.235.117.192

The value on the right of each line is an IP address; the number on the left is the total number of requests made from that address. We can make this a bit more informative by also displaying the domain associated with each address:

awk '{print $1}' combined_log | sort | uniq -c | sort -n | tail -40 \
  | awk '{print $2,$2,$1}' | logresolve \
  | awk '{printf "%6d %s (%s)\n",$3,$1,$2}'

Note: logresolve is a utility that ships with Apache; you'll need to know where it's installed on your server, and be able to call it.

This returns the same list, but with the IP addresses also being resolved into domain names:

   824 (XX.173.68.90)
   957 (XX.184.192.199)
   983 (XXX.9.3.10)
  1068 (XXX.231.187.166)
  1098 (XXX.235.117.192)

Now you might recognise some of the heavy-hitters. Some of them might be you, or a server script that runs periodically over the site (checking links or building a search index for example) or a legitimate search engine spider (Googlebot, msnbot, Slurp, ...). You should be able to safely ignore them.
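If the list is long, you can filter the known spiders out before counting, so the top of the list contains only the unknowns. A minimal sketch, assuming the legitimate agents identify themselves with Googlebot, msnbot or Slurp in the user agent string (the sample log and addresses below are made up for illustration):

```shell
# synthetic combined_log for illustration; a real one comes from Apache
cat > combined_log <<'EOF'
198.51.100.7 - - [10/Oct/2014:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
203.0.113.5 - - [10/Oct/2014:13:55:37 +0000] "GET /a HTTP/1.1" 200 512 "-" "Java/1.4.1_04"
203.0.113.5 - - [10/Oct/2014:13:55:38 +0000] "GET /b HTTP/1.1" 200 512 "-" "Java/1.4.1_04"
EOF

# exclude the known search-engine spiders (the user agent is field 6
# when splitting on double quotes), then count requests per address as before
awk -F\" '$6 !~ /Googlebot|msnbot|Slurp/' combined_log \
  | awk '{print $1}' | sort | uniq -c | sort -n | tail -40
```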

With the others, let's see what they're up to.

Divide and Conquer

Firstly, let's confirm the number of requests, and whether they're accessing a single site or several. We start with the IP address with the most requests.

grep -c XXX.235.117.192 *combined_log | grep -v \:0

Note: This is only going to be useful if you're scanning multiple logfiles.

In our case, the top IP address was only accessing a single site. That probably means they have a particular interest in that site, and may in fact be its owner or a regular user. You should check this out before taking any action.

The second IP address turned out to be accessing multiple, unrelated websites. This merits further investigation to see if it's a legitimate spider or something less wholesome.

Let's see exactly what they're after:

grep XXX.231.187.166 combined_log | more

Things you should be looking for now:

  • Timing of requests - regular, random, sporadic, ...
  • Pages requested - small sample, random sample, systematic, pages only, images only, ...
  • Server Status Codes - 200, 301, 401, 403, 404, ...
  • User Agent - none, browser, search engine, spider, ...
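Two of these checks are easy to script: the status codes and user agents for a single address can each be summarised with awk. A sketch using a synthetic log and address (in practice you'd grep your real combined_log for the address under investigation):

```shell
# synthetic combined_log for illustration
cat > combined_log <<'EOF'
203.0.113.5 - - [10/Oct/2014:13:55:37 +0000] "GET /a HTTP/1.1" 200 512 "-" "Java/1.4.1_04"
203.0.113.5 - - [10/Oct/2014:13:55:38 +0000] "GET /secret HTTP/1.1" 401 25 "-" "Java/1.4.1_04"
203.0.113.5 - - [10/Oct/2014:13:55:39 +0000] "GET /b HTTP/1.1" 200 512 "-" "Java/1.4.1_04"
EOF

# breakdown of status codes (field 9 in the combined format)
grep 203.0.113.5 combined_log | awk '{print $9}' | sort | uniq -c | sort -rn

# user agents seen from this address (field 6 when splitting on quotes)
grep 203.0.113.5 combined_log | awk -F\" '{print $6}' | sort | uniq -c
```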

For this IP address we see a typical spidering pattern, but no requests for robots.txt, no pause between requests and they triggered some 401 Unauthorised responses. Their user agent is always "Java/1.4.1_04".
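The robots.txt check in particular is worth scripting, since well-behaved spiders request that file before crawling. A sketch against a synthetic log (the addresses are made up; a count of zero means the client never asked):

```shell
# synthetic combined_log: one polite spider, one that never asks
cat > combined_log <<'EOF'
198.51.100.7 - - [10/Oct/2014:13:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
203.0.113.5 - - [10/Oct/2014:13:55:37 +0000] "GET /a HTTP/1.1" 200 512 "-" "Java/1.4.1_04"
EOF

# how many times did this client request robots.txt? (the path is field 7)
awk '$1 == "203.0.113.5" && $7 == "/robots.txt"' combined_log | wc -l
```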

The other three IP addresses returned similar results, the only difference being that the user agent varied slightly. We've made an 'executive decision' that they're all going to be blocked.

Turning Back the Tide

The following lines added to your .htaccess file will block any request coming from an IP address starting with XXX.23x or XX.1xx where the User Agent starts with Java:

# anonymous Java-based spiders
RewriteCond %{REMOTE_HOST} ^XXX\.23[0-9] [OR]
RewriteCond %{REMOTE_HOST} ^XX\.1[0-9][0-9]
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule .* - [F]

We could also have listed the IP addresses one at a time, but other tests showed a number of 'Java' user agents coming from the same ISP, or at least the same IP-block.

You need to be a little bit careful here. Sometimes an IP address belongs to a proxy server, which means that an entire organisation, or thousands of subscribers to an ISP, could be affected if your rules are too indiscriminate. Unless you're certain that an IP address or IP-block is only being used by malcontents, always add a User Agent RewriteCond so you don't end up blocking legitimate users.
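If the user agent test is all you need, mod_setenvif can express the same block without mod_rewrite. A minimal sketch using the Apache 2.2-era Order/Deny syntax (Apache 2.4 would use Require instead; requires mod_setenvif to be loaded):

```apache
# block any request whose User-Agent starts with "Java"
SetEnvIf User-Agent "^Java" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```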

Repairing the Breach

Having turned up a number of 'Java' agents that we decided to block, we might want to investigate other requests with similar user agents. For example, user agents starting with Java:

awk -F\" '($6 ~ /^Java/)' combined_log | awk '{print $1}' | sort | uniq -c | sort -n

This returns a list of IP addresses similar to those above. The addresses we've already picked up will appear at the top, and you'll also see the less-active ones and have the option to investigate further.
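The -F\" in that command splits each line on double quotes, which is what puts the user agent into field 6 of the combined format. A quick check against a single synthetic log line (made up for illustration):

```shell
# one synthetic combined-log line
line='192.0.2.9 - - [10/Oct/2014:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "Java/1.4.1_04"'

# splitting on double quotes leaves the user agent in field 6
echo "$line" | awk -F\" '{print $6}'
```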

As we mentioned at the start of this page, you're never going to be able to block all the 'bad' agents while letting in the good ones. There are simply too many possible variations. There are automated solutions for blocking IP addresses on a temporary or permanent basis based on behaviour, but that's a whole different ball-game.


User Comments


14 August, 2014

Thanks for your reply to my question. Much appreciated.

Running with that example, I created a script that works for me. Not sure how elegant it is, but I'll share it in case someone else needs it:

TOP40=$(awk '{print $1}' $LOGFILE | sort | uniq -c | sort -rn | head -40 | awk '{print $2,$1}')
echo "$TOP40" | while read IP SCORE; do
  grep $IP $LOGFILE | tail -1 \
    | awk -v ip="$IP" -v score="$SCORE" '{print ip,ip,substr($4,2),score}' \
    | logresolve | awk '{printf "%6d %s (%s) %s\n",$4,$1,$2,$3}'
done

13 August, 2014

How would you include the date & time of last visit in the command from Section 1? I can awk the log file to list the date & time like this:

awk '{print substr($4,2)}' combined_log

But I can't figure out how to get that included in the list of the top 40 visitors (as per your example).

It can't be done in a single command. You would need to run a loop over the IP addresses. This might help:

TOP40=$(awk '{print $1}' $LOGFILE | sort | uniq -c | sort -n | tail -40 | awk '{print $2}')
for ip in $TOP40; do
  grep $ip $LOGFILE | tail -1 | awk -v var="$ip" '{print var,substr($4,2)}'
done

20 November, 2013

A nicer way to block unwanted clients is using ipset and iptables. You add an iptables rule a bit like this:

ipset create crawlers hash:ip
ipset add crawlers XX.173.68.90
ipset add crawlers XX.184.192.199
ipset add crawlers XXX.9.3.10
ipset add crawlers XXX.231.187.166
ipset add crawlers XXX.235.117.192
iptables -A INPUT -p tcp --dport 80 --match set --match-set crawlers src -j REJECT

This has the advantage that it can be modified on the fly from the command line, without having to restart Apache.

Thanks. I'll have to check out 'ipset'. We mostly just use Fail2Ban.

12 May, 2011

Nice post, but you should use
uniq -c | sort -n | tail -40
instead of
uniq -c | sort | tail -40
to do numeric sorting.

Updated now. Thanks