
PHP: Parsing robots.txt

If you're writing any kind of script that involves fetching HTML pages or files from another server you really need to make sure that you follow netiquette - the "unofficial rules defining proper behaviour on Internet".

This means that your script needs to:

  1. identify itself using the User Agent string including a URL;
  2. check the site's robots.txt file to see if they want you to have access to the pages in question; and
  3. not flood their server with too-frequent, repetitive or otherwise unnecessary requests.

If you don't meet these requirements then don't be surprised if they retaliate by blocking your IP address and/or filing a complaint. This article presents methods for achieving the first two goals, but the third is up to you.
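As a starting point for the third requirement, the following is a minimal sketch of a per-host delay. The function name polite_wait and the one-second default are assumptions for illustration only; choose an interval that suits the sites you are visiting.

```php
<?php
# Hypothetical helper: pauses so that successive requests to the same host
# are at least $mininterval seconds apart. Returns the seconds waited.
function polite_wait($host, $mininterval = 1.0)
{
  static $lastrequest = array();

  $waited = 0.0;
  if(isset($lastrequest[$host])) {
    # time still to wait since the previous request to this host
    $remaining = $mininterval - (microtime(true) - $lastrequest[$host]);
    if($remaining > 0) {
      usleep((int) ceil($remaining * 1000000));
      $waited = $remaining;
    }
  }
  $lastrequest[$host] = microtime(true);

  return $waited;
}
```

Calling polite_wait() with the host name immediately before each request keeps the delay logic in one place. The timestamps only live for the duration of a single script run.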

Setting a User Agent

Before using any of the PHP file functions on a remote server you should decide on and set a sensible User Agent string. There are no real restrictions on what this can be, but some commonality is beginning to emerge.

The following formats are widely recognised:

  • www.example.net
  • NameOfAgent (http://www.example.net)
  • NameOfAgent/1.0 (http://www.example.net/bot.html)
  • NameOfAgent/1.1 (link checker; http://www.example.net/bot.html)
  • NameOfAgent/2.0 (link checker; http://www.example.net; webmaster@example.net)
  • ...

The detail you provide should be proportionate to the amount of activity you're going to generate on the targeted sites/servers. The NameOfAgent value should be chosen with care as there are a lot of established user agents and you don't want to have to change this later. Check your server log files and our directory of user agents for examples.

Once you've settled on a name, using it is as simple as adding the following line to the start of your script:

ini_set('user_agent', 'NameOfAgent (http://www.example.net)');
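If you would rather set the User Agent for individual requests than for the whole script, PHP's stream contexts offer an alternative. The sketch below assumes a helper name, agent_context, that is purely illustrative:

```php
<?php
# Hypothetical helper: builds a stream context carrying the given User Agent.
function agent_context($useragent)
{
  return stream_context_create(array(
    'http' => array('user_agent' => $useragent),
  ));
}

# Usage (needs network access, so shown as a comment only):
# $context = agent_context('NameOfAgent (http://www.example.net)');
# $html = file_get_contents('http://www.example.net/', false, $context);
```

The context is passed as the third argument to file_get_contents(), so other requests made by the same script are unaffected.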

By passing a User Agent string with all requests you run less risk of your IP address being blocked, but you also take on some extra responsibility. People will want to know why your script is accessing their site. They may also expect it to follow any restrictions defined in their robots.txt file...

Parsing robots.txt

That brings us to the purpose of this article - how to fetch and parse a robots.txt file.

The following script is useful if you only want to fetch one or two pages from a site (to check for links to your site for example). It will tell you whether a given user agent can access a specific page.

If you're building a search engine spider or intend to download a lot of files then you should implement a caching mechanism so that the robots.txt file only needs to be fetched once every day or so.
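One way to approach such a cache is to keep a copy of each robots.txt file on disk and re-use it until it is a day old. This is only a sketch: the function name cached_robotstxt, the cache file naming and the optional $fetch callback (included so the download step can be swapped out) are all assumptions, and the cache directory must exist and be writable.

```php
<?php
# Hypothetical helper: returns the lines of a host's robots.txt, re-using a
# copy cached on disk if it is younger than $maxage seconds.
function cached_robotstxt($host, $cachedir, $maxage = 86400, $fetch = null)
{
  $cachefile = rtrim($cachedir, '/') . '/robots-' . md5($host) . '.txt';

  # serve from the cache while the stored copy is fresh enough
  if(is_file($cachefile) && (time() - filemtime($cachefile)) < $maxage) {
    return file($cachefile);
  }

  # $fetch is an assumed hook; by default the file is downloaded over HTTP
  $contents = $fetch
    ? call_user_func($fetch, $host)
    : @file_get_contents("http://{$host}/robots.txt");

  # store a missing file as empty so it is not re-requested on every call
  if($contents === false) $contents = '';
  file_put_contents($cachefile, $contents);

  return ($contents === '') ? array() : file($cachefile);
}
```

The returned array has the same shape as the output of file(), so it can be fed to the same line-by-line parsing used for a freshly downloaded robots.txt.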

# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

function robots_allowed($url, $useragent = false)
{
  # parse url to retrieve host and path (a URL with no path means '/')
  $parsed = parse_url($url);
  $path = isset($parsed['path']) ? $parsed['path'] : '/';

  # build a pattern matching '*' and, if supplied, our own user agent
  $agents = array(preg_quote('*', '/'));
  if($useragent) $agents[] = preg_quote($useragent, '/');
  $agents = implode('|', $agents);

  # location of robots.txt file
  $robotstxt = @file("http://{$parsed['host']}/robots.txt");
  if(!$robotstxt) return true;

  $rules = array();
  $ruleapplies = false;
  foreach($robotstxt as $line) {
    # skip blank lines
    if(!$line = trim($line)) continue;

    # following rules only apply if User-agent matches $useragent or '*'
    if(preg_match('/^User-agent:(.*)/i', $line, $match)) {
      $ruleapplies = preg_match("/($agents)/i", $match[1]);
    }
    if($ruleapplies && preg_match('/^Disallow:(.*)/i', $line, $regs)) {
      # an empty rule implies full access - no further tests required
      if(trim($regs[1]) === '') return true;
      # add rules that apply to array for testing
      $rules[] = preg_quote(trim($regs[1]), '/');
    }
  }

  foreach($rules as $rule) {
    # check if page is disallowed to us
    if(preg_match("/^$rule/", $path)) return false;
  }

  # page is not disallowed
  return true;
}

Note: This script is designed to parse a well-formed robots.txt file with no in-line comments. Each call to the script will result in the robots.txt file being downloaded again. A missing robots.txt file or a Disallow statement with no argument will result in a return value of true.
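If you need to cope with in-line comments, one option is to strip everything from the first '#' character before a line is tested. The helper below is a sketch of that idea, not part of the original script, and its name is an assumption:

```php
<?php
# Hypothetical helper: removes a trailing '#' comment from a robots.txt line
# and trims the surrounding whitespace.
function strip_robots_comment($line)
{
  $pos = strpos($line, '#');
  if($pos !== false) $line = substr($line, 0, $pos);

  return trim($line);
}
```

A line consisting only of a comment comes back as an empty string, so it would be skipped in the same way as a blank line.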

The script can be called as follows:

$canaccess = robots_allowed("http://www.example.net/links.php");
$canaccess = robots_allowed("http://www.example.net/links.php", "NameOfAgent");

or, in practice:

$url = "http://www.example.net/links.php";
if(robots_allowed($url, "NameOfAgent")) {
  # access granted
  $tmp = file_get_contents($url);
} else {
  # access disallowed
}

If you don't pass a value for the second parameter then the script will only check for global rules - those under '*' in the robots.txt file. If you do pass the name of an agent then the script also finds and applies rules specific to that agent.
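To see how that selection works in isolation, the sketch below repeats the agent-matching test from the script in a standalone function. The function name agent_rule_applies is made up for this illustration. Note that the match is a case-insensitive substring test, so a User-agent value that merely contains your agent's name will also match.

```php
<?php
# Illustrative helper (not part of the original script): reports whether a
# User-agent value from robots.txt would be applied for the given agent name.
function agent_rule_applies($value, $useragent = false)
{
  $agents = array(preg_quote('*', '/'));
  if($useragent) $agents[] = preg_quote($useragent, '/');
  $pattern = '/(' . implode('|', $agents) . ')/i';

  return preg_match($pattern, $value) === 1;
}

var_dump(agent_rule_applies('*'));                          # bool(true): global rules always apply
var_dump(agent_rule_applies('NameOfAgent'));                # bool(false): no agent name was passed
var_dump(agent_rule_applies('nameofagent', 'NameOfAgent')); # bool(true): case-insensitive match
```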

For more information on the robots.txt file see the links below.
