
PHP: Parsing robots.txt

If you're writing any kind of script that involves fetching HTML pages or files from another server, you really need to make sure that you follow netiquette - the "unofficial rules defining proper behaviour on the Internet".

This means that your script needs to:

  1. identify itself using the User Agent string including a URL;
  2. check the site's robots.txt file to see if they want you to have access to the pages in question; and
  3. not flood their server with too-frequent, repetitive or otherwise unnecessary requests.

If you don't meet these requirements then don't be surprised if they retaliate by blocking your IP address and/or filing a complaint. This article presents methods for achieving the first two goals, but the third is up to you.
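
A crude starting point for the third is simply to pause between successive requests. For example (the $urls array here is just a hypothetical list of pages you intend to fetch):

<?PHP
  // crude rate limiting - pause for a couple of seconds between requests
  // ($urls is a hypothetical list of pages to fetch)
  foreach($urls as $url) {
    $html = file_get_contents($url);
    // ... process $html ...
    sleep(2); // adjust the delay to suit the target server and your volume of requests
  }
?>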

Setting a User Agent

Before using any of the PHP file functions on a remote server you should decide on and set a sensible User Agent string. There are no real restrictions on what this can be, but some commonality is beginning to emerge.

The following formats are widely recognised:

  • www.example.net
  • NameOfAgent (http://www.example.net)
  • NameOfAgent/1.0 (http://www.example.net/bot.html)
  • NameOfAgent/1.1 (link checker; http://www.example.net/bot.html)
  • NameOfAgent/2.0 (link checker; http://www.example.net; webmaster@example.net)
  • ...

The detail you provide should be proportionate to the amount of activity you're going to generate on the targeted sites/servers. The NameOfAgent value should be chosen with care as there are a lot of established user agents and you don't want to have to change this later. Check your server log files and our directory of user agents for examples.

Once you've settled on a name, using it is as simple as adding the following line to the start of your script:

<?PHP ini_set('user_agent', 'NameOfAgent (http://www.example.net)'); ?>

By passing a User Agent string with all requests you run less risk of your IP address being blocked, but you also take on some extra responsibility. People will want to know why your script is accessing their site. They may also expect it to follow any restrictions defined in their robots.txt file...
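
The user_agent ini setting covers the built-in file functions for the rest of the script. If you only want to identify yourself for a single request, a stream context does the same job - a minimal sketch, assuming allow_url_fopen is enabled and using a placeholder target URL:

<?PHP
  // set the User Agent for a single request using a stream context
  $context = stream_context_create([
    'http' => [
      'user_agent' => 'NameOfAgent (http://www.example.net)',
    ],
  ]);
  $html = file_get_contents('http://www.example.org/page.html', false, $context);
?>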

Parsing robots.txt

That brings us to the purpose of this article - how to fetch and parse a robots.txt file.

The following script is useful if you only want to fetch one or two pages from a site (to check for links to your site for example). It will tell you whether a given user agent can access a specific page.

For a search engine spider, or a script that intends to download a lot of files, you should implement a caching mechanism so that the robots.txt file only needs to be fetched once a day or so (a simple approach is sketched further down the page).

<?PHP
  namespace Chirp;

  // Original PHP code by Chirp Internet: www.chirpinternet.eu
  // Please acknowledge use of this code by including this header.

  function robots_allowed($url, $useragent = FALSE)
  {
    // build array of valid user agents
    $agents = [ preg_quote('*') ];
    if($useragent) {
      $agents[] = preg_quote($useragent, "/");
    }
    $agents = implode('|', $agents);

    // parse url to retrieve scheme and hostname
    $parsed = parse_url($url);

    // location of robots.txt file
    $curl_target = "{$parsed['scheme']}://{$parsed['host']}/robots.txt";

    $curl_opts = [
      CURLOPT_FOLLOWLOCATION => TRUE,
      CURLOPT_USERAGENT => $useragent,
    ];

    // fetch robots.txt file
    $robotstxt = http_get_contents($curl_target, $curl_opts);

    // if there isn't a robots.txt then we're allowed in
    if(empty($robotstxt)) {
      return TRUE;
    }

    $rules = [];
    $rule_applies = FALSE;

    // first line of robots.txt file
    $line = strtok($robotstxt, "\r\n");
    while(FALSE !== $line) {
      $line = trim($line);

      // following rules only apply if User-agent matches $useragent or '*'
      if(preg_match('/^\s*User-agent: (.*)/i', $line, $match)) {
        $rule_applies = preg_match("/($agents)/i", $match[1]);
      } elseif($rule_applies && preg_match('/^\s*Disallow:(.*)/i', $line, $regs)) {
        // an empty rule implies full access - no further tests required
        if(!trim($regs[1])) {
          return TRUE;
        }
        // add rules that apply to array for testing
        $rules[] = preg_quote(trim($regs[1]), "/");
      }

      // next line of robots.txt file
      $line = strtok("\r\n");
    }

    // default to the root path if the url has no path component
    $path = $parsed['path'] ?? "/";

    foreach($rules as $rule) {
      // check if page is disallowed to us
      if(preg_match("/^{$rule}/", $path)) {
        return FALSE;
      }
    }

    // page is not disallowed
    return TRUE;
  }

This script is designed to parse a well-formed robots.txt file with no in-line comments. Each call to the script will result in the robots.txt file being downloaded again. A missing robots.txt file or a Disallow statement with no argument will result in a return value of TRUE granting access.
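
Each call also means another HTTP request for robots.txt, so if you're checking a lot of pages on the same site it's worth caching the file locally. The following is only a minimal sketch of the idea - the cache location, lifetime and function name are arbitrary choices - re-using a copy from the system temp directory for 24 hours:

<?PHP
  // fetch robots.txt for a host, re-using a locally cached copy for 24 hours
  function cached_robotstxt($scheme, $host, array $curl_opts = [])
  {
    $cachefile = sys_get_temp_dir() . "/robots-" . md5($host) . ".txt";
    if(file_exists($cachefile) && (time() - filemtime($cachefile)) < 86400) {
      // cached copy is less than a day old - use it
      // (reading a local file works even where remote fopen wrappers are disabled)
      return file_get_contents($cachefile);
    }
    // otherwise fetch a fresh copy and cache it for next time
    $robotstxt = http_get_contents("{$scheme}://{$host}/robots.txt", $curl_opts);
    file_put_contents($cachefile, (string)$robotstxt);
    return $robotstxt;
  }
?>

Inside robots_allowed you would then call cached_robotstxt($parsed['scheme'], $parsed['host'], $curl_opts) in place of the direct http_get_contents call.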

We have recently rewritten this code to remove file and file_get_contents, which are now blocked on many PHP servers, replacing them with our own http_get_contents function. We have also enabled following of redirects (CURLOPT_FOLLOWLOCATION).
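
If you don't have our http_get_contents function to hand, the following rough cURL-based stand-in - not the library version, just enough to run the examples on this page - uses the same calling convention: $url, an array of cURL options, and an optional array to receive the transfer details:

<?PHP
  // rough stand-in for http_get_contents (not the library version)
  // returns the response body, or FALSE on failure or an HTTP error status
  function http_get_contents($url, array $curl_opts = [], &$info = [])
  {
    $handle = curl_init($url);
    curl_setopt_array($handle, $curl_opts + [
      CURLOPT_RETURNTRANSFER => TRUE,
      CURLOPT_CONNECTTIMEOUT => 10,
      CURLOPT_TIMEOUT => 15,
    ]);
    $response = curl_exec($handle);
    $info = curl_getinfo($handle); // includes 'http_code' and 'url'
    curl_close($handle);
    if($response === FALSE || $info['http_code'] >= 400) {
      // treat 4xx/5xx as "no content" so a missing robots.txt grants access
      return FALSE;
    }
    return $response;
  }
?>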

The script can be called as follows:

$canaccess = robots_allowed("http://www.example.net/links.php");
$canaccess = robots_allowed("http://www.example.net/links.php", "NameOfAgent");

or, in practice:

<?PHP
  $url = "http://www.example.net/links.php";
  $useragent = "NameOfAgent";

  $curl_opts = [
    CURLOPT_FOLLOWLOCATION => TRUE,
    CURLOPT_USERAGENT => $useragent,
  ];
  $curlinfo = [];

  if(\Chirp\robots_allowed($url, $useragent)) {
    // access granted
    $page_contents = http_get_contents($url, $curl_opts, $curlinfo);
    echo "<p>({$curlinfo['http_code']}) {$curlinfo['url']}</p>\n";
    if(200 == $curlinfo['http_code']) {
      // success - we have the page contents
    } else {
      // error - we don't have the page
    }
  } else {
    // access disallowed
    die("Access disallowed by robots.txt: {$url}");
  }
?>

If you don't pass a value for the second parameter then the script will only check for global rules - those under '*' in the robots.txt file. If you do pass the name of an agent then the script also finds and applies rules specific to that agent.
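
For example, suppose (purely for illustration) that www.example.net served the following robots.txt:

User-agent: *
Disallow: /private/

User-agent: NameOfAgent
Disallow: /links.php

Calling robots_allowed("http://www.example.net/links.php") returns TRUE because only the '*' rules are checked and /links.php isn't under /private/, while robots_allowed("http://www.example.net/links.php", "NameOfAgent") returns FALSE because the agent-specific Disallow rule now also applies.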

For more information on the robots.txt file see the links below.

Allowing for the Allow directive

The following modified code has been supplied by Eric at LinkUp.com. It fixes a bug where a missing (404) robots.txt file would result in a false return value. It also adds extra code to cater for the Allow directive now recognised by some search engines.

The 404 checking requires the cURL module to be compiled into PHP. We haven't tested the Allow directive parsing ourselves, but we're confident it works. Please report any transcription errors.

<?PHP
  // Original PHP code by Chirp Internet: www.chirpinternet.eu
  // Adapted to include 404 and Allow directive checking by Eric at LinkUp.com
  // Please acknowledge use of this code by including this header.

  function robots_allowed($url, $useragent = false)
  {
    // parse url to retrieve host and path
    $parsed = parse_url($url);

    $agents = [ preg_quote('*') ];
    if($useragent) {
      $agents[] = preg_quote($useragent, '/');
    }
    $agents = implode('|', $agents);

    // location of robots.txt file, only pay attention to it if the server says it exists
    if(function_exists('curl_init')) {
      $handle = curl_init("http://{$parsed['host']}/robots.txt");
      curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
      $response = curl_exec($handle);
      $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
      if(200 == $httpCode) {
        $robotstxt = explode("\n", $response);
      } else {
        $robotstxt = FALSE;
      }
      curl_close($handle);
    } else {
      $robotstxt = @file("http://{$parsed['host']}/robots.txt");
    }

    // if there isn't a robots.txt, then we're allowed in
    if(empty($robotstxt)) {
      return true;
    }

    $rules = [];
    $rule_applies = FALSE;

    foreach($robotstxt as $line) {
      // skip blank lines
      if(!$line = trim($line)) continue;

      // following rules only apply if User-agent matches $useragent or '*'
      if(preg_match('/^\s*User-agent: (.*)/i', $line, $match)) {
        $rule_applies = preg_match("/($agents)/i", $match[1]);
        continue;
      }

      if($rule_applies) {
        list($type, $rule) = explode(':', $line, 2);
        $type = trim(strtolower($type));
        // add rules that apply to array for testing
        $rules[] = [
          'type' => $type,
          'match' => preg_quote(trim($rule), '/'),
        ];
      }
    }

    $isAllowed = TRUE;
    $currentStrength = 0;
    foreach($rules as $rule) {
      // check if page hits on a rule
      if(preg_match("/^{$rule['match']}/", $parsed['path'])) {
        // prefer longer (more specific) rules, and Allow trumps Disallow if rules are the same length
        $strength = strlen($rule['match']);
        if($currentStrength < $strength) {
          $currentStrength = $strength;
          $isAllowed = ("allow" == $rule['type']);
        } elseif($currentStrength == $strength && ("allow" == $rule['type'])) {
          $currentStrength = $strength;
          $isAllowed = TRUE;
        }
      }
    }

    return $isAllowed;
  }


Another option for the last section might be to first sort the $rules by length and then only check the longest ones for an Allow or Disallow directive as they will override any shorter rules.
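
One way to implement that suggestion, replacing the $isAllowed loop at the end of the function above (an untested sketch along the same lines, keeping the Allow-beats-Disallow tie-break):

  // sort so the longest (most specific) patterns come first, with Allow
  // ahead of Disallow when two patterns are the same length
  usort($rules, function($a, $b) {
    $diff = strlen($b['match']) - strlen($a['match']);
    if($diff) {
      return $diff;
    }
    if($a['type'] == $b['type']) {
      return 0;
    }
    return ("allow" == $a['type']) ? -1 : 1;
  });

  foreach($rules as $rule) {
    if(preg_match("/^{$rule['match']}/", $parsed['path'])) {
      // the first (most specific) matching rule decides the outcome
      return ("allow" == $rule['type']);
    }
  }

  // no rule matched the path
  return TRUE;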

Previously robots.txt could only be used to Disallow spiders from accessing specific directories, or the whole website. The Allow directive allows you to then grant access to specific subdirectories that would otherwise be blocked by Disallow rules.
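
For example, a (hypothetical) robots.txt along these lines blocks the /photos/ directory for all robots while still permitting the /photos/public/ subdirectory:

User-agent: *
Disallow: /photos/
Allow: /photos/public/

With the longest-match logic above, a request for /photos/public/holiday.jpg matches the longer Allow pattern and is therefore permitted.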

You should be careful using this, however, as it's not part of the original standard and not all search engines will understand it. On the other hand, if you're running a web spider, taking Allow rules into account will give you access to more pages.

References


User Comments


8 April, 2017

Hey,

I've just released a new library for checking robots.txt policies. You can have a look at it here: github.com/hugsbrugs/php-robots-txt

25 April, 2016

Replacing '$' with ".*$" is not correct.
The dollar sign ($) matches the end of the string.
For example, to block URLs that end with .asp:
Disallow: /*.asp$

4 November, 2014

Your way of building the regex rules is incorrect: you use preg_quote, which adds slashes to * and $.
I've tried to fix it:

$rule = addcslashes(trim($regs[1]), "/\+?[^](){}=!<>|:-");
$rule = str_replace("*", ".*", $rule);
if ($rule[mb_strlen($rule)-1] == '$')
$rule = rtrim($rule, '$') . ".*$";
else
$rule .= ".*";
Please reply to me by email if I'm wrong.

P.S. Sorry for my English, I'm Russian.

7 September, 2014

Hi. Thanks for the Parse-Robots article.

Please note that this parsing code does not work with User-agent groups, i.e. several User-agent lines sharing the same block of Disallow rules:

User-agent: Googlebot
User-agent: bingbot
Disallow: /private/

This is "standard" for robots.txt. See www.robotstxt.org/orig.html#format or developers.google.com/webmasters/control-crawl-index/docs/robots_txt

9 September, 2013

Great script, thanks!

I ran into some trouble when there wasn't a / on the end of a URL.

So, for example, www.test.net/css was returning true when /css/ was actually forbidden in robots.txt.

This code fixed it...

//if the path part (/css) doesn't end in /
if(substr($parsed['path'], -1) != "/"){
//whack a / on the end
$parsed['path'] = $parsed['path'] . "/";
};

I'm sure it's not bulletproof, but it worked for my tiny example.
Great code.

4 March, 2013

I found an important error, and Wikipedia is affected by it: check out de.wikipedia.org/robots.txt, where line 11 uses '*' to apply the rules to all Google-Ads bots, but your script recognises this as a 'User-Agent: *' line.
Nice script anyway,
Greets

Interesting. I can't find any evidence that "Mediapartners-Google*" is actually a valid entry in robots.txt for the "User-agent" line.

The original robots.txt protocol recommends a "case insensitive substring match of the name without version information", so the asterisk serves no purpose. The valid use of '*' is to match all user agents.

Trying to block "Mediapartners-Google" is in any case pointless, as that user agent only visits websites that display AdSense ads, and those sites are required to give it access.

For the robots.txt parser to cater for this I would either strip out the '*' when it follows or precedes other characters after User-agent, or ignore those rules completely.
