PHP: Parsing HTML to find Links


From blogging to log analysis and search engine optimisation (SEO) people are looking for scripts that can parse web pages and RSS feeds from other websites - to see where their traffic is coming from among other things.

Parsing your own HTML should be no problem - assuming that you use consistent formatting - but once you set your sights on parsing other people's HTML the frustration really sets in. This page presents some regular expressions and a commentary that will hopefully point you in the right direction.

Simplest Case

Let's start with the simplest case - a well formatted link with no extra attributes:

/<a href=\"([^\"]*)\">(.*)<\/a>/iU

This, believe it or not, is a very simple regular expression (or "regexp" for short). It can be broken down as follows:

starts with: <a href="
a series of characters up to, but not including, the next double-quote (") - 1st capture
the string: ">
a series of any characters - 2nd capture
ends with: </a>

We're also using two 'pattern modifiers':

i - matches are 'caseless' (upper or lower case doesn't matter)
U - matches are 'ungreedy'

The first modifier means that we're matching <A> as well as <a>. The 'ungreedy' modifier is necessary because otherwise the second captured string could (by being 'greedy') extend from the contents of one link all the way to the end of another link.
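As a rough illustration of the two modifiers, here is a minimal sketch run against a made-up HTML string (the sample markup and variable names are assumptions for this example only). It compares the expression with and without the U modifier:

```php
<?php
// Made-up sample HTML: two simple links, the second in upper case.
$html = '<a href="/one.html">One</a> and <A HREF="/two.html">Two</A>';

// With U (ungreedy) each match stops at the first closing </a>.
preg_match_all('/<a href="([^"]*)">(.*)<\/a>/iU', $html, $ungreedy);

// Without U the second capture is greedy and runs on to the last </a>.
preg_match_all('/<a href="([^"]*)">(.*)<\/a>/i', $html, $greedy);

print_r($ungreedy[1]); // link addresses, one per link
print_r($ungreedy[2]); // link text, one per link
print_r($greedy[2]);   // a single capture spanning both links
```

Because of the i modifier the upper-case tag is matched as well, and without U the two links collapse into one match.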

One shortcoming of this regexp is that it won't match link tags that include a line break - fortunately there's a modifier for this as well:

/<a\shref=\"([^\"]*)\">(.*)<\/a>/siU

Now the '.' character will match any character including line breaks. We've also changed the first space to a 'whitespace' character type so that it can match a space, tab or line break. It's necessary to have some kind of whitespace in that position so we don't match other tags such as <area>.
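To see the effect of the s modifier, the following sketch applies both versions of the expression to an invented link that contains line breaks (the sample string is an assumption, not taken from a real page):

```php
<?php
// Made-up sample: the tag and its link text both contain line breaks.
$html = "<a\nhref=\"/page.html\">Line one\nLine two</a>";

// Without s the '.' cannot cross the line break in the link text.
$without_s = preg_match('/<a\shref="([^"]*)">(.*)<\/a>/iU', $html, $m1);

// With s the '.' matches line breaks as well.
$with_s = preg_match('/<a\shref="([^"]*)">(.*)<\/a>/siU', $html, $m2);

var_dump($without_s); // int(0) - no match
var_dump($with_s);    // int(1) - match
var_dump($m2[2]);     // the link text, line break included
```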

For more information on pattern modifiers see the link at the bottom of this page.

Room for Extra Attributes

Most link tags contain a lot more than just an href attribute. Other common attributes include: rel, target and title. They can appear before or after the href attribute:

/<a\s[^>]*href=\"([^\"]*)\"[^>]*>(.*)<\/a>/siU

We've added extra patterns before and after the href attribute. They will match any series of characters NOT containing the > symbol. It's always better when writing regular expressions to specify exactly which characters are allowed and not allowed - rather than using the wildcard ('.') character.
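A short sketch of this version at work, using invented markup with attributes on either side of href and an <area> tag that should be ignored:

```php
<?php
// Made-up sample: extra attributes before and after href, plus an <area> tag.
$html = '<a class="nav" href="/about.html" title="About us">About</a>'
      . '<area href="/map.html">'
      . '<a href="/contact.html" target="_blank">Contact</a>';

preg_match_all('/<a\s[^>]*href="([^"]*)"[^>]*>(.*)<\/a>/siU', $html, $matches);

print_r($matches[1]); // addresses of the two <a> links only
print_r($matches[2]); // their link text
```

The <area> tag is skipped because the expression insists on whitespace directly after `<a`.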

Allow for Missing Quotes

Up to now we've assumed that the link address is going to be enclosed in double-quotes. Unfortunately there's nothing enforcing this so a lot of people simply leave them out. The problem is that we were relying on the quotes to be there to indicate where the address starts and ends. Without the quotes we have a problem.

It would be simple enough (even trivial) to write a second regexp, but where's the fun in that when we can do it all with one:

/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

What can I say? Regular expressions are a lot of fun to work with but when it takes a half-hour to work out where to put an extra ? you really know you're in deep.

Firstly, what's with those extra ?'s?

Because we used the U modifier, all patterns in the regexp default to 'ungreedy'. Adding an extra ? after a ? or * reverses that behaviour back to 'greedy' but just for the preceding pattern. Without this, for reasons that are difficult to explain, the expression fails. Basically anything following href= is lumped into the [^>]* expression.

We've added an extra capture to the regexp that matches a double-quote if it's there: (\"??). There is then a backreference \\1 that matches the closing double-quote - if there was an opening one.

To cater for links without quotes, the pattern to match the link address itself has been changed from [^\"]* to [^\" >]*?. That means that the link can be terminated by not just a double-quote (the previous behaviour) but also a space or > symbol.

This means that links with addresses containing unescaped spaces will no longer be captured!
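The behaviour described in this section can be checked with a small sketch. The three links below are invented for the example: one without quotes, one with quotes, and one whose quoted address contains a space.

```php
<?php
// Made-up sample links: unquoted, quoted, and quoted with a space.
$html = '<a href=/plain.html>Plain</a> '
      . '<a href="/quoted.html" rel="nofollow">Quoted</a> '
      . '<a href="/my file.html">Spaced</a>';

$regexp = '/<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>/siU';
preg_match_all($regexp, $html, $matches);

print_r($matches[2]); // link addresses
print_r($matches[3]); // link text
```

In this sketch the first two addresses are captured as expected, while the third link still matches but its address comes back as an empty string, because the whole quoted value is swallowed by the trailing [^>]* pattern.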

Refining the Regexp

Given the nature of the WWW there are always going to be cases where the regular expression breaks down. Small changes to the patterns can fix these.

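spaces around the `=` after href (the second \s* takes an extra ? so that, with the U modifier in force, it consumes the whitespace before the optional quote is tried; otherwise the address can come back empty):

/<a\s[^>]*href\s*=\s*?(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU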

matching only links starting with http:

/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

single quotes around the link address:

/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

And yes, all of these modifications can be used at the same time to make one super-regexp, but the result is just too painful to look at so I'll leave that as an exercise.
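For the curious, here is one possible combination, offered as a sketch rather than a definitive answer. It merges the 'spaces around the =' and 'single quotes' changes, and leaves out the http restriction so that relative links still match. Two details are assumptions of this sketch: the single quote is also excluded from the address pattern, and the \s* after the = carries an extra ? so that, under the U modifier, it greedily consumes any whitespace before the optional quote is tried.

```php
<?php
// Made-up sample links: single-quoted with spaces around '=',
// double-quoted with a leading attribute, and unquoted.
$html = "<a href = 'one.html'>One</a>"
      . '<a title="x" href="http://www.example.net/">Two</a>'
      . '<a href=three.html>Three</a>';

// Combined expression (a sketch, see the assumptions described above).
$regexp = '/<a\s[^>]*href\s*=\s*?(["\']??)([^"\' >]*?)\1[^>]*>(.*)<\/a>/siU';
preg_match_all($regexp, $html, $matches);

print_r($matches[2]); // link addresses
print_r($matches[3]); // link text
```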

Note: All of the expressions on this page have been tested to some extent, but mistakes can occur in transcribing so please report any errors you may have found when implementing these examples.

Using the Regular Expression to parse HTML

Using the default for preg_match_all, the array returned contains an array of the full matches ($matches[0]), then an array of the first 'capture', then an array of the second capture and so forth. By capture we mean patterns contained in ():

<?PHP
  // Original PHP code by Chirp Internet: www.chirpinternet.eu
  // Please acknowledge use of this code by including this header.

  $url = "http://www.example.net/somepage.html";
  $input = @file_get_contents($url) or die("Could not access file: $url");
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  if(preg_match_all("/$regexp/siU", $input, $matches)) {
    // $matches[2] = array of link addresses
    // $matches[3] = array of link text - including HTML code
  }
?>

Using PREG_SET_ORDER each link matched has its own array in the return value:

<?PHP
  // Original PHP code by Chirp Internet: www.chirpinternet.eu
  // Please acknowledge use of this code by including this header.

  $url = "http://www.example.net/somepage.html";
  $input = @file_get_contents($url) or die("Could not access file: $url");
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
      // $match[2] = link address
      // $match[3] = link text
    }
  }
?>
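The difference between the two orderings is easiest to see side by side. This sketch uses an inline, made-up HTML string instead of fetching a page, and reads the same link address out of both result arrays:

```php
<?php
// Made-up sample with two links.
$html   = '<a href="/a.html">A</a> <a href=/b.html>B</a>';
$regexp = '/<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>/siU';

preg_match_all($regexp, $html, $by_pattern);             // default (PREG_PATTERN_ORDER)
preg_match_all($regexp, $html, $by_set, PREG_SET_ORDER);

// Default: one array per capture, indexed by match number.
echo $by_pattern[2][1], "\n"; // address of the second link

// PREG_SET_ORDER: one array per match, indexed by capture number.
echo $by_set[1][2], "\n";     // the same address
```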

If you find any cases where this code falls down, let us know using the Feedback link below.

Before using this or similar scripts to fetch pages from other websites, we suggest you read through the related article on setting a user agent and parsing robots.txt.

First checking robots.txt

As mentioned above, before using a script to download files you should always check the robots.txt file. Here we're making use of the robots_allowed function from the article linked above to determine whether we're allowed to access files:

<?PHP
  // Original PHP code by Chirp Internet: www.chirpinternet.eu
  // Please acknowledge use of this code by including this header.

  ini_set('user_agent', 'NameOfAgent (http://www.example.net)');

  $url = "http://www.example.net/somepage.html";
  if(robots_allowed($url, "NameOfAgent")) {
    $input = @file_get_contents($url) or die("Could not access file: $url");
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
      foreach($matches as $match) {
        // $match[2] = link address
        // $match[3] = link text
      }
    }
  } else {
    die('Access denied by robots.txt');
  }
?>

Now you're well on the way to building a professional web spider. If you're going to use this in practice you might want to look at: caching the robots.txt file so that it's not downloaded every time (a la Slurp); checking the server headers and server response codes; and adding a pause between multiple requests - for starters.
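Two of those suggestions, checking the response code and pausing between requests, might be sketched as follows. The helper name last_status_code is hypothetical (it is not part of the article's code or of PHP), the one-second pause is an arbitrary choice, and the sketch assumes every address uses HTTP so that PHP populates $http_response_header:

```php
<?php
// Hypothetical helper: return the last HTTP status code found in the
// header lines that PHP stores in $http_response_header.
function last_status_code(array $headers)
{
  $code = 0;
  foreach($headers as $header) {
    // Status lines look like "HTTP/1.1 200 OK"; redirects add more than one.
    if(preg_match('/^HTTP\/\S+\s+(\d{3})/', $header, $m)) {
      $code = (int)$m[1];
    }
  }
  return $code;
}

$urls = array(); // fill with the HTTP addresses you intend to fetch

foreach($urls as $url) {
  $input = @file_get_contents($url);
  if($input !== false && last_status_code($http_response_header) == 200) {
    // ... run preg_match_all() on $input here ...
  }
  sleep(1); // pause between requests, whether or not the fetch succeeded
}
```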


References

PHP.net: PCRE: Pattern Modifiers

PHP Parsing HTML files with DOMDocument and DOMXpath
PHP Parsing HTML to find Links
PHP Listing files in a ZIP archive
PHP Parsing robots.txt
PHP Stripping invalid Unicode for pdfTeX


User Comments


Full-R 14 November, 2020

Thank you! I compare all RegExps and now it works fine.

preg_match_all("/<a\s[^>]*href\s*=\s*([\"\']??)([^\"\' >]*?)\\1[^>]*>(.*)<\/a>/siU", $html, $prelinks, PREG_SET_ORDER);

Alex 2 February, 2017

How would you do this but for images? For example <img src="file.png"> but take into consideration all of the variations such as in this example? It doesn't need an end tag so it may end with > or />

To find images with that amount of variation in the HTML you're better off using DOMDocument and DOMXpath.
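A minimal sketch of that approach, assuming PHP's DOM extension is available; the sample markup and variable names are made up for illustration:

```php
<?php
// Made-up sample: double-quoted, single-quoted/self-closing and unquoted src.
$html = '<p><img src="file.png"><IMG alt="x" SRC=\'two.gif\' /><img src=three.jpg></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about sloppy HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$sources = array();
foreach($xpath->query('//img[@src]') as $img) {
  $sources[] = $img->getAttribute('src');
}

print_r($sources); // one src value per image
```

The parser normalises tag and attribute names to lower case, so one XPath query covers all three variations.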

Nattsurfaren 10 May, 2016

Thanks 'Martijn van der Lee'
You fixed my problem.

Martijn van der Lee 11 February, 2014

The regex for matching both double and single quotes does not properly match any double-quotes within the URL.

<a href='bl"ah'>foo</a> will incorrectly match 'bl
(yes, it's a dumb URL, but dumb people make dumb URL's)

The problem is the [^\" >] bit, it should backreference as such: [^\\1 >].

The complete regex should look like: /<a\s[^>]*href=([\"\']??)([^\\1 >]*?)\\1[^>]*>(.*)<\/a>/siU

dj-phaser 29 July, 2013

Hello, nice work. But I have a problem. The regular expression does not match the type of links: <a href="www.neco.cz/userfiles/Slevový coupon KH.pdf"> here </ a>. Can you please help regular expression modified to accept spaces in the link? Thanks.

In the section above "Allow for Missing Quotes" the changes to the regular expression mean that links with spaces aren't matched. If you leave out those specific changes then it will work.

Cary 24 January, 2012

If there is multi-line formatting within the tag the regex will not match.

<a href="example.com"> {newline}
{tabbed indent} <img src="example.com/image.png"/>

Two changes were necessary for this:

1. preg_match_all("/$regexp/simU", $html, $matches, PREG_SET_ORDER) - add multi-line matching (m)
2. >\s*(.*)\s*<\/a> - use the regex to strip out the newlines and white space before capturing

Ron 12 May, 2011

Sometimes, every bit is critical, and you can't afford the extra RAM and CPU. Even if it's just a small change. For example changing the file_get_contents to CURL, will save you on one of our servers typically 0.04-6 seconds, and use a bit less resource. Now, when checking 2000 pages for a waiting client, that adds up to about 80 seconds faster. That one step though tiny in some circumstances, makes a significant difference.

Having DOM v. regex might offer a similar resource time saving in heavy load environments. It's worthwhile knowing your regex in those situations.

Both very good points

Sean 14 December, 2010

hi there,

does anyone know how i would extract the title value from the link also please ?

what changes would i need to make to :

$regexp = "/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU";

Many thanks !

It's not quite that simple. You don't know whether the title attribute is going to appear before or after the href so it can't be done in a single regular expression. You would have to apply a second regexp on the $matches array (first element) to detect and extract the title text if it's present.
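A rough sketch of that two-step approach follows. The sample links are invented, and the second expression is an assumption of this example: it only looks inside the opening tag and only handles a double-quoted title attribute.

```php
<?php
// Made-up sample: one link with a title before href, one without a title.
$html = '<a title="Home page" href="http://www.example.net/">Home</a>'
      . '<a href="http://www.example.net/faq" rel="help">FAQ</a>';

$regexp = '/<a\s[^>]*href=("??)(http[^" >]*?)\1[^>]*>(.*)<\/a>/siU';

$links = array();
if(preg_match_all($regexp, $html, $matches, PREG_SET_ORDER)) {
  foreach($matches as $match) {
    // Second regexp, applied to the full match ($match[0]): look for a
    // double-quoted title attribute within the opening <a ...> tag.
    $found = preg_match('/^<a[^>]*\stitle="([^"]*)"/i', $match[0], $t);
    $links[] = array(
      'href'  => $match[2],
      'title' => $found ? $t[1] : null, // null when no title is present
      'text'  => $match[3],
    );
  }
}

print_r($links);
```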

mark 1959 24 November, 2010

Have to agree with the guy above. Regex may be quicker - moot point really though - but using the dom is much easier and more intuitive... especially if you find regex like a foreign language.

Khairil 30 September, 2010

Your regex is great.. however I need to get other attributes of links like class=, id= or rel=

Google HTML Parser 29 September, 2010

I think that PHP regular expressions are going to be faster than using the DOM structure.

Anyway, I found that using DOM is better and more reliable for finding all the URLs in the Google HTML. Besides, I use this for big projects and the CPU needed to parse these pages is very low, so in most cases it isn't a problem.

Lauri Raittila 4 June, 2009

Surely regexp is faster. At least when you leave holes in it. The thing with classes etc for this is that you don't need to rewrite html parser, which is not a simple thing to do. Much better to learn to use something that is already tested
This won't work with your regexp:
<a href='example.com'>Link</a>
nor this
<a href=example
>Link</a>
(note line break instead space)
<a href="example.com" title=">">

There are probably many others as well.

Actually, the regexp presented here does work with those links, if you properly escape the title attribute. Just use the modification for 'single quotes around the link address' shown above. The DOM functions might do better in some extreme cases, but they won't be faster than regular expressions.

Kevin Waterson 15 May, 2009

Parsing HTML with regex is riddled with gotchas and the look aheads and look behinds to accomplish this make it very slow. In PHP, this is better accomplished by using the built-in DOM class.

I'm curious as to whether you've done any testing on this? Perl regular expressions are pretty fast and the DOM class would have to use something similar internally so I'd be surprised if it was any quicker...

Arek 12 May, 2009

Great article. Works fine for me
thx

Johnathan 6 August, 2008

If the link href contains a space, it gets loaded into the matches[2] array as a null element.

It's not possible to have a single regexp that allows for both the case where there are no quotes around the href and the case where the href can contain spaces. If it's your website any spaces in the href should be encoded using + or %20 to avoid this problem.