PHP: Parsing HTML files with DOMDocument and DOMXpath


The DOMDocument PHP class allows us to take an HTML file or HTML text input and convert it into an object that can be easily traversed and queried similar to the way things are done in JavaScript.

Sample input

For the following examples we're working with a text input imported using the loadHTML() method, but you can just as easily import a local or remote HTML file using loadHTMLFile() instead.
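As a minimal sketch of the file-based variant, the snippet below writes a small HTML fragment to a temporary file and reads it back with loadHTMLFile(). The temporary file and its contents are assumptions made only to keep the example self-contained; in practice you would pass your own path.

```php
<?php
// Hypothetical input: a throwaway file standing in for a real HTML page.
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents(
  $path,
  '<html><body><h1>H1 Heading</h1><p><a href="intro-link1.html">link1</a></p></body></html>'
);

$doc = new \DOMDocument();
$loaded = $doc->loadHTMLFile($path); // true on success, false on failure

// Once loaded, the document is queried exactly as with loadHTML().
$heading = $doc->getElementsByTagName('h1')->item(0)->nodeValue;

unlink($path); // remove the temporary file
```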

The HTML is as follows. We're aiming to extract the links and text from just the H2 elements inside the .blogArticle sections of the page (the four article title links below) and ignore all other links:

<?PHP
  $htmlinput = <<<EOT

<a href="#content">skip to content</a>

<div id="content">

<h1>H1 Heading</h1>

<p>Introductory text <a href="intro-link1.html">link1</a> and <a href="intro-link2.html">link2</a>.</p>

<div class="blogArticle">
<h2><a href="article1.html">Article #1 Title</a></h2>
<p>Introductory text ... <a href="article1.html">more &raquo;</a></p>
</div>

<a href="#top">Top</a>

<div class="blogArticle">
<h2><a href="article2.html">Article #2 Title</a></h2>
<p>Introductory text ... <a href="article2.html">more &raquo;</a></p>
</div>

<a href="#top">Top</a>

<div class="blogArticle">
<h2><a href="article3.html">Article #3 Title</a></h2>
<p>Introductory text ... <a href="article3.html">more &raquo;</a></p>
</div>

<a href="#top">Top</a>

<div class="blogArticle">
<h2><a href="article4.html">Article #4 Title</a></h2>
<p>Introductory text ... <a href="article4.html">more &raquo;</a></p>
</div>

<a href="#top">Top</a>

<p>Footer text <a href="footer-link.html">link</a>.</p>

</div>

<p><a href="copyright.html">Copyright &copy; 2014</a></p>

EOT;
?>

This task would be trivial using regular expressions, but in more complicated situations the DOM approach has certain advantages.

Finding all links in the document

To find and extract all links from an HTML document we use the getElementsByTagName method, which will be familiar from JavaScript:

<?PHP
  $doc = new \DOMDocument();
  $doc->loadHTML($htmlinput);

  $links = [];

  // all links in document

  $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object

  foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }
?>

In this case all 17 links in the HTML are returned.

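You'll notice that we've prefixed DOMDocument, and later DOMXpath, with a \. This makes the code work inside a PHP namespace, where an unprefixed class name would be looked up in the current namespace rather than globally. The alternative is to import the classes with a use statement.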
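A short sketch of the import-based alternative follows. The namespace name App\Scraper and the one-line HTML input are hypothetical and chosen only for illustration.

```php
<?php
namespace App\Scraper; // hypothetical namespace, for illustration only

use DOMDocument;
use DOMXPath;

// With the imports in place, no leading backslash is needed.
$doc = new DOMDocument();
$doc->loadHTML('<p><a href="a.html">A</a></p>');

$xpath = new DOMXPath($doc);
$count = $xpath->query('//a')->length; // number of links found
```

Which style to use is a matter of taste: the backslash prefix is convenient for a one-off call, while imports keep longer files tidier.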

A slight improvement is to identify a containing element, in this case #content, and restrict the search to it using the getElementById method, which also works like its JavaScript counterpart:

<?PHP
  $doc = new \DOMDocument();
  $doc->loadHTML($htmlinput);

  $links = [];

  // all links in #content

  $container = $doc->getElementById("content");
  $arr = $container->getElementsByTagName("a");

  foreach($arr as $item) {
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }
?>

This now excludes any links outside of the #content container, leaving us with 15 links.

getElementsByClassName equivalent

There is no actual getElementsByClassName (yet) in DOMDocument, but the same results can be produced using DOMXpath as follows:

<?PHP
  $doc = new \DOMDocument();
  $doc->loadHTML($htmlinput);

  $xpath = new \DOMXpath($doc);
  $articles = $xpath->query('//div[@class="blogArticle"]');

  $links = [];

  // all links in .blogArticle

  foreach($articles as $container) {

    $arr = $container->getElementsByTagName("a");

    foreach($arr as $item) {
      $href =  $item->getAttribute("href");
      $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
      $links[] = [
        'href' => $href,
        'text' => $text
      ];
    }

  }
?>

Whereas in the previous example we searched for links in #content - a single element - we're now searching for links within multiple .blogArticle sections of the page.

The most complicated element here is the DOMXpath query //div[@class="blogArticle"], which targets all DIV elements whose class attribute is exactly blogArticle. An element with several class names, such as class="blogArticle featured", will not match, so where there are multiple or similar class names the query will need refining.
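One common refinement, sketched below, is to pad the class attribute with spaces and test for the class name as a whole word using the XPath 1.0 functions concat(), normalize-space() and contains(). The markup is hypothetical and exists only to show the difference between the two queries.

```php
<?php
// Hypothetical markup: the first div has an extra class,
// the third has a similar-looking but different class.
$html = '<div class="blogArticle featured"><h2>One</h2></div>'
      . '<div class="blogArticle"><h2>Two</h2></div>'
      . '<div class="blogArticleSummary"><h2>Three</h2></div>';

$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);

// Exact comparison: matches only class="blogArticle".
$exact = $xpath->query('//div[@class="blogArticle"]');

// Whole-word match: surround the normalised class list with spaces,
// then look for " blogArticle " inside it.
$token = $xpath->query(
  '//div[contains(concat(" ", normalize-space(@class), " "), " blogArticle ")]'
);
```

Here the exact query finds only the second DIV, while the whole-word query finds the first two and still skips blogArticleSummary.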

When making DOMXpath queries within another element, start the query string with .// and pass the container node as the second argument. For example:

$xpath->query('.//div[@class="post-details"]', $container);
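To illustrate why the leading dot matters, here is a small sketch using hypothetical markup in which each article has its own post-details block. It compares a scoped query with an unscoped one run against the same container.

```php
<?php
// Hypothetical markup: two articles, each with its own post-details block.
$html = '<div class="blogArticle"><div class="post-details">by Alice</div></div>'
      . '<div class="blogArticle"><div class="post-details">by Bob</div></div>';

$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);

$details = [];
foreach ($xpath->query('//div[@class="blogArticle"]') as $container) {
  // ".//" searches only the descendants of $container.
  foreach ($xpath->query('.//div[@class="post-details"]', $container) as $node) {
    $details[] = $node->nodeValue;
  }
}

// Compare scoped and unscoped queries against the first article only.
$first    = $xpath->query('//div[@class="blogArticle"]')->item(0);
$scoped   = $xpath->query('.//div[@class="post-details"]', $first)->length;
$unscoped = $xpath->query('//div[@class="post-details"]', $first)->length;
```

Without the dot, the path is absolute and starts from the document root, so the unscoped query counts both post-details blocks even though a container was passed.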

The final step

Now we need to single out just the links having an H2 as their parent:

<?PHP
  $doc = new \DOMDocument();
  $doc->loadHTML($htmlinput);

  $xpath = new \DOMXpath($doc);
  $articles = $xpath->query('//div[@class="blogArticle"]');

  $links = [];

  // all links in h2's in .blogArticle

  foreach($articles as $container) {

    $arr = $container->getElementsByTagName("a");

    foreach($arr as $item) {
      if($item->parentNode->tagName == "h2") {
        $href =  $item->getAttribute("href");
        $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
        $links[] = [
          'href' => $href,
          'text' => $text
        ];
      }
    }

  }
?>

Finally, the result we're after. The $links array now contains just four links, one for each of the four article headings in the input HTML:

Array
(
    [0] => Array
        (
            [href] => article1.html
            [text] => Article #1 Title
        )

    [1] => Array
        (
            [href] => article2.html
            [text] => Article #2 Title
        )

    [2] => Array
        (
            [href] => article3.html
            [text] => Article #3 Title
        )

    [3] => Array
        (
            [href] => article4.html
            [text] => Article #4 Title
        )

)
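As an alternative sketch, the parent check can be folded into the XPath expression itself, so that a single query returns only the heading links. The input below is a trimmed-down, hypothetical version of the sample HTML, so it yields two links rather than four.

```php
<?php
// Trimmed-down, hypothetical version of the sample input.
$html = '<div id="content">'
      . '<p><a href="intro-link1.html">link1</a></p>'
      . '<div class="blogArticle"><h2><a href="article1.html">Article #1 Title</a></h2>'
      . '<p>Introductory text ... <a href="article1.html">more</a></p></div>'
      . '<div class="blogArticle"><h2><a href="article2.html">Article #2 Title</a></h2></div>'
      . '</div>';

$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);

$links = [];

// "//h2/a" inside the container selects only <a> elements
// whose direct parent is an <h2> within a .blogArticle div.
foreach ($xpath->query('//div[@class="blogArticle"]//h2/a') as $item) {
  $links[] = [
    'href' => $item->getAttribute('href'),
    'text' => trim($item->nodeValue),
  ];
}
```

The nested-loop version is easier to extend with extra PHP logic per article, whereas the single query is shorter when the selection can be expressed entirely in XPath.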

An identical approach can be used to find images in HTML - searching for the IMG tag name and using getAttribute to extract the SRC and other attributes.
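A minimal sketch of the image variant, using hypothetical markup with two IMG elements:

```php
<?php
// Hypothetical markup containing two images.
$html = '<p><img src="photo1.jpg" alt="First photo"> '
      . '<img src="photo2.png" alt="Second photo"></p>';

$doc = new \DOMDocument();
$doc->loadHTML($html);

$images = [];

// Same pattern as for links: search by tag name, then read attributes.
foreach ($doc->getElementsByTagName("img") as $item) {
  $images[] = [
    'src' => $item->getAttribute("src"),
    'alt' => $item->getAttribute("alt")
  ];
}
```

Note that getAttribute returns an empty string when an attribute is missing, so images without an ALT attribute simply produce an empty 'alt' value.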

If you're planning to use this code to spider websites, you should also read our related article on reading and obeying robots.txt.

References

PHP.net: The DOMDocument class
PHP.net: The DOMXPath class


User Comments


JS 17 October, 2016

You should edit your tutorial because right now none of your examples are working

The examples are working, which you can see from the output displayed as it's generated in real-time. If you're getting errors you should check that you're inputting valid HTML.

Andre 23 January, 2015

HI,

Doesnt work anymore: PHP Fatal error: Call to undefined method DOMAttr::getAttribute()

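That error means getAttribute() was called on an attribute node (DOMAttr) rather than an element, which typically happens when a DOMXpath query selects attributes, for example //a/@href, instead of elements. If instead the DOM classes themselves cannot be found, you may be missing a PHP package such as php-xml containing the DOM function library.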