Parse Web Pages with PHP Simple HTML DOM Parser

By David Walsh on June 15, 2011

Those of you who have had the pleasure of following me on Twitter (...) probably know that I'm a complete soccer (football) fanatic. I even started a separate Twitter account to voice my footy musings. If you follow football yourself, you'll know that we've just started the international transfer window and there are a billion rumors about a billion players going to a billion clubs. It's enough to drive you mad, but I simply HAVE TO KNOW who will be in the Arsenal and Liverpool first teams next season.

The problem I run into, besides all of the rubbish reports making waves, is that I don't have time to check every website on the hour. Twitter is a big help, but there's nothing better during this time than an official report from each club's website. To keep an eye on those reports, I'm using the power of PHP Simple HTML DOM Parser to write a tiny PHP script that shoots me an email whenever a specific page is updated.

PHP Simple HTML DOM Parser

PHP Simple HTML DOM Parser is a dream utility for developers who work with both PHP and the DOM, because it lets you find DOM elements with CSS-style selectors straight from PHP. Here are a few sample uses of PHP Simple HTML DOM Parser:

// Include the library
include('simple_html_dom.php');
 
// Retrieve the DOM from a given URL
$html = file_get_html('https://davidwalsh.name/');

// Find all "A" tags and print their HREFs
foreach($html->find('a') as $e) 
    echo $e->href . '<br>';

// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
    echo $e->src . '<br>';

// Find all images, print their text with the "<>" included
foreach($html->find('img') as $e)
    echo $e->outertext . '<br>';

// Find the DIV tag with an id of "myId"
foreach($html->find('div#myId') as $e)
    echo $e->innertext . '<br>';

// Find all SPAN tags that have a class of "myClass"
foreach($html->find('span.myClass') as $e)
    echo $e->outertext . '<br>';

// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
    echo $e->innertext . '<br>';
    
// Extract the plain text from the second matching cell (the index is zero-based)
echo $html->find('td[align="center"]', 1)->plaintext.'<br><hr>';

Like I said earlier, this library is a dream for finding elements, just as the early JavaScript frameworks and selector engines were. Armed with the ability to pick content from DOM nodes with PHP, it's time to analyze websites for changes.

The Script

The following script checks two websites for changes:

// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");

// Settings on top
$sitesToCheck = array(
	// "selector" is the CSS selector for the block of content to watch
	array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
	array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
);
$savePath = "cachedPages/";
$emailContent = "";

// For every page to check...
foreach($sitesToCheck as $site) {
	$url = $site["url"];
	
	// Calculate the cachedPage name, reset oldContent and currentContent
	$fileName = md5($url);
	$oldContent = "";
	$currentContent = "";
	
	// Get the URL's current page content; skip this site if the fetch fails
	$html = file_get_html($url);
	if(!$html) {
		continue;
	}
	
	// Find content by querying with a selector, just like a selector engine!
	foreach($html->find($site["selector"]) as $element) {
		$currentContent = $element->plaintext;
	}
	
	// If a cached file exists
	if(file_exists($savePath.$fileName)) {
		// Retrieve the old content
		$oldContent = file_get_contents($savePath.$fileName);
	}
	
	// If different, notify!
	if($oldContent && $currentContent != $oldContent) {
		// Here's where we can do a whoooooooooooooole lotta stuff
		// We could tweet to an address
		// We can send a simple email
		// We can text ourselves
		
		// Build simple email content
		// Append (rather than overwrite) so every changed page makes it into the digest
		$emailContent .= "David, the following page has changed!\n\n".$url."\n\n";
	}
	
	// Save new content
	file_put_contents($savePath.$fileName,$currentContent);
}

// Send the email if there's content!
if($emailContent) {
	// Sendmail!
	mail("david@davidwalsh.name","Sites Have Changed!",$emailContent,"From: alerts@davidwalsh.name");
	// Debug
	echo $emailContent;
}

The code and comments are self-explanatory. I've set the script up so that I get one "digest" alert if several of the pages change. The script is the hard part -- to run it, I've set up a CRON job that fires every 20 minutes.
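The compare-and-cache step inside the loop can be tried on its own, without fetching a live page. Here is a minimal sketch, assuming a hypothetical helper named contentChanged() and a throwaway cache file in the system temp directory (neither is part of the script above). It also shows that the very first run never alerts, because there is nothing cached to compare against yet.

```php
<?php
// Hypothetical helper (not part of the script above): returns true only
// when a cached copy already exists and differs from the current content.
function contentChanged($cacheFile, $currentContent) {
    $oldContent = file_exists($cacheFile) ? file_get_contents($cacheFile) : "";

    // Always refresh the cache so the next run compares against this run
    file_put_contents($cacheFile, $currentContent);

    return $oldContent !== "" && $oldContent !== $currentContent;
}

// Throwaway cache file; the URL is only used to build a file name
$cacheFile = sys_get_temp_dir() . "/" . md5("http://example.com/news") . ".cache";
@unlink($cacheFile);

$first  = contentChanged($cacheFile, "Squad: A, B");     // nothing cached yet
$second = contentChanged($cacheFile, "Squad: A, B");     // same content as last run
$third  = contentChanged($cacheFile, "Squad: A, B, C");  // content differs

var_dump($first, $second, $third);

// Clean up the throwaway file
unlink($cacheFile);
```

One small difference from the script: this sketch treats only an empty string as "no cache", whereas the script's truthiness check on $oldContent would also skip a cached value of "0".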

This solution isn't specific to just spying on footy -- you could use this type of script on any number of sites. This script, however, is a bit simplistic for some cases. If you wanted to spy on a website that had extremely dynamic code (e.g. a timestamp in the markup), you would want to create a regular expression that isolates the content to just the block you're looking for. Since each website is constructed differently, I'll leave it up to you to create page-specific isolators. Have fun spying on websites though...and be sure to let me know if you hear a good, reliable footy rumor!
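As a rough idea of what a page-specific isolator might look like, here is a minimal sketch that strips volatile fragments before the comparison. The function name normalizeContent() and the "Last updated" pattern are assumptions made up for illustration; a real site will need its own patterns.

```php
<?php
// Hypothetical helper: remove fragments that change on every request
// so that only meaningful differences trigger an alert.
function normalizeContent($content, $volatilePatterns) {
    foreach ($volatilePatterns as $pattern) {
        $content = preg_replace($pattern, '', $content);
    }
    // Collapse runs of whitespace so layout-only changes are ignored too
    return trim(preg_replace('/\s+/', ' ', $content));
}

// Assumed pattern for illustration, e.g. "Last updated: 14:05"
$volatilePatterns = array(
    '/Last updated: \d{2}:\d{2}/',
);

// Two fetches of the same page that differ only in the timestamp
$before = normalizeContent("Squad list  Last updated: 14:05\nFabregas", $volatilePatterns);
$after  = normalizeContent("Squad list  Last updated: 14:25\nFabregas", $volatilePatterns);

echo ($before === $after) ? "unchanged\n" : "changed\n";
```

In the script, the same call could be applied to $currentContent just before it is compared with $oldContent, so the cached copy and the fresh copy are normalized the same way.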

Discussion

Kecs
Dear David,

It’s cool, but I think you could do it a better way!
If the target website has an advertising system, you will get different source code every time you download it, because the advert image or string is different…

For this reason, I would suggest Simple HTML DOM or another server-side object that can reach the real content of webpages by ID! If you know the ID of the content div, you can exclude the ads and other unwanted elements, for example the latest forum posts :)
And you could compare the real contents of webpages :)

Daniel

David Walsh
Outstanding suggestion Kecs! I’ll be making a major update to this website shortly.

Cheers!

noop
You’ve got two ending semi-colons in your example:

$currentContent = $element->plaintext;;

Thanks for the script. Very useful. :)
flies
Keep in mind that it’s very easy to cause a memory leak in this script. I had this problem a few months back. Memory usage grows rapidly if you reuse your $html / $element variable. Take the $html variable, for example.

If you use it like this:

foreach($files as $file) {
    $html = str_get_html($file);
    // do something bla bla bla
}

you will get a huge memory leak very fast.

You must always remember to clear your variable, like this:

foreach($files as $file) {
    $html = str_get_html($file);
    // do something bla bla bla
    $html->clear();
}
Lenny
I use this library extensively. I’m thinking of porting it to C#. The brother to this in C# uses Xpath.
tolgahan
Good share david you following facebook and twitter
janbomber
GER – VIN – HO … la la la lala

David Walsh
:D

Greg
I had no idea about this. Can’t wait to give it a try.

Thanks David!
Ryan
David: this is awesome! I had no idea this PHP library existed; it’ll make my current project (scraping microformats out of Lanyrd) much easier. Thanks for the easy-to-follow code!
BlaineSch
Being a JavaScript guy, I figured you’d go for something like phpQuery for parsing the HTML.

http://code.google.com/p/phpquery/

David Walsh
Wow, that looks interesting. I’ll have a look at that too!

Adam Meyer
This will be very nice next time I need to write a spider.
subtain
can any one tell me that how to parse the select option value….please write code for this
michel j
Hi, great website and examples!

I`ve been puzzling for over a week to somehow get data from a Google website into my online database table (mySQL). Here`s the thing:

The following website contains a table with historical stock data of AEGON:
http://www.google.com/finance/historical?q=AMS:AGN&start=0&num=160

I want to somehow download this table into a MYSQL table. It must get the same headers, so that would be: Date, Open, High, Low, Close, Volume. The PHP code should import all 160 lines of data.

I tried numerous php codes from the web, but always something was wrong.

any tips you have for me ;-)
THANKS!
Regards,
Michel

Awais
Hi Michel,

Recently i was tried the same, did you succeed? i want to save the Google result links into my database.

subtain
hi..Michel..yes you can pic all the data from this website but it is little bit difficult ….i can do this work for you.you can contact me on my email: subtain.fastian07@gmail.com
I hope you will be happy.
subtain
hi Michel me subtain for this you have to crawl the url of this website. and made a database of the same fields and then you have to write code for this to get all the data from this page to into your database and then you can use it for your piece of work.
Regards:Subtain Ali Tariq
subtain.fastian07@gmail.com
Daniel
The HTML DOM Parser doesn’t work on some sites, for example livingsocial[dot]com or Facebook.

Is there any solution to this? I’m new to this, but I think it has something to do with the browser name (no browser name is sent when fetching the website, so the website blocks the Simple HTML DOM parser).

Sorry for my English
Daniel
Florin Nichifiriuc
@Daniel,
Please try BlaineSch suggestion http://code.google.com/p/phpquery/ since that can emulate the javascript as well.
saijin
On:
$url = 'http://www.google.com/search?hl=en&q=php&btnG=Search';
// Create DOM from URL
$html = file_get_html($url);
// Match all 'A' tags that have the class attribute equal to 'l'
foreach($html->find('a[class=l]') as $key => $info) {
    echo ($key + 1).'. '.$info->plaintext."\n";
}

Can someone give me a hint on how can I output the data on each table cell.
Example:

Data 1 Data 2 Data 3 Data 4

Thanks in advance.
NIck
Thank you for sharing! I’ve been looking for something that functions like innertext. It’s not documented on any of the sites that I’ve visited including the homesite.
Spencer Williams
Thank you for sharing! I’ve been looking for something that features like innertext. It’s not recorded on any of the websites that I’ve frequented such as the homesite.
Darius
Awesome library, I made parsing demo http://dari.us.lt/demo/parser/ from page http://www.skelbikas.lt/ , really easy, thanks :)

Omer Rosenbaum
Would you be able to share the script?

Jones
Just though you might like to know this page ( http://davidwalsh.name/php-notifications ) displays poorly on my system using Firefox. I can not see the full width of the page until I open the window very wide, then the page shrinks in width to about 25% of the browser window. Pretty strange.
Rejitha
hi David,

Have been struggling with a problem related to simplehtmldom. hope atleast i will get an help from you.

I am using PHPword and simplehtmldom to convert html to docx format.

I am using fckeditor html output as the input for converting to docx. A single line break is reflected as multiple lines in the docx converted file.

How do i fix this ? The enter key used in fckeditor is adding a p tag in the html output. Is there any number of line breaks defined for a para tag. how do i change it or where do i change it.

Please help
david
I have tried accessing some news sites but I cannot access the information using the code.
Salvatore Capolupo
Great and useful tutorial, thanks very much!
David Pham
I tried to use for a wordpress plugin but not sure why it always return a empty object.
erata
hi. thanks for tutorial. i have taken mistakes they say
————————————————————————————————-..
Warning: file_get_contents(http://davidwalsh.name/): in C:\xampp\htdocs\simple_html_dom\simple_html_dom.php on line 75

Fatal error: Call to a member function find() on a non-object in C:\xampp\htdocs\simple_html_dom\index.php on line 9
—————————————————————————————————–
firstly i used simple html dom 1.5 version and then 1.1 but nothing changed.
i don’t find where is the problem. what should i do?

thanks for tutorial and offer.
kalloo
Thank you David , it save my time.Great work.Thank you very much…
hammad
To exclude ads we can use regex something like

preg_replace('/<script(.*)/s', '', $variable);

joe hoeller

The WORKING version of my script is as follows, but I am trying to modify it to only run the loops on submit with PHP:

find('.article') as $element) {
        if($element->class == 'header2') {
             $ret['header2'] = $element->value;
        }
    }

    $ret['Glossary Term:'] = '' . $html->find('h2[class="header2"]', 0)->innertext . '';
    $ret['Definition:'] = '' . $html->find('.article p', 0)->innertext . '';

    return $ret;
}


$links = array (
    'http://www.ucla.edu/campus-life/la-lifestyle',
    'http://www.ucla.edu/students/prospective-students',
    'http://www.ucla.edu/',
);

?>

	body { margin: 0 auto; width:960px; padding: 20px; border: 1px solid #ccc; }
	h2 { margin:0; padding:0; }
	p { margin-top:0; padding-top:0; }



	$v) {
        echo ''.$k.''.$v.'';
    }
}

azi
https cannot load by simple html dom why.? can we fix this.?
Noel Whitemore
Thank you David – a very helpful article.

azi – you can use other methods for retrieving HTML content over HTTPS, such as file_get_contents() or cURL.
Omer Rosenbaum
Any idea how do I add a search form to the script above when scraping a DIV from a webpage?
mazhar
The page i am trying to crawl has 4 html tables in it and I want the data in the 3rd table only, any idea on how to get this done?
Raheel
Hi,

I want to know that is there is only one method find ? I want to grab one elements value and insert it somewhere else in the dom as we usually do in jQuery. How can we do in this library ?
steve
why using a library when there are so much powerful tools in php like dom->xPath?? Can’t understand
sunil
Call to a member function find() on a non-object in C:\xampp\htdocs\scrapping\simplehtmldom\simple_html_dom.php on line 1113
error in inner loop
Mick
When I want to use DOM and ensure I have the speed when parsing the template I use the PhpDomTemplate lib (https://github.com/tropotek/tk-domtemplate). Its not a jQuery type DOM template but it is built to ensure that you are not re-iterating over the DOM tree on every search for a node and is fast when compared to using standard DOM as a template system.
Adnan
Hi, and thank for the useful article. But I encountered a serious error while getting the html of a website :
“Warning: file_get_contents(http://www.djazairess.com/): failed to open stream: HTTP request failed! HTTP/1.1 500 ACT in C:\xampp\htdocs\crawl\simple_html_dom.php on line 555″
I don’t know how to fix it.
Ellys
Actually, what is usefulness of this Simple HTML DOM library?
