Treehouse

PHP IMDB Scraper

By on  

It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the International Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content.= '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content
function get_match($regex,$content)
{
	preg_match($regex,$content,$matches);
	return $matches[1];
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.

ydkjs-5.png

Recent Features

  • Chris Coyier&#8217;s Favorite CodePen&nbsp;Demos

    David asked me if I'd be up for a guest post picking out some of my favorite Pens from CodePen. A daunting task! There are so many! I managed to pick a few though that have blown me away over...

  • Create Namespaced Classes with&nbsp;MooTools

    MooTools has always gotten a bit of grief for not inherently using and standardizing namespaced-based JavaScript classes like the Dojo Toolkit does.  Many developers create their classes as globals which is generally frowned up.  I mostly disagree with that stance, but each...

  • CSS&nbsp;Filters

    CSS filter support recently landed within WebKit nightlies. CSS filters provide a method for modifying the rendering of a basic DOM element, image, or video. CSS filters allow for blurring, warping, and modifying the color...

Incredible Demos

  • Introducing MooTools&nbsp;Dotter

    It's best practice to provide an indicator of some sort when performing an AJAX request or processing that takes place in the background. Since the dawn of AJAX, we've been using colorful spinners and imagery as indicators. While...

  • Do / Undo Functionality with&nbsp;MooTools

    We all know that do/undo functionality is a God send for word processing apps. I've used those terms so often that I think of JavaScript actions in terms of "do" an "undo." I've put together a proof of...

  • Fix Anchor URLs Using MooTools&nbsp;1.2

    The administrative control panel I build for my customers features FCKEditor, a powerful WYSIWYG editor that allows the customer to add links, bold text, create ordered lists, and so on. I provide training and documentation to the customers but...