PHP IMDB Scraper

By  on  

It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the International Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content.= '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content
function get_match($regex,$content)
{
	preg_match($regex,$content,$matches);
	return $matches[1];
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.

Recent Features

  • By
    Chris Coyier&#8217;s Favorite CodePen Demos

    David asked me if I'd be up for a guest post picking out some of my favorite Pens from CodePen. A daunting task! There are so many! I managed to pick a few though that have blown me away over the past few months. If you...

  • By
    Conquering Impostor Syndrome

    Two years ago I documented my struggles with Imposter Syndrome and the response was immense.  I received messages of support and commiseration from new web developers, veteran engineers, and even persons of all experience levels in other professions.  I've even caught myself reading the post...

Incredible Demos

  • By
    Generate Dojo GFX Drawings from SVG Files

    One of the most awesome parts of the Dojo / Dijit / DojoX family is the amazing GFX library.  GFX lives within the dojox.gfx namespace and provides the foundation of Dojo's charting, drawing, and sketch libraries.  GFX allows you to create vector graphics (SVG, VML...

  • By
    Create a Photo Stack Effect with Pure CSS Animations or MooTools

    My favorite technological piece of Google Plus is its image upload and display handling.  You can drag the images from your OS right into a browser's DIV element, the images upload right before your eyes, and the albums page displays a sexy photo deck animation...