PHP IMDB Scraper

By  on  

It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the International Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content.= '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content
function get_match($regex,$content)
{
	preg_match($regex,$content,$matches);
	return $matches[1];
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.

Recent Features

  • By
    Responsive and Infinitely Scalable JS Animations

    Back in late 2012 it was not easy to find open source projects using requestAnimationFrame() - this is the hook that allows Javascript code to synchronize with a web browser's native paint loop. Animations using this method can run at 60 fps and deliver fantastic...

  • By
    fetch API

    One of the worst kept secrets about AJAX on the web is that the underlying API for it, XMLHttpRequest, wasn't really made for what we've been using it for.  We've done well to create elegant APIs around XHR but we know we can do better.  Our effort to...

Incredible Demos

  • By
    Implementing Basic and Fancy Show/Hide in MooTools 1.2

    One of the great parts of MooTools is that the library itself allows for maximum flexibility within its provided classes. You can see evidence of this in the "Class" class' implement method. Using the implement method, you can add your own methods to...

  • By
    Drag &#038; Drop Elements to the Trash with MooTools 1.2

    Everyone loves dragging garbage files from their desktop into their trash can. There's a certain amount of irony in doing something on your computer that you also do in real life. It's also a quick way to get rid of things. That's...