O'Reilly

PHP IMDB Scraper

By on  

It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the International Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content.= '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content
function get_match($regex,$content)
{
	preg_match($regex,$content,$matches);
	return $matches[1];
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.

Treehouse Learning

Recent Features

  • Creating Scrolling Parallax Effects with CSS

    Introduction For quite a long time now websites with the so called "parallax" effect have been really popular. In case you have not heard of this effect, it basically includes different layers of images that are moving in different directions or with different speed. This leads to a...

  • Designing for Simplicity

    Before we get started, it's worth me spending a brief moment introducing myself to you. My name is Mark (or @integralist if Twitter happens to be your communication tool of choice) and I currently work for BBC News in London England as a principal engineer/tech...

Incredible Demos

  • Create a Brilliant Sprited, CSS-Powered Firefox Animation

    Mozilla recently formally announced Firefox OS and its partners at Mobile World Congress and I couldn't be more excited.  Firefox OS is going to change the lives of people in developing countries, hopefully making a name for itself in the US as well.  The...

  • CSS Selection Styling

    The goal of CSS is to allow styling of content and structure within a web page.  We all know that, right?  As CSS revisions arrive, we're provided more opportunity to control.  One of the little known styling option available within the browser is text selection styling....