PHP IMDB Scraper

It's been quite a while since I've written a PHP grabber, and the itch finally got to me. This time the victim is the Internet Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse, so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for movie details
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content = '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
//drop everything from the first link onward; text that contains no link is kept whole
$content.= '<h2>Plot</h2><p>'.preg_replace('/<a.*$/is','',$plot).'</p>';
$content.= '<h2>Release Date</h2><p>'.preg_replace('/<a.*$/is','',$release_date).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the first captured group, or an empty string if the pattern does not match
function get_match($regex,$content)
{
	return preg_match($regex,$content,$matches) ? $matches[1] : '';
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}
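
A quick way to check the patterns without hitting IMDB is to run them against a small HTML fragment. The sketch below assumes a made-up fragment that follows the same "h5 label, then content, then closing div" layout the patterns expect; the director name and markup are hypothetical, not real IMDB output. It also shows what the U (ungreedy) modifier is doing: without it, the capture runs on to the last closing div instead of the first.

```php
<?php
// Hypothetical fragment modelled on the <h5>Label:</h5> ... </div> layout
// that the patterns in this post expect. Not real IMDB markup.
$sample = '<div class="info"><h5>Director:</h5><a href="/name/nm0000000/">Jane Doe</a></div>'
        . '<div class="info"><h5>Runtime:</h5>120 min</div>';

// Returns the first captured group, or an empty string when nothing matches.
function get_match($regex, $content)
{
	return preg_match($regex, $content, $matches) ? $matches[1] : '';
}

// With the U modifier, (.*) stops at the first closing </div>.
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU', $sample));
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU', $sample);

// Without U, (.*) is greedy and swallows everything up to the last </div>.
$greedy = get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/is', $sample);

echo $director, PHP_EOL;
echo $run_time, PHP_EOL;
echo strip_tags($greedy), PHP_EOL;
```

If a field comes back empty or far too long, the fragment is a convenient place to adjust the pattern before pointing it at the live page.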

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/" (tt0367882 in the example URL). That string uniquely identifies the movie.
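
Since the title ID is the only part of the URL that changes from movie to movie, it can be handy to pull it out or to rebuild the URL from it. Here is a minimal sketch; get_title_id() and get_title_url() are hypothetical helpers that are not part of the script above, and the "tt followed by digits" pattern is an assumption based on the example URL.

```php
<?php
// Hypothetical helper: returns the title ID from an IMDB title URL,
// or an empty string if the URL has no "/title/tt<digits>" segment.
// Assumes IDs look like "tt" followed by digits, as in the example URL.
function get_title_id($url)
{
	return preg_match('/\/title\/(tt\d+)/', $url, $matches) ? $matches[1] : '';
}

// Hypothetical helper: rebuilds the title URL from an ID,
// using the same URL layout as the example in this post.
function get_title_url($id)
{
	return 'http://www.imdb.com/title/'.$id.'/';
}

$id = get_title_id('http://www.imdb.com/title/tt0367882/');
echo $id, PHP_EOL;                // tt0367882
echo get_title_url($id), PHP_EOL; // http://www.imdb.com/title/tt0367882/
```

With helpers like these, the grabber could take just an ID as input and build $url itself.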
