O'Reilly

PHP IMDB Scraper

By on  

It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the International Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content.= '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content
function get_match($regex,$content)
{
	preg_match($regex,$content,$matches);
	return $matches[1];
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.

Track.js Error Reporting

Recent Features

  • From Webcam to Animated GIF: the Secret Behind chat.meatspac.es!

    My team mate Edna Piranha is not only an awesome hacker; she's also a fantastic philosopher! Communication and online interactions is a subject that has kept her mind busy for a long time, and it has also resulted in a bunch of interesting experimental projects...

  • Designing for Simplicity

    Before we get started, it's worth me spending a brief moment introducing myself to you. My name is Mark (or @integralist if Twitter happens to be your communication tool of choice) and I currently work for BBC News in London England as a principal engineer/tech...

Incredible Demos

  • MooTools Flashlight Effect

    Another reason that I love Twitter so much is that I'm able to check out what fellow developers think is interesting. Chris Coyier posted about a flashlight effect he found built with jQuery. While I agree with Chris that it's a little corny, it...

  • Facebook Open Graph META Tags

    It's no secret that Facebook has become a major traffic driver for all types of websites.  Nowadays even large corporations steer consumers toward their Facebook pages instead of the corporate websites directly.  And of course there are Facebook "Like" and "Recommend" widgets on every website.  One...

Recently on David Walsh Blog

  • Chris Coyierâs Favorite CodePen Demos II

    Hey everyone! Before we get started, I just want to say it’s damn hard to pick this few favorites on CodePen. Not because, as a co-founder of CodePen, I feel like a dad picking which kid he likes best (RUDE). But because there is just so...

  • GSAP + SVG For Power Users: Motion Along A Path

    Now that the GreenSock API is picking up steam, there are many tutorials and Getting Started guides out there to provide good introductions to the library, not to mention GreenSock’s own Forum and Documentation. This article isn’t intended for beginners, but rather a...

  • Copy a Directory from Command Line

    Copying a directory for the sake of backup is something I do often, especially when I'm trying to figure out why something isn't working when I use an external library.  I'll copy the directory structure as a backup, mess around with the original source until I find a solution,...

  • Hotjar &#8211; All-in-one Analytics and Feedback

    Website analytics are a massive business -- the more data you can collect with regard to your users' behaviors on your site, the more you can increase and maximise conversion...and increased conversion is always good.  Sometimes increase conversion means more money, improved user experience, viewer retention,...

  • Crafting a 3D React Carousel

    There is something in me that is amazed but beautiful 3D interfaces. And it doesn’t matter whether they’re functional like Gyroscope features menu, technology demonstrators like the amazing periodic table demo from famous or they’re artistic representation pushing the limits of...