O'Reilly

PHP IMDB Scraper

By on  

It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the International Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for product name
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content.= '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content
function get_match($regex,$content)
{
	preg_match($regex,$content,$matches);
	return $matches[1];
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.

Track.js Error Reporting

Upcoming Events

Recent Features

  • Creating Scrolling Parallax Effects with CSS

    Introduction For quite a long time now websites with the so called "parallax" effect have been really popular. In case you have not heard of this effect, it basically includes different layers of images that are moving in different directions or with different speed. This leads to a...

  • Responsive and Infinitely Scalable JS Animations

    Back in late 2012 it was not easy to find open source projects using requestAnimationFrame() – this is the hook that allows Javascript code to synchronize with a web browser's native paint loop. Animations using this method can run at 60 fps and deliver fantastic...

Incredible Demos

  • CSS Tooltips

    We all know that you can make shapes with CSS and a single HTML element, as I've covered in my CSS Triangles and CSS Circles posts.  Triangles and circles are fairly simply though, so as CSS advances, we need to stretch the boundaries...

  • Dynamically Load Stylesheets Using MooTools 1.2

    Theming has become a big part of the Web 2.0 revolution. Luckily, so too has a higher regard for semantics and CSS standards. If you build your pages using good XHTML code, changing a CSS file can make your website look completely different....

Recently on David Walsh Blog

  • OâReilly Velocity Conference â New York

    My favorite front-end conference has always been O'Reilly's Velocity Conference because the conference series has focused on one of the most undervalued parts of client side coding:  speed.  So often we're so excited that our JavaScript works that we forget that speed, efficiency, and performance are just as important. The next Velocity...

  • Free Download: Font Bundle Featuring 17 Incredible Typefaces

    The only thing we love more than a good font, is a good free font. So we’ve combed the Web for some of our favorite free fonts, and gathered them here in a single download. You’ll find a variety of useful typefaces, from highly geometric designs...

  • OâReilly Velocity Conference â Amsterdam

    My favorite front-end conference has always been O'Reilly's Velocity Conference because the conference series has focused on one of the most undervalued parts of client side coding:  speed.  So often we're so excited that our JavaScript works that we forget that speed, efficiency, and performance are just as important. The next Velocity...

  • CanIUse Command Line

    Every front-end developer should be well acquainted with CanIUse, the website that lets you view browser support for browser features.  When people criticize my blog posts for not detailing browser support for features within the post, I tell them to check CanIUse:  always up to date, unlike...

  • Generating Alternative Stylesheets for Browsers Without @media

    If your CSS code is built with a mobile-first approach, it probably contains all the rules that make up the "desktop" view inside @media statements. That's great, but browsers that don't support media queries (IE 8 and below) will simply ignore them, ending up getting the...