
Welcome to the David Walsh Blog. I'm a MooTools, Dojo, jQuery, CSS, and PHP Web Developer located in Madison, Wisconsin, United States. Please contact me if I can make your experience on my website better.

PHP IMDB Information Grabber


It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the Internet Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse, so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for movie details
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content = '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
//drop everything from the first link onward; keep the whole string if there is no link
$content.= '<h2>Plot</h2><p>'.preg_replace('/<a.*$/s','',$plot).'</p>';
$content.= '<h2>Release Date</h2><p>'.preg_replace('/<a.*$/s','',$release_date).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content (empty string when the pattern doesn't match)
function get_match($regex,$content)
{
	if (preg_match($regex,$content,$matches)) {
		return $matches[1];
	}
	return '';
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.
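
As a rough sketch of that last point, the ID can be pulled out of a title URL with one more regular expression. The helper names `imdb_id_from_url()` and `imdb_url_from_id()` are hypothetical (they are not part of the script above), and the pattern assumes IDs always look like `tt` followed by digits, as in the example URL.

```php
<?php
// Hypothetical helper: returns the title ID (e.g. "tt0367882") or null.
// Assumes IDs are "tt" followed by digits, as in the example URL above.
function imdb_id_from_url($url)
{
	if (preg_match('#/title/(tt\d+)#', $url, $matches)) {
		return $matches[1];
	}
	return null;
}

// Hypothetical helper: rebuilds a title URL in the same shape as $url above.
function imdb_url_from_id($id)
{
	return 'http://www.imdb.com/title/' . $id . '/';
}

$id = imdb_id_from_url('http://www.imdb.com/title/tt0367882/');
echo $id, "\n";
echo imdb_url_from_id($id), "\n";
```

Keeping only the ID around means the grabber can be pointed at a different movie by changing one short string.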

Discussion

  1. August 28, 2008 @ 1:03 pm

    It’s a shame that IMDB doesn’t have an API. Because if they ever change the layout of their site this script would fail.

  2. August 28, 2008 @ 1:04 pm

    @Mark: True, as is the case with any grabber. They really should put together an API.

  3. eran
    August 30, 2008 @ 9:19 am

    Nice, how can i grab the User Rating ?

  4. fabian
    August 30, 2008 @ 2:01 pm

    Cool! Thanks for the script and sharing.

  5. August 31, 2008 @ 10:12 am

    @Eran: By coding:

    $user_rating = get_match('/User Rating:<\/b> <b>(.*)<\/b>/isU',str_replace("\n",'',$imdb_content));
    
  6. August 31, 2008 @ 4:54 pm

    Hey David, awesome code. Do you know of a way to grab the imdb movie id? All i have is the Movie name, and i dont have the id to pass into your URL var.

    So how would i pass the movie name, then grab the id, then bobs my uncle?

  7. fabian
    September 26, 2008 @ 6:52 pm

    I’ll send you a “remake” of your script in the next few minutes, I just wanted to add the option which Adam asked for… :)

  8. jaredmellentine
    September 27, 2008 @ 10:12 am

    Very useful, although I’m curious why you wouldn’t use @file_get_contents() to get the page. CURL seems like overkill for this, but maybe it’s worth it for the timeout…

    Also, I would keep the regex in the function and return an array. Since the IMDB page structure could change at some point, it could break the regex. Having it in one place makes it easy to update – especially if you’re using this code in several places.

  9. September 27, 2008 @ 11:13 am

    There is a CPAN module for this for perl users. I use that in one of my scripts.

  10. bobby
    October 27, 2008 @ 5:58 am

    I just tried to create a regex expression to go fetch the poster src url from the IMDB page, but nothing I try seems to work. Would you be able to provide me the correct code to fetch the URL of the poster from IMDB?

    (It’s embodied in the <a name="poster" tag)

  11. November 6, 2008 @ 5:35 am

    hi guys,
    there could be a better method to get an image url from imdb but i thought i could give it a start

    here is the code

    $img = explode(“" “, $name); $img
    $img = get_match(‘/title=”‘.$img[0].’"”

    src=”(.*)” \/>/isU’,$imdb_content);

  12. January 13, 2009 @ 9:47 pm

    IMDB really should have an API. I’m sure they know that people copy their content either manually or dynamically like this. They should just save us the time and put one together…

    Also, do you think you could modify this to get the posters? I know that IMDB doesn’t allow hotlinking, so you could spoof the referer using curl. Then maybe use some image hosting site’s API to upload that picture? It would be really useful.

  13. February 8, 2009 @ 11:58 pm

    To show the user rating use the following code.

    $user_rating = get_match('/<div class="meta">\n<b>(.*)<\/b>/isU',$imdb_content);

  14. john
    February 13, 2009 @ 3:10 pm

    How do i get the cast to be shown?

  15. mattias
    March 15, 2009 @ 8:57 am

    I was looking for something like this, but when I try to run the code i got this error:
    Fatal error: Call to undefined function curl_init() in C:\Projekt\imdb.php on line 38

    Any ideas?

  16. justin bell
    March 18, 2009 @ 8:30 pm

    I had that problem Mattias, it is because you don’t have curl installed on your server.

  17. hldn
    March 28, 2009 @ 8:28 pm

    @bobby @roy

    i use this code to get the poster url if there is one.

    $poster = @get_match('/<a name="poster".* src="(.*)".*<\/a>/',$imdb_content);

  18. manuel
    April 28, 2009 @ 12:46 pm

    Very nice , i was looking for a script that could do it out of the box :)
    Maybe using Xpath instead of regex to scrape the page could save one some work when imdb changes their layout again (although maybe not enough to bother in the first place xD )

    Thanks for this snippet :)

  19. vadi
    May 24, 2009 @ 10:07 pm

    very good api! it is very important for us that the user rating is working. how to include the user rating into this php api?

    none of the suggested solutions work!

  20. July 29, 2009 @ 10:26 am

    Very nice , has anyone tried writing a script that goes through all the films in imdb i.e.
    http://www.imdb.com/title/tt0000002/
    http://www.imdb.com/title/tt0000003/
    http://www.imdb.com/title/tt0000004/
    http://www.imdb.com/title/tt0999999/
    pulling out all the names etc and then using this info to perform a search on – how often would you need to update this ..

  21. August 18, 2009 @ 5:29 pm

    With this grabber I’ll finish my wordpress plugin…. I’ll put the link for download here when it’s ready…. thanks David!

  22. August 24, 2009 @ 6:14 am

    Could you pls tell me how to give movie name instead of movie id

  23. tejas
    August 31, 2009 @ 2:06 am

    Can we get a grabber that takes year, country & language as the input and then give the results as above? This way we get to know all the movies released in a particular year for a particular country for a particular language:)

  24. September 5, 2009 @ 10:36 am

    Awesome example. I’m new to cURL, and this is just what I needed. Thanks!

  25. September 29, 2009 @ 8:50 am

    how can i get the movie code like tt0367882 from the movie name ?

  26. September 29, 2009 @ 8:53 am

    i mean how can i make the script work for movie name not for code like tt0367882

  27. vyvyan
    November 22, 2009 @ 3:38 pm

    @oliver: really unnecessary to go this way. ‘coz if you want all you can simply grab their sql dump from ftp for peaceful (read fairuse) purpose.

  28. matthew
    January 13, 2010 @ 7:19 am

    Hi,
    The code doesn’t work for me… It’s as if the grabbing doesn’t work. For example, the ‘name’-part only outputs the following html: “Film”, it appears $name has no value.
    Am I overlooking something? Thanks in advance!

  29. February 8, 2010 @ 6:15 am

    @Mattias: You have to install cURL: use a Linux/Unix server with the cURL extension for PHP enabled and, of course, with cURL installed on your Linux/Unix server.

  30. February 8, 2010 @ 10:11 pm

    Thanks David for the awesome script.

    Is there any way to grab that content if we don’t have $url?
    For example, if I have a movie “My name is Khan”, but I don’t want to go to IMDB to find out the URL of that movie, how can I get URL automatically?

    Any help would be appreciated.

  31. February 9, 2010 @ 4:15 pm

    Excellent. I’ll try this. Thanks David. :)

  32. matthew
    February 9, 2010 @ 4:26 pm

    @ Botnary:
    Thanks! I managed to get it working now.

    @Thong Tran:
    You can use the imdb-search engine instead ($url='http://www.imdb.com/find?s=all&q=yourtitle';), but it won’t work all the time. You can imagine why: Misspellings, movies with the same name, different movie titles in different countries etc.

    I used this script to grab the content (especially rating and runtime) of my entire movie collection (500+ films, you can imagine I used a loop :p). Using imdb’s search engine worked for 10-15% of my movies (and oh yeah, it took like 15 minutes:)..).

  33. April 15, 2010 @ 9:47 am

    searching for an older example i had someplace i stumbled across this short script working with cURL, so first i had to get that going in my localhost environment.

    next i wanted to change the url number so i can get through all the urls.
    i found the other examples here for the posters (doesn’t work with me somehow) and the user rating. here a small correction was required, blank spaces needed to be added. i also added a few other things i wanted to see.

    the partially working script (except for the posters) can be downloaded from http://www.downintheflood.com/download/imdb.txt
    then you save this file as imdb.php or imdb.php5 (as i did). to look for another number instead of the default number simply add ?number=999 when running the script from your site and you’ll get the info for movie number 999. i’ve stated in the script the last number available as of today.

    enjoy! and continue the great support!

  34. oliver
    April 16, 2010 @ 8:59 am

    @vyvyan: where is the dump on the sql db ??
    @Joe Trixx: This text file crashed my browser every time must be huge – could you zip this and post the link in this forum – would love to take a look at this dump..

    Thanks

    Oliver

  35. April 16, 2010 @ 9:38 am

    it’s a very small file, just the script as above with few alterations: http://www.downintheflood.com/download/imdb.txt
    just putting the link here again. i have a problem with IE with this blog layout as well.

  36. April 18, 2010 @ 6:20 am

    i’ve worked on this file again. using a for loop now (which the user should set to the tt numbers he/she might be looking for, i.e. for ( $search=1; $search<=7500; $search++ ) for the first 7500 or for ( $search=18001; $search<=20000; $search++ ) for 18001 to 20000). don't grab them all, most likely you'll run out of memory. it saves the content to a csv file named imdb.csv. in this version i have temporarily disabled the display on the screen. also the links to the number or keyword search won't give satisfactory results. if the user doesn't want to include certain items, like the posters or awards, he/she can just comment out that line with //. so here's version two. be patient cause it runs slow writing the content to the files:
    http://www.downintheflood.com/download/imdb_v2.txt
    if someone is able to improve on the code you’re welcome to share it. perhaps one can speed it up somewhat. i ran the script last night going for 100000 entries, but forget that, it stopped just over 9000 after 3 hours.

    good luck.

  37. April 18, 2010 @ 6:32 am

    to answer the question on the sql dump, i searched for that one and here it is: http://www.imdb.com/interfaces

  38. April 18, 2010 @ 1:51 pm

    okay, sorry to be a nuisance, but i improvised the script again and got a version 4 now: http://www.downintheflood.com/download/imdb_v4.txt as well as a blank csv file except for the first line: http://www.downintheflood.com/download/imdb.csv and a script that can display a table from the csv file: http://www.downintheflood.com/download/imdb_csv.txt

    the alterations i’ve made were with the title, i had one str_replace in it too many, i also included a form that allows to enter a start and a stop number range for movies to search for.

  39. April 18, 2010 @ 1:53 pm

    i came across another problem with the plot display, some plots seem to contain links: http://www.imdb.com/title/tt0000008/ if i display the details of that movie with the original script it breaks just before the link. i think there’s other problems as well, i was unable to pull the genre, the keywords and awards properly. that’s why i left them off in this version 4 in the meantime. maybe some other people can jump in and help as well. enjoy what’s been done so far!

  40. tomaž
    June 23, 2010 @ 8:21 am

    I was searching for something like that the other day and found this on sourceforge: http://sourceforge.net/projects/imdbphp/ It gets a lot of data from IMDB.

  41. lenny
    August 28, 2010 @ 7:35 am

    Is there a way to get the cast?
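
Several commenters asked how to start from a movie name rather than an ID. Here is a minimal sketch, assuming the search URL matthew quotes above (`http://www.imdb.com/find?s=all&q=...`) and assuming the result page links to titles as `/title/tt.../`. The names `imdb_search_url()` and `imdb_ids_in_html()` are hypothetical, and the sample markup is made up for illustration; it is not real IMDB output.

```php
<?php
// Hypothetical helper: builds the search URL quoted in the thread,
// letting http_build_query() handle the escaping of the movie name.
function imdb_search_url($name)
{
	$query = http_build_query(array('s' => 'all', 'q' => $name), '', '&');
	return 'http://www.imdb.com/find?' . $query;
}

// Hypothetical helper: collects the distinct title IDs linked from a page,
// in the order they first appear. Assumes links of the form /title/tt.../
function imdb_ids_in_html($html)
{
	preg_match_all('#/title/(tt\d+)/#', $html, $matches);
	return array_values(array_unique($matches[1]));
}

echo imdb_search_url('My Name Is Khan'), "\n";

// Made-up sample markup standing in for a fetched results page
$sample = '<a href="/title/tt0000001/">A</a> '
	. '<a href="/title/tt0000002/">B</a> '
	. '<a href="/title/tt0000001/">A</a>';
print_r(imdb_ids_in_html($sample));
```

In practice the page content would come from `get_data(imdb_search_url($name))`, and, as matthew points out, the first ID found is not guaranteed to be the movie you meant.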
