PHP IMDB Information Grabber

Written by David Walsh on August 28, 2008 · 53 Comments

It's been quite a while since I've written a PHP grabber and the itch finally got to me. This time the victim is the Internet Movie Database, otherwise known as IMDB. IMDB has info on every movie ever made (or so it seems). Their HTML source code is easy to parse so this one was a piece of cake.

The PHP

//url
$url = 'http://www.imdb.com/title/tt0367882/';

//get the page content
$imdb_content = get_data($url);

//parse for movie details
$name = get_match('/<title>(.*)<\/title>/isU',$imdb_content);
$director = strip_tags(get_match('/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',$imdb_content));
$plot = get_match('/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',$imdb_content);
$release_date = get_match('/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',$imdb_content);
$mpaa = get_match('/<a href="\/mpaa">MPAA<\/a>:<\/h5>(.*)<\/div>/isU',$imdb_content);
$run_time = get_match('/Runtime:<\/h5>(.*)<\/div>/isU',$imdb_content);

//build content
$content = '<h2>Film</h2><p>'.$name.'</p>';
$content.= '<h2>Director</h2><p>'.$director.'</p>';
$content.= '<h2>Plot</h2><p>'.substr($plot,0,strpos($plot,'<a')).'</p>';
$content.= '<h2>Release Date</h2><p>'.substr($release_date,0,strpos($release_date,'<a')).'</p>';
$content.= '<h2>MPAA</h2><p>'.$mpaa.'</p>';
$content.= '<h2>Run Time</h2><p>'.$run_time.'</p>';
$content.= '<h2>Full Details</h2><p><a href="'.$url.'" rel="nofollow">'.$url.'</a></p>';

echo $content;

//gets the match content, or an empty string when the pattern does not match
function get_match($regex,$content)
{
	if(preg_match($regex,$content,$matches)) {
		return $matches[1];
	}
	return '';
}

//gets the data from a URL
function get_data($url)
{
	$ch = curl_init();
	$timeout = 5;
	curl_setopt($ch,CURLOPT_URL,$url);
	curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
	curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data;
}

As with my other grabbers, the trick is always in the regular expressions. Note that the most important part of the URL is the string after "/title/". That string uniquely identifies the movie.
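Since that ID is the only part of the URL that changes from movie to movie, a small helper can pull it out of a title URL and rebuild the address from it. This is a minimal sketch: the function names `imdb_id_from_url` and `imdb_url_from_id` are hypothetical (they are not part of the script above), and the pattern assumes IDs look like `tt` followed by digits, as in the example URL.

```php
<?php
// Hypothetical helpers, not part of the original script.

// Returns the title ID (e.g. "tt0367882") from an IMDB title URL,
// or null when the URL does not contain one.
function imdb_id_from_url($url)
{
	// Assumption: IDs are "tt" followed by digits, right after "/title/".
	if (preg_match('#/title/(tt\d+)#', $url, $matches)) {
		return $matches[1];
	}
	return null;
}

// Builds a title URL in the same form as the $url used in the script.
function imdb_url_from_id($id)
{
	return 'http://www.imdb.com/title/'.$id.'/';
}

$id = imdb_id_from_url('http://www.imdb.com/title/tt0367882/');
echo $id."\n";
echo imdb_url_from_id($id)."\n";
```

Storing only the ID and rebuilding the URL when needed keeps a movie list compact and makes it easy to loop over several titles.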

Comments

  1. It’s a shame that IMDB doesn’t have an API. Because if they ever change the layout of their site this script would fail.

  2. @Mark: True, as is the case with any grabber. They really should put together an API.

  3. Nice, how can i grab the User Rating ?

  4. Cool! Thanks for the script and sharing.

  5. @Eran: By coding:

    $user_rating = get_match('/User Rating:<\/b> <b>(.*)<\/b>/isU',str_replace("\n",'',$imdb_content));
    
  6. Hey David, awesome code. Do you know of a way to grab the imdb movie id? All i have is the Movie name, and i dont have the id to pass into your URL var.

    So how would i pass the movie name, then grab the id, then bobs my uncle?

  7. I’ll send you a “remake” of your script in the next few minutes, I just wanted to add the option which Adam asked for… :)

  8. jaredmellentine September 27, 2008

    Very useful, although I’m curious why you wouldn’t use @file_get_contents() to get the page. CURL seems like overkill for this, but maybe it’s worth it for the timeout…

    Also, I would keep the regex in the function and return an array. Since the IMDB page structure could change at some point, it could break the regex. Having it in one place makes it easy to update – especially if you’re using this code in several places.

  9. There is a CPAN module for this for perl users. I use that in one of my scripts.

  10. I just tried to create a regex expression to go fetch the poster src url from the IMDB page, but nothing I try seems to work. Would you be able to provide me the correct code to fetch the URL of the poster from IMDB?

    (It’s embedded in the <a name="poster"> tag)

  11. hi guys,
    there could be a better method to get an image url from imdb but i thought i could give it a start

    here is the code

    $img = explode(“" “, $name); $img
    $img = get_match(‘/title=”‘.$img[0].’"”

    src=”(.*)” \/>/isU’,$imdb_content);

  12. IMDB really should have an API. Im sure they know that people copy their content either manually or dynamically like this. They should just save us the time and put together one…

    Also, do you think you could modify this to get the posters? I know that IMDB doesn’t allow hotlinking so you could spoof the referer using curl. Then maybe use some image hosting site’s API to upload that picture? It would be really useful.

  13. To show the user rating use the following code.

    $user_rating = get_match('/<div class="meta">\n<b>(.*)<\/b>/isU',$imdb_content);

  14. How do i get the cast to be shown?

  15. Mattias March 15, 2009

    I was looking for something like this, but when I try to run the code i got this error:
    Fatal error: Call to undefined function curl_init() in C:\Projekt\imdb.php on line 38

    Any ideas?

  16. Justin Bell March 18, 2009

    I had that problem Mattias, it is because you don’t have curl installed on your server.

  17. @bobby @roy

    i use this code to get the poster url if there is one.

    $poster = @get_match('/<a name="poster".* src="(.*)".*<\/a>/',$imdb_content);

  18. Very nice , i was looking for a script that could do it out of the box :)
    Maybe using Xpath instead of regex to scrape the page could save one some work when imdb changes their layout again (although maybe not enough to bother in the first place xD )

    Thanks for this snippet :)

  19. very good api! it is very important for us that the user rating is working. how to include the user rating into this php api?

    none of the solutions posted so far work!

  20. Very nice , has anyone tried writing a script that goes through all the films in imdb i.e.
    http://www.imdb.com/title/tt0000002/
    http://www.imdb.com/title/tt0000003/
    http://www.imdb.com/title/tt0000004/
    http://www.imdb.com/title/tt0999999/
    pulling out all the names etc and then using this info to perform a search on – how often would you need to update this ..

  21. With this grabber I’ll finish my wordpress plugin…. I’ll put the link for download here when it’s ready…. thanks David!

  22. Could you pls tell me how to give movie name instead of movie id

  23. Can we get a grabber that takes year, country & language as the input and then give the results as above? This way we get to know all the movies released in a particular year for a particular country for a particular language:)

  24. Awesome example. I’m new to cURL, and this is just what I needed. Thanks!

  25. how can i get the movie code like tt0367882 from the movie name ?

  26. i mean how can i make the script work for movie name not for code like tt0367882

  27. @oliver: really unnecessary to go this way. ‘coz if you want all you can simply grab their sql dump from ftp for peaceful (read fairuse) purpose.

  28. Hi,
    The code doesn’t work for me… It’s as if the grabbing doesn’t work. For example, the ‘name’-part only outputs the following html: “Film”, it appears $name has no value.
    Am I overlooking something? Thanks in advance!

  29. @Mattias: You have to install cURL: use a Linux/Unix server with the cURL extension for PHP enabled and, of course, with cURL installed on your Linux/Unix server.

  30. Thanks David for the awesome script.

    Is there any way to grab that content if we don’t have $url?
    For example, if I have a movie “My name is Khan”, but I don’t want to go to IMDB to find out the URL of that movie, how can I get URL automatically?

    Any help would be appreciated.

  31. Excellent. I’ll try this. Thanks David. :)

  32. @ Botnary:
    Thanks! I managed to get it working now.

    @Thong Tran:
    You can use the imdb-search engine instead ($url='http://www.imdb.com/find?s=all&q=yourtitle';), but it won’t work all the time. You can imagine why: Misspellings, movies with the same name, different movie titles in different countries etc.

    I used this script to grab the content (especially rating and runtime) of my entire movie collection (500+ films, you can imagine I used a loop :p). Using imdb’s search engine worked for 10-15% of my movies (and oh yeah, it took like 15 minutes:)..).

  33. searching for an older example i had someplace i stumbled across this short script working with cURL, so first i had to get that going in my localhost environment.

    next i wanted to change the url number so i can get through all the urls.
    i found the other examples here for the posters (doesn’t work with me somehow) and the user rating. here a small correction was required, blank spaces needed to be added. i also added a few other things i wanted to see.

    the partially working script (except for the posters) can be downloaded from http://www.downintheflood.com/download/imdb.txt
    then you save this file as imdb.php or imdb.php5 (as i did). to look for another number instead of the default number simply add ?number=999 when running the script from your site and you’ll get the info for movie number 999. i’ve stated in the script the last number available as of today.

    enjoy! and continue the great support!

  34. @yvyan::: where is the dump on the sql db ??
    @Joe Trixx: This text file crashed my browser every time must be huge – could you zip this and post the link in this forum – would love to take a look at this dump..

    Thanks

    Oliver

  35. it’s a very small file, just the script as above with few alterations: http://www.downintheflood.com/download/imdb.txt
    just putting the link here again. i have a problem with IE with this blog layout as well.

  36. i’ve worked on this file again. using a for loop now (which the user should set to the tt numbers he/she might be looking for, i.e. for ( $search=1; $search<=7500; $search++ ) for the first 7500 or for ( $search=18001; $search<=20000; $search++ ) for 18001 to 20000). don't grab them all, most likely you'll run out of memory. it saves the content to a csv file named imdb.csv. in this version i have temporarily disabled the display on the screen. also the links to the number or keyword search won't give satisfactory results. if the user doesn't want to include certain items, like the posters or awards, he/she can just comment out that line with //. so here's version two. be patient cause it runs slow writing the content to the files:
    http://www.downintheflood.com/download/imdb_v2.txt
    if someone is able to improve on the code you're welcome to share it. perhaps one can speed it up somewhat. i ran the script last night going for 100000 entries, but forget that, it stopped just over 9000 after 3 hours.

    good luck.

  37. to answer the question on the sql dump, i searched for that one and here it is: http://www.imdb.com/interfaces

  38. okay, sorry to be a nuisance, but i improvised the script again and got a version 4 now: http://www.downintheflood.com/download/imdb_v4.txt as well as a blank csv file except for the first line: http://www.downintheflood.com/download/imdb.csv and a script that can display a table from the csv file: http://www.downintheflood.com/download/imdb_csv.txt

    the alterations i’ve made were with the title, i had one str_replace in it too many, i also included a form that allows to enter a start and a stop number range for movies to search for.

  39. i came across another problem with the plot display, some plots seem to contain links: http://www.imdb.com/title/tt0000008/ if i display the details of that movie with the original script it breaks just before the link. i think there’s other problems as well, i was unable to pull the genre, the keywords and awards properly. that’s why i left them off in this version 4 in the meantime. maybe some other people can jump in and help as well. enjoy what’s been done so far!

  40. Tomaž June 23, 2010

    I was searching for something like that the other day and found this on sourceforge: http://sourceforge.net/projects/imdbphp/ It gets a lot of data from IMDB.

  41. Is there a way to get the cast?

  42. this script no longer works with the new IMDB layout. please post an updated version

  43. Thanks a lot for this script. I use it as to get the imdb url for a search string which I fire off to the facebook graph api for my fb movie app http://sharemovi.es

  44. nameless April 23, 2011

    im testing this script under windows. and have enabled Curl thingy, cant get more than title off it tho :/ any ideas? or any scripts? and how to add poster to it? :D

  45. Nice! I’ve written a PHP class that lets you give a title and year and it does the rest.

    https://github.com/aramkocharyan/IMDb-Scraper

    Cheers.

  46. I drop a leave a response when I especially enjoy a post on a website or I have something to valuable to contribute to the discussion. It is caused by the fire displayed in the article I browsed. And on this article PHP IMDB Information Grabber. I was excited enough to post a comment ;) I actually do have some questions for you if you tend not to mind. Is it just me or does it give the impression like a few of these remarks look like they are coming from brain dead individuals? :-P And, if you are posting at other places, I would like to keep up with anything new you have to post. Would you make a list all of all your communal pages like your Facebook page, twitter feed, or linkedin profile?

  47. Hey thanks for this!

    What you developed seems to be close to exactly what I need – but I need to cycle through the list of all my movies & retrieve the IMDB Movie ID based on that script which I wrote in VBA in Excel.

    Can you advise how I plug this code in?

  48. Hi there, I enjoy reading through your article. I wanted to write a
    little comment to support you.

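Two suggestions from the comments above fit together well: keep every pattern in one place and return an array, and return an empty value instead of failing when a pattern stops matching. Here is a minimal sketch of that idea. The function name `parse_imdb_page` and the sample markup are hypothetical; the patterns are copied from the article and assume the 2008-era page layout they were written for. Unlike the original script, this version strips all tags from each field rather than cutting the text at the first link.

```php
<?php
// Hypothetical helper: all patterns live in one array, so a layout
// change only has to be fixed here. Patterns are copied from the article.
function parse_imdb_page($html)
{
	$patterns = array(
		'name'         => '/<title>(.*)<\/title>/isU',
		'director'     => '/<h5[^>]*>Director:<\/h5>(.*)<\/div>/isU',
		'plot'         => '/<h5[^>]*>Plot:<\/h5>(.*)<\/div>/isU',
		'release_date' => '/<h5[^>]*>Release Date:<\/h5>(.*)<\/div>/isU',
		'run_time'     => '/Runtime:<\/h5>(.*)<\/div>/isU',
	);
	$info = array();
	foreach ($patterns as $field => $regex) {
		// A pattern that does not match yields an empty string, not a notice.
		$info[$field] = preg_match($regex, $html, $m) ? trim(strip_tags($m[1])) : '';
	}
	return $info;
}

// Made-up sample markup, for illustration only.
$sample = '<title>Example Film (2008)</title>'
	.'<div class="info"><h5>Director:</h5><a href="/name/nm0000000/">Jane Doe</a></div>'
	.'<div class="info"><h5>Runtime:</h5>90 min</div>';

print_r(parse_imdb_page($sample));
```

Fields whose pattern is missing from the page (here the plot and release date) come back as empty strings, so the calling code can decide whether to skip them.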
