Read PDF and Word DOC Files Using PHP

One of my customers has an insane amount of PDF and Microsoft Word DOC files on their website. The files are core to their online services, so it's not as though they're garbage sitting on the server. My customer wanted their website's search engine (Sphider) to read these PDF and DOC files so that their clients could get at the documents they needed without clicking through a bunch of summary pages. I was successful in the task, so let me show you how to read PDF and DOC files using PHP.

Reading PDF Files

To read PDF files, you will need to install the XPDF package, which includes "pdftotext." Once you have XPDF/pdftotext installed, you run the following PHP statement to get the PDF text:

$content = shell_exec('/usr/local/bin/pdftotext '.escapeshellarg($filename).' -'); // the trailing dash sends the text to stdout
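A slightly more defensive take on the call above could wrap it in a function that quotes the filename and reports failure. This is only a sketch: the function names buildPdfToTextCommand() and readPdfText() are made up for illustration, and the binary path is assumed to be the same /usr/local/bin/pdftotext used above.

```php
<?php
// Hypothetical helpers; the default binary path is an assumption
// copied from the example above and may differ on your server.

/**
 * Build the pdftotext command line. The trailing "-" tells pdftotext
 * to write the extracted text to stdout instead of a .txt file.
 */
function buildPdfToTextCommand(string $filename, string $binary = '/usr/local/bin/pdftotext'): string
{
    // escapeshellarg() quotes each argument so spaces or shell
    // metacharacters in a filename cannot break the command.
    return escapeshellarg($binary) . ' ' . escapeshellarg($filename) . ' -';
}

/**
 * Return the text of a PDF, or null if the file is unreadable,
 * shell_exec() is unavailable (for example, disabled by the host),
 * or pdftotext produced no output.
 */
function readPdfText(string $filename): ?string
{
    if (!is_readable($filename) || !function_exists('shell_exec')) {
        return null;
    }

    $output = shell_exec(buildPdfToTextCommand($filename));

    return (is_string($output) && $output !== '') ? $output : null;
}
```

With this in place, $content = readPdfText($filename); gives you either the text or null, so the calling code can tell an empty result apart from real content.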

Reading DOC Files

Like the PDF example above, you'll need to download another package. This package is called Antiword. Here's the code to grab the Word DOC content:

$content = shell_exec('/usr/local/bin/antiword '.escapeshellarg($filename));

The above code does NOT read DOCX files and does not (and purposely so) preserve formatting. There are other libraries that will preserve formatting but in our case, we just want to get at the text.
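If the Antiword call comes back empty, it helps to see the exit status and any error message. Here is a rough sketch that uses exec() instead of shell_exec() for that reason. The function name readDocText() is made up for illustration, the binary path is assumed to match the one above, and I'm assuming Antiword exits with a non-zero status when it fails. The "-w 0" option asks Antiword to put each paragraph on a single line rather than wrapping at a fixed width.

```php
<?php
// Hypothetical helper; the default binary path is an assumption
// copied from the example above and may differ on your server.

/**
 * Extract plain text from a legacy .doc file with antiword.
 * Returns the text on success, or null when the file is unreadable
 * or the command exits with a non-zero status.
 */
function readDocText(string $filename, string $binary = '/usr/local/bin/antiword'): ?string
{
    if (!is_readable($filename)) {
        return null;
    }

    // "-w 0" keeps each paragraph on one line.
    // "2>&1" folds error messages into the captured output for debugging.
    $command = escapeshellarg($binary) . ' -w 0 ' . escapeshellarg($filename) . ' 2>&1';

    // exec() fills $lines with the output and $exitCode with the
    // exit status, which shell_exec() does not expose.
    $lines = [];
    $exitCode = 0;
    exec($command, $lines, $exitCode);

    if ($exitCode !== 0) {
        error_log('antiword failed: ' . implode("\n", $lines));
        return null;
    }

    return implode("\n", $lines);
}
```

A call like $content = readDocText($filename); returns null on failure and leaves the reason in the PHP error log, which is handy when the same code works on one server but not another.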

A special thank you to Jeremy Parrish for his help and insight with this task.

Discussion

  1. Cool. I wonder if there is any solution that doesn’t need extra software, i.e. a simple PHP Class or something. Maybe a future project? Anyway, this solution is pretty elegant and works well if you have full access to the server your page is on.

  2. @Simon Sigurdhsson: I don’t know of a pure-PHP way of doing this. I know that if you’re on Windows/IIS, you can use the COM library. Other than that, these methods are the only I know of.

  3. I’ve wanted to read DOC files before but came out with nothing that would work with my hosting since it’s shared hosting and there is no allowance for shell_exec.

  4. I have tried this approach long back, but it doesn’t work for all PDF versions. Have you tested this with all PDF Versions?

  5. This could come in very handy with some of my projects. Thanks!

    Brenelz

  6. Looks good. Shame this requires extra software though. It would be good if you could maybe parse the document into an image.

  7. Simon

    For PDF’s Read all comments especially jorromer’s

    pdf to text. php manual page entry

  8. Max

    Is anyone able to provide a link to a 32 bit binary of the Linux version of antiword? I don’t have shell access to the server I work on.

  9. lame

    i have a project which entails searching through (.pdf / .doc) files, so this development will definitely come in handy! thnx, hope it does work or close to doing so…

  10. Koushik Ghosh

    Anyone please help me out.
    Currently I’m working in a project for that I have to read pdf, doc or docx file using php code from the localhost. Is it possible from localhost?Then please send me the code.

    Thank you,
    Koushik Ghosh

  11. That is cool, thanks for your post.

    You would like to post more shell tutorial.

    :D

    cong nguyen

    http://www.neoob.com/

  12. check this out – pdf to txt in pure php – http://community.livejournal.com/php/295413.html … back in 2005. leet. :|

    That should ease some future stress.. I hope that script is still functional, as I haven’t had a chance to try myself.

  13. Rony

    I need help about how to read bangla from doc file in php.If anyone know please send me code.It is very much urgent.

  14. \\.\

    How would you do this on a web hosted space, assuming that you have no access to services other than those set up for the package you buy, meaning that you can not install stuff on the remote computer?

  15. Anyone please help me out.
    Currently I’m working in a project for that I have to read and write doc or docx file using php xml code from the localhost. Is it possible from localhost?Then please send me the code or url.

    Thank you,
    Mindaugas

  16. ochi
    function openPdf()
    {
    var omyFrame = document.getElementById("myFrame");
    omyFrame.style.display="block";
    omyFrame.src = "myFile.pdf";
    }
    
  17. nanhe

    how can install xpdf on wamp server

  18. Thiru

    Thanks for ur help to giving the instructions for how to read data from pdf files.
    I did fallow the instructions whatever u have given.
    it is working fine for localhost which is on windows platform.
    Now i just wanted to run it on my web server…
    can u tell me what r the changes do i have to do and where?????????????
    Thanks inadvance……..

  19. tarun agarwal

    @Simon Sigurdhsson:

    can u get me the programming code for that search engine???…

  20. selvakumar

    @Thiru: thiru sir tell me how to store the content of word document into data base. please tell to me through my mail

  21. raji

    how to view the document file in the same page using php

  22. I just tested and its work, thanks for great article

  23. chaos

    antiword works like a charm from shell but i only get a slightly fucked up first line in my var if i run it from php, can u tell me what i’m doing wrong? :)

  24. SP

    I got the solution :
    $content = shell_exec('/usr/bin/antiword -f -w 0 formatting.doc');
    I forgot about -w argument, it will give you whole line or you can define value of -w as required width of line like 30, 40 etc.

  25. Marcelo de Almeida Braga

    On the local server (wamp) the code works. In the web server (linux) do not get the file’s contents.

    $content = shell_exec("pdftotext". "sumario.pdf." ' -');
    echo $content;
    

    Thanks in advance.

  26. is there any windows version?

  27. i’ve been using this. not bad at all. good work around. sadly antiword can’t read docx files

  28. I wonder if anyone actually get it to work. I’ve tried but it doesn’t work for me. Dyo, did you actually get it to work? How did you do different or what kind of web server you running on? I’ve tried it on Ubuntu10.04, Apache2 with PHP5. can you share your code here? Here’s mine:


    $filename = '/var/www/myfiles/mydoc.doc';
    $content = shell_exec('/usr/local/bin/antiword '.$filename);
    echo $content;

  30. sarfaraz ali

    For getting full string search from pdf then visit to sarfarazali.co.cc

  31. prase

    Hello i have tried your tutorial, i tried in local host. But nothing happen, can you give me solution ??

  32. Dude…. this is epic.

    I visited this page 2 days ago trying to develop a search engine for PDFs and had no clue what this meant. Now it makes sense to me and I’m going to use this. Thanks!

  33. Hello david, I used your xpdf and antiword.

    xpdf is working well with my php but antiword is not executing if antiword folder is not installed on c:/antiword/bin directory.

    I dont want to execute from the c:/ drive i would like to run it from php from my htdocs directory but its not working.

    How can I do this can anyone help???

    Example:::

    CODE FOR XPDF( Working ):
    $page_content = shell_exec('C:/xampp/htdocs/search-includes/xpdfbin-win-3.03/bin32/pdftotext '.$filename.' -');

    CODE FOR ANTIWORD: ( This is not working )
    $page_content = shell_exec("C:/xampp/htdocs/search-includes/antiword/bin/antiword ".$filename);

    CODE FOR ANTIWORD:( Working )
    $page_content = shell_exec("C:/antiword/bin/antiword ".$filename);

  34. Nick

    @Randy – search engine for PDFs

    Hi Randy,

    iF You have root access to your server, You could try Apache Tomcat with Apache SOLR and You will obtain the same effect for PDF, Word, and some other formats – should take a little time to check which formats are supported.

    Kind Regards,
    Nick

  35. Bhashitha

    Thankz ‘shivarajrh’ your piece of cord realy helps me….)

  36. Bastien

    Hello,

    I’d really like to use those two packages but i don’t really know how to install them ( I do have ssh access to my apache server but don’t know how to install this kind of package. )
    Could you help me ? I searched a lot on the web but did not find an adequat solution and don’t wan’t to make mistakes et troubles to my system.

  37. nice article… it would be better if you have made a demo

  38. It works well , i am creating a mobile handler that will open PDF files even in mobile phones without downloading it actually.

    I tested the code by installing XPDF and open files like this

    Thanks again.

  39. I tried this code
    error_reporting(0);
    $file=file_get_contents($_GET['url']);
    $file_name=rand(100000,100000000);
    file_put_contents($file_name,$file);
    $c=shell_exec('pdftotext '.$file_name);
    header('Content-Type: text/plain');
    echo $c;
    unlink($file_name);

  40. oaattia

    How can i install antiword and XPDF on my vps server, my VPS server runs redhat
    please any help appreciated …

  41. Raheem Shaik

    hii.. i want to know how to install those PACKAGES in linux.. please can anyone tell me the steps to install it..

  42. My hosting has been disabled ‘shell_exec’ function. What can i do it ? Thank you.

  43. Dolar

    Hi!
    please say me it’s work in windows os?
    pls pls reply me!!

    Thanks! in advanced.

  44. Gisele

    Hello, you mentioned that there are other libraries keeping formatting and images. I need directions grateful if possible help me

  45. Alex
    $content = shell_exec('/usr/local/bin/antiword '.$filename); 

    its worked, thanks
