Read PDF and Word DOC Files Using PHP


One of my customers has an insane number of PDF and Microsoft Word DOC files on their website. The files are core to their online services, so they aren't just clutter sitting on the server. My customer wanted their website's search engine (Sphider) to read these PDF and DOC files so that their clients could get to the documents they needed without clicking through a bunch of summary pages. I got it working, so let me show you how to read PDF and DOC files using PHP.

Reading PDF Files

To read PDF files, you will need to install the XPDF package, which includes "pdftotext." Once you have XPDF/pdftotext installed, run the following PHP statement to get the PDF text:

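$content = shell_exec('/usr/local/bin/pdftotext '.escapeshellarg($filename).' -'); // escapeshellarg() keeps spaces and shell characters in the filename safe; the dash at the end sends the text to standard output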
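
If you want a little more safety around that one-liner, here is a minimal sketch of a wrapper. The function names (build_pdftotext_command, read_pdf_text) are hypothetical, the binary path is simply the one used above and may differ on your server, and the nullable return types assume PHP 7.1 or later. It quotes the filename, checks that shell_exec is available, and returns null instead of failing silently:

```php
<?php
// Hypothetical helper: builds the pdftotext command line for one file.
// The default path is the one used in this article; adjust it for your server.
function build_pdftotext_command(string $filename, string $binary = '/usr/local/bin/pdftotext'): string
{
    // escapeshellarg() quotes the filename so spaces and shell metacharacters are passed literally.
    // The trailing dash tells pdftotext to write the text to standard output.
    return $binary . ' ' . escapeshellarg($filename) . ' -';
}

// Hypothetical helper: returns the PDF text, or null if it could not be read.
function read_pdf_text(string $filename, string $binary = '/usr/local/bin/pdftotext'): ?string
{
    // Some shared hosts disable shell_exec; bail out early in that case.
    if (!function_exists('shell_exec') || !is_readable($filename)) {
        return null;
    }

    $output = shell_exec(build_pdftotext_command($filename, $binary));

    // shell_exec() gives back null or false when the command fails or prints nothing.
    return is_string($output) ? $output : null;
}
```

Calling read_pdf_text($filename) then gives you either the text or null, which is easy to check before handing the content to the search indexer.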

Reading DOC Files

Like the PDF example above, you'll need to download another package. This package is called Antiword. Here's the code to grab the Word DOC content:

$content = shell_exec('/usr/local/bin/antiword '.escapeshellarg($filename)); // escapeshellarg() keeps spaces and shell characters in the filename safe

The above code does NOT read DOCX files and, by design, does not preserve formatting. There are other libraries that will preserve formatting, but in our case we just want to get at the text.
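
Since a search indexer usually meets both file types, the two commands can be put behind one function. This is only a sketch under a few assumptions: the names EXTRACTORS, extractor_command and extract_text are hypothetical, the binary paths are the ones from the examples above, files are told apart by extension alone, and the syntax needs PHP 7.1 or later. Anything that is not a PDF or DOC file, DOCX included, comes back as null:

```php
<?php
// Hypothetical lookup table: file extension => command template.
// %s is replaced with the shell-quoted filename.
const EXTRACTORS = [
    'pdf' => '/usr/local/bin/pdftotext %s -',
    'doc' => '/usr/local/bin/antiword %s',
];

// Hypothetical helper: returns the command for a file, or null for unsupported types.
function extractor_command(string $filename): ?string
{
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    if (!isset(EXTRACTORS[$extension])) {
        return null; // e.g. DOCX, which antiword does not read
    }

    return sprintf(EXTRACTORS[$extension], escapeshellarg($filename));
}

// Hypothetical helper: returns the plain text of a PDF or DOC file, or null on failure.
function extract_text(string $filename): ?string
{
    $command = extractor_command($filename);

    if ($command === null || !function_exists('shell_exec') || !is_readable($filename)) {
        return null;
    }

    $output = shell_exec($command);

    return is_string($output) ? $output : null;
}
```

Keeping the command templates in one array means that supporting another format later is a one-line change, provided a command-line converter for it is installed.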

A special thank you to Jeremy Parrish for his help and insight with this task.

  1. Cool. I wonder if there is any solution that doesn’t need extra software, i.e. a simple PHP Class or something. Maybe a future project? Anyway, this solution is pretty elegant and works well if you have full access to the server your page is on.

  2. @Simon Sigurdhsson: I don’t know of a pure-PHP way of doing this. I know that if you’re on Windows/IIS, you can use the COM library. Other than that, these methods are the only I know of.

  3. I’ve wanted to read DOC files before but came out with nothing that would work with my hosting since it’s shared hosting and there is no allowance for shell_exec.

  4. I have tried this approach long back, but it doesn’t work for all PDF versions. Have you tested this with all PDF Versions?

  5. This could come in very handy with some of my projects. Thanks!


  6. Looks good. Shame this requires extra software though. It would be good if you could maybe parse the document into an image.

  7. Simon

    For PDF’s Read all comments especially jorromer’s

    pdf to text. php manual page entry

  8. Max

    Is anyone able to provide a link to a 32 bit binary of the Linux version of antiword? I don’t have shell access to the server I work on.

  9. lame

    I have a project that requires searching through .pdf and .doc files, so this will definitely come in handy! Thanks, I hope it works or comes close to doing so…

  10. Koushik Ghosh

    Anyone please help me out.
    Currently I’m working on a project for which I have to read PDF, DOC or DOCX files using PHP code from localhost. Is that possible from localhost? If so, please send me the code.

    Thank you,
    Koushik Ghosh

  11. That is cool, thanks for your post.

    You would like to post more shell tutorial.


    cong nguyen


  12. check this out – pdf to txt in pure php – http://community.livejournal.com/php/295413.html … back in 2005. leet. :|

    That should ease some future stress.. I hope that script is still functional, as I haven’t had a chance to try myself.

  13. Rony

    I need help about how to read bangla from doc file in php.If anyone know please send me code.It is very much urgent.

  14. \\.\

    How would you do this on a web hosted space, assuming that you have no access to services other than those set up for the package you buy, meaning that you can not install stuff on the remote computer?

  15. Anyone please help me out.
    Currently I’m working on a project for which I have to read and write DOC or DOCX files using PHP XML code from localhost. Is that possible from localhost? If so, please send me the code or a URL.

    Thank you,

  16. ochi
    function openPdf() {
        var omyFrame = document.getElementById("myFrame");
        omyFrame.src = "myFile.pdf";
    }
  17. nanhe

    how can install xpdf on wamp server

  18. Thiru

    Thanks for your help in giving the instructions for reading data from PDF files.
    I followed the instructions you gave.
    It is working fine on localhost, which is on a Windows platform.
    Now I want to run it on my web server…
    Can you tell me what changes I have to make, and where?
    Thanks in advance…

  19. tarun agarwal

    @Simon Sigurdhsson:

    can u get me the programming code for that search engine???…

  20. selvakumar

    @Thiru: thiru sir tell me how to store the content of word document into data base. please tell to me through my mail

  21. raji

    how to view the document file in the same page using php

  22. I just tested and its work, thanks for great article

  23. chaos

    antiword works like a charm from shell but i only get a slightly fucked up first line in my var if i run it from php, can u tell me what i’m doing wrong? :)

  24. SP

    I got the solution :
    $content = shell_exec('/usr/bin/antiword -f -w 0 formatting.doc');
    I forgot about -w argument, it will give you whole line or you can define value of -w as required width of line like 30, 40 etc.

  25. Marcelo de Almeida Braga

    On the local server (wamp) the code works. In the web server (linux) do not get the file’s contents.

    $content = shell_exec(“pdftotext”. “sumario.pdf.” ‘ -‘);
    echo $content;

    Thanks in advance.

  26. is there any windows version?

  27. i’ve been using this. not bad at all. good work around. sadly antiword can’t read docx files

  28. I wonder if anyone actually get it to work. I’ve tried but it doesn’t work for me. Dyo, did you actually get it to work? How did you do different or what kind of web server you running on? I’ve tried it on Ubuntu10.04, Apache2 with PHP5. can you share your code here? Here’s mine:

  29. $filename = '/var/www/myfiles/mydoc.doc';
    $content = shell_exec('/usr/local/bin/antiword '.$filename);
    echo $content;

  30. sarfaraz ali

    For getting full string search from pdf then visit to sarfarazali.co.cc

  31. prase

    Hello i have tried your tutorial, i tried in local host. But nothing happen, can you give me solution ??

  32. Dude…. this is epic.

    I visited this page 2 days ago trying to develop a search engine for PDFs and had no clue what this meant. Now it makes sense to me and I’m going to use this. Thanks!

  33. Hello david, I used your xpdf and antiword.

    xpdf is working well with my php but antiword is not executing if antiword folder is not installed on c:/antiword/bin directory.

    I dont want to execute from the c:/ drive i would like to run it from php from my htdocs directory but its not working.

    How can I do this can anyone help???


    CODE FOR XPDF( Working ):
    $page_content = shell_exec('C:/xampp/htdocs/search-includes/xpdfbin-win-3.03/bin32/pdftotext '.$filename.' -');

    CODE FOR ANTIWORD: ( This is not working )
    $page_content = shell_exec("C:/xampp/htdocs/search-includes/antiword/bin/antiword ".$filename);

    CODE FOR ANTIWORD:( Working )
    $page_content = shell_exec("C:/antiword/bin/antiword ".$filename);

  34. Nick

    @Randy – search engine for PDFs

    Hi Randy,

    iF You have root access to your server, You could try Apache Tomcat with Apache SOLR and You will obtain the same effect for PDF, Word, and some other formats – should take a little time to check which formats are supported.

    Kind Regards,

  35. Bhashitha

    Thanks ‘shivarajrh’, your piece of code really helps me…

  36. Bastien


    I’d really like to use those two packages but I don’t really know how to install them. (I do have SSH access to my Apache server but don’t know how to install this kind of package.)
    Could you help me? I searched a lot on the web but did not find an adequate solution, and I don’t want to make mistakes and cause trouble for my system.

  37. nice article… it would be better if you have made a demo

  38. It works well , i am creating a mobile handler that will open PDF files even in mobile phones without downloading it actually.

    I tested the code by installing XPDF and open files like this

    Thanks again.

  39. I tried this code
    $c=shell_exec('pdftotext '.$file_name);
    header('Content-Type: text/plain');
    echo $c;

  40. oaattia

    How can i install antiword and XPDF on my vps server, my VPS server runs redhat
    please any help appreciated …

  41. Raheem Shaik

    hii.. i want to know how to install those PACKAGES in linux.. please can anyone tell me the steps to install it..

  42. My hosting has disabled the ‘shell_exec’ function. What can I do? Thank you.

  43. Dolar

    Please tell me, does it work on Windows?
    Please reply!

    Thanks in advance!
