Read PDF and Word DOC Files Using PHP

Written by David Walsh on Friday, January 2, 2009


One of my customers has an insane amount of PDF and Microsoft Word DOC files on their website. The files are core to their online services, so it’s not as though they’re garbage sitting on the server. My customer wanted their website’s search engine (Sphider) to read these PDF and DOC files so that their clients could get to the documents they needed without wading through a bunch of summary pages. I was successful in the task, so let me show you how to read PDF and DOC files using PHP.

Reading PDF Files

To read PDF files, you will need to install the XPDF package, which includes “pdftotext.” Once you have XPDF/pdftotext installed, you run the following PHP statement to get the PDF text:

$content = shell_exec('/usr/local/bin/pdftotext '.escapeshellarg($filename).' -'); //escapeshellarg() quotes the filename for the shell; the dash at the end sends the text to stdout
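Because the filename ends up inside a shell command, it is worth quoting it and handling the case where nothing comes back. Here is a minimal sketch of a wrapper, assuming pdftotext lives at the path used above. The function names buildPdfToTextCommand and readPdfText are made up for illustration and are not part of XPDF or PHP:

```php
<?php
// Hypothetical helper: builds the pdftotext command line for one file.
// escapeshellarg() quotes the filename so spaces or shell metacharacters
// in it cannot change the command that gets run.
function buildPdfToTextCommand(string $filename, string $binary = '/usr/local/bin/pdftotext'): string
{
    // The trailing dash tells pdftotext to write the text to stdout.
    return $binary . ' ' . escapeshellarg($filename) . ' -';
}

// Hypothetical helper: returns the PDF text, or null if the file is not
// readable or the command produced no output.
function readPdfText(string $filename): ?string
{
    if (!is_readable($filename)) {
        return null;
    }
    $output = shell_exec(buildPdfToTextCommand($filename));
    // shell_exec() gives back a non-string value when there is nothing to return.
    return is_string($output) ? $output : null;
}
```

Calling readPdfText('/path/to/file.pdf') would then give you either the extracted text or null, which is easier for indexing code to check than a raw shell_exec() result.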

Reading DOC Files

Like the PDF example above, you’ll need to download another package. This package is called Antiword. Here’s the code to grab the Word DOC content:

$content = shell_exec('/usr/local/bin/antiword '.escapeshellarg($filename)); //escapeshellarg() quotes the filename for the shell

The above code does NOT read DOCX files and does not (and purposely so) preserve formatting. There are other libraries that will preserve formatting but in our case, we just want to get at the text.
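If you are feeding both file types to a search indexer, the two commands can sit behind one function that picks the right tool by file extension. This is a rough sketch rather than the code used on the customer’s site. The names extractorCommandFor and readDocumentText are hypothetical, and the binary paths are simply the ones from the examples above:

```php
<?php
// Hypothetical helper: picks the text-extraction command from the file
// extension. Returns null for unsupported types, including DOCX, which
// Antiword does not read.
function extractorCommandFor(string $filename): ?string
{
    $arg = escapeshellarg($filename); // quote the filename for the shell
    switch (strtolower(pathinfo($filename, PATHINFO_EXTENSION))) {
        case 'pdf':
            return '/usr/local/bin/pdftotext ' . $arg . ' -';
        case 'doc':
            return '/usr/local/bin/antiword ' . $arg;
        default:
            return null;
    }
}

// Hypothetical helper: returns the document text, or null when the type
// is unsupported or the command yields no output.
function readDocumentText(string $filename): ?string
{
    $command = extractorCommandFor($filename);
    if ($command === null) {
        return null;
    }
    $output = shell_exec($command);
    return is_string($output) ? $output : null;
}
```

The extension check is case-insensitive, so report.PDF and report.pdf are treated the same way.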

A special thank you to Jeremy Parrish for his help and insight with this task.


Epic Discussion

January 02

Cool. I wonder if there is any solution that doesn’t need extra software, i.e. a simple PHP Class or something. Maybe a future project? Anyway, this solution is pretty elegant and works well if you have full access to the server your page is on.

January 02
david says:

@Simon Sigurdhsson: I don’t know of a pure-PHP way of doing this. I know that if you’re on Windows/IIS, you can use the COM library. Other than that, these methods are the only ones I know of.

January 02

I’ve wanted to read DOC files before but came out with nothing that would work with my hosting since it’s shared hosting and there is no allowance for shell_exec.

January 02

I tried this approach a long time ago, but it doesn’t work for all PDF versions. Have you tested this with all PDF versions?

January 02
Brenley says:

This could come in very handy with some of my projects. Thanks!

-Brenelz

January 13

Looks good. Shame this requires extra software though. It would be good if you could maybe parse the document into an image.

January 14
Simon says:

For PDFs, read all the comments, especially jorromer’s:

pdf to text. php manual page entry

February 01
Max says:

Is anyone able to provide a link to a 32 bit binary of the Linux version of antiword? I don’t have shell access to the server I work on.

February 03
lame says:

I have a project which entails searching through .pdf and .doc files, so this development will definitely come in handy! Thanks, hope it works or comes close to doing so…

February 05
Leo Bonnafé says:

I think the easiest way to read, generate and convert DOC and DOCX files in PHP is with phpLiveDocx: http://www.phplivedocx.org

This approach certainly saves the potentially very risky practice of shelling out with shell_exec() :-)

Leo

March 03
Koushik Ghosh says:

Anyone please help me out.
Currently I’m working on a project for which I have to read PDF, DOC or DOCX files using PHP code on localhost. Is that possible on localhost? If so, please send me the code.

Thank you,
Koushik Ghosh

April 14

That is cool, thanks for your post.

I would like you to post more shell tutorials.

:D

cong nguyen

http://www.neoob.com/

April 18

check this out – pdf to txt in pure php – http://community.livejournal.com/php/295413.html … back in 2005. leet. :|

That should ease some future stress.. I hope that script is still functional, as I haven’t had a chance to try myself.

June 17
Rony says:

I need help with reading Bangla text from a DOC file in PHP. If anyone knows how, please send me the code. It is very urgent.

June 26
\\.\ says:

How would you do this on hosted web space, assuming that you have no access to services other than those set up for the package you buy, meaning that you cannot install anything on the remote computer?

July 07
Mindaugas says:

Anyone please help me out.
Currently I’m working on a project for which I have to read and write DOC or DOCX files using PHP and XML code on localhost. Is that possible on localhost? If so, please send me the code or a URL.

Thank you,
Mindaugas

September 13
ochi says:

function openPdf()
{
    var omyFrame = document.getElementById("myFrame");
    omyFrame.style.display = "block";
    omyFrame.src = "myFile.pdf";
}

© David Walsh 2007-2010.