Read PDF and Word DOC Files Using PHP
Written by David Walsh on Friday, January 2, 2009
One of my customers has an insane amount of PDF and Microsoft Word DOC files on their website. The files are core to their online services, so it’s not as though they’re garbage files sitting on the server. My customer wanted their website’s search engine (Sphider) to read these PDF and DOC files so that their clients could get at the documents they needed without going through a bunch of summary pages first. I was successful in the task, so let me show you how to read PDF and DOC files using PHP.
Reading PDF Files
To read PDF files, you will need to install the XPDF package, which includes “pdftotext.” Once you have XPDF/pdftotext installed, you run the following PHP statement to get the PDF text:
$content = shell_exec('/usr/local/bin/pdftotext '.escapeshellarg($filename).' -'); // dash at the end sends the output to stdout
Reading DOC Files
Like the PDF example above, you’ll need to download another package. This package is called Antiword. Here’s the code to grab the Word DOC content:
$content = shell_exec('/usr/local/bin/antiword '.escapeshellarg($filename));
The above code does NOT read DOCX files and purposely does not preserve formatting. There are other libraries that will preserve formatting, but in our case we just want to get at the text.
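The two calls above could be wrapped in one small helper. What follows is only a sketch under a few assumptions: the function names build_extract_command() and extract_text() are hypothetical, the binary paths are the ones used in this post and may differ on your server, and choosing the tool by file extension is my own simplification. It passes the filename through escapeshellarg() so that spaces or shell metacharacters in a path cannot change the command that gets run.

```php
<?php
// Hypothetical helpers for illustration only; the names, the paths, and the
// dispatch-by-extension logic are assumptions, not a definitive implementation.

/**
 * Build the shell command for a file, or return null when the
 * extension is not one of the supported types.
 */
function build_extract_command(string $filename): ?string
{
    // Binary locations as used in this post; adjust them for your server.
    $tools = [
        'pdf' => '/usr/local/bin/pdftotext %s -', // trailing dash: write text to stdout
        'doc' => '/usr/local/bin/antiword %s',
    ];

    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    if (!isset($tools[$ext])) {
        return null;
    }

    // escapeshellarg() quotes the path so it is passed as a single argument.
    return sprintf($tools[$ext], escapeshellarg($filename));
}

/**
 * Return the extracted text, or null when the type is unsupported,
 * the file is not readable, or the command produced no output.
 */
function extract_text(string $filename): ?string
{
    $command = build_extract_command($filename);
    if ($command === null || !is_readable($filename)) {
        return null;
    }

    $output = shell_exec($command);

    return is_string($output) ? $output : null;
}
```

Calling extract_text('/path/to/file.pdf') would then give you either the text or null, which lets the indexing code skip files it cannot read instead of storing empty content.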
A special thank you to Jeremy Parrish for his help and insight with this task.
Cool. I wonder if there is any solution that doesn’t need extra software, i.e. a simple PHP Class or something. Maybe a future project? Anyway, this solution is pretty elegant and works well if you have full access to the server your page is on.
@Simon Sigurdhsson: I don’t know of a pure-PHP way of doing this. I know that if you’re on Windows/IIS, you can use the COM library. Other than that, these methods are the only ones I know of.
I’ve wanted to read DOC files before but came up with nothing that would work with my hosting, since it’s shared hosting and shell_exec is not allowed.
I tried this approach a long time ago, but it doesn’t work for all PDF versions. Have you tested this with all PDF versions?
This could come in very handy with some of my projects. Thanks!
-Brenelz
Looks good. Shame this requires extra software though. It would be good if you could maybe parse the document into an image.
For PDFs, read all the comments, especially jorromer’s: pdf to text (PHP manual page entry).
Is anyone able to provide a link to a 32 bit binary of the Linux version of antiword? I don’t have shell access to the server I work on.
I have a project that requires searching through .pdf and .doc files, so this will definitely come in handy! Thanks, I hope it works or comes close to doing so…
I think the easiest way to read, generate and convert DOC and DOCX files in PHP is with phpLiveDocx: http://www.phplivedocx.org
This approach certainly saves the potentially very risky practice of shelling out with shell_exec() :-)
Leo
Can anyone please help me out?
I’m currently working on a project in which I have to read PDF, DOC or DOCX files using PHP code on localhost. Is that possible from localhost? If so, please send me the code.
Thank you,
Koushik Ghosh
That is cool, thanks for your post.
You should post more shell tutorials.
:D
cong nguyen
http://www.neoob.com/
check this out – pdf to txt in pure php – http://community.livejournal.com/php/295413.html … back in 2005. leet. :|
That should ease some future stress.. I hope that script is still functional, as I haven’t had a chance to try myself.
I need help with reading Bangla text from a DOC file in PHP. If anyone knows how, please send me the code. It is very urgent.
How would you do this on a web hosted space, assuming that you have no access to services other than those set up for the package you buy, meaning that you can not install stuff on the remote computer?
Can anyone please help me out?
I’m currently working on a project in which I have to read and write DOC or DOCX files using PHP/XML code on localhost. Is that possible from localhost? If so, please send me the code or a URL.
Thank you,
Mindaugas
function openPdf()
{
    var omyFrame = document.getElementById("myFrame");
    omyFrame.style.display = "block";
    omyFrame.src = "myFile.pdf";
}