Read PDF and Word DOC Files Using PHP
One of my customers has an insane amount of PDF and Microsoft Word DOC files on their website. These files are core to their online services, so they aren't just clutter sitting on the server. My customer wanted their website's search engine (Sphider) to read these PDF and DOC files so that their clients could find the documents they needed without clicking through a bunch of summary pages. I got it working, so let me show you how to read PDF and DOC files using PHP.
Reading PDF Files
To read PDF files, you will need to install the Xpdf package, which includes the "pdftotext" command-line tool. Once pdftotext is installed, run the following PHP statement to get the PDF text (adjust the path to wherever pdftotext lives on your server):
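$content = shell_exec('/usr/local/bin/pdftotext '.escapeshellarg($filename).' -'); // escapeshellarg() protects against spaces and shell metacharacters in the filename; the trailing dash sends the text to stdout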
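If you want something a little more defensive than a one-liner, here is a minimal sketch of a wrapper function. The function name pdf_to_text() and the default binary path are my own assumptions for illustration; the path in particular varies from server to server. It checks that the file exists before shelling out and returns null when nothing comes back.

```php
<?php
// Hypothetical helper (not part of Xpdf or Sphider): wraps the pdftotext call.
// The default binary path is an assumption; change it to match your server.
function pdf_to_text(string $filename, string $binary = '/usr/local/bin/pdftotext'): ?string
{
    // Bail out early if the file is missing or unreadable.
    if (!is_file($filename) || !is_readable($filename)) {
        return null;
    }

    // Quote both the binary path and the filename so spaces and shell
    // metacharacters are passed through literally.
    $command = escapeshellarg($binary) . ' ' . escapeshellarg($filename) . ' -';

    // shell_exec() returns a string on success, and null or false when the
    // command fails or produces no output.
    $output = shell_exec($command);

    return is_string($output) ? $output : null;
}
```

You would call it as $content = pdf_to_text($filename); and treat a null result as "could not extract text" before handing the content to the indexer.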
Reading DOC Files
Like the PDF example above, you'll need to install another package, this one called Antiword. Here's the code to grab the Word DOC content:
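$content = shell_exec('/usr/local/bin/antiword '.escapeshellarg($filename)); // escapeshellarg() protects against spaces and shell metacharacters in the filename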
The above code does NOT read DOCX files and, by design, does not preserve formatting. There are other libraries that will preserve formatting, but in our case we just want to get at the text.
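Since the indexer will encounter both file types, it can be handy to pick the right tool by file extension. The following is only a sketch under a few assumptions: the function names extractor_command() and read_document() are hypothetical, the binary paths are the ones used above and may differ on your server, and the quoting shown is what escapeshellarg() produces on Unix-like systems. Anything that is not .pdf or .doc (including .docx) is skipped.

```php
<?php
// Hypothetical helper: builds the shell command for a given file, or returns
// null when the extension is not one we know how to read.
function extractor_command(string $filename): ?string
{
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    $argument  = escapeshellarg($filename);

    switch ($extension) {
        case 'pdf':
            // Trailing dash tells pdftotext to write to stdout.
            return '/usr/local/bin/pdftotext ' . $argument . ' -';
        case 'doc':
            return '/usr/local/bin/antiword ' . $argument;
        default:
            // Unsupported types, including DOCX, are skipped.
            return null;
    }
}

// Hypothetical helper: returns the plain text of a PDF or DOC file,
// or null if the file is missing, unsupported, or produced no output.
function read_document(string $filename): ?string
{
    $command = extractor_command($filename);
    if ($command === null || !is_file($filename)) {
        return null;
    }

    $output = shell_exec($command);

    return is_string($output) ? $output : null;
}
```

With this in place, the indexing loop can call read_document($filename) for every file and simply skip the ones that come back null.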
A special thank you to Jeremy Parrish for his help and insight with this task.