May 18 2005
Last year I wrote a piece for Uplink on using Xpdf to convert PDF documents into text tables, but that piece focused on Win32 systems. Here’s an adapted guide to installing and using Xpdf on the Mac.
Here’s a question that should have a familiar ring: How do I get text out of a PDF file?
Painfully, if your experience has been anything like mine.
The mere existence of Adobe Acrobat files has been a boon for reporters because governments and agencies everywhere have been able to make documents broadly available over the Internet. It’s hard not to love that.
But if you’ve ever tried getting tables out of a PDF document – and we’ve all tried – the results usually aren’t worth the effort. Until now.
A free command line utility called Xpdf will save you time and aggravation. It will, in most circumstances, enable you to go from PDF to Excel in a matter of seconds, rather than minutes or hours. Did I mention that it’s free?
Xpdf is available as a free download, and it comes in packages for Windows and Linux/Unix. OS X users should download the source code (at this writing the file is xpdf-3.00.tar.gz) to their desktop. That file will expand into a folder labeled xpdf-3.00. Open up the Terminal and type the following (hit return after each step):
cd Desktop
cd xpdf-3.00
./configure
This will take a minute or so. Then type:
make
Again, you’ll wait a few minutes until it finishes, then:
make install
You may have to use “sudo make install” and, when prompted, enter the password for your computer.
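Before moving on, it can help to confirm that the Terminal can actually find the new program. Here is a minimal sketch of such a check; the function name check_pdftotext is my own invention for illustration and is not part of Xpdf.

```shell
# check_pdftotext is a hypothetical helper name, not something Xpdf provides.
# It reports whether the shell can locate a command called pdftotext.
check_pdftotext() {
  if command -v pdftotext >/dev/null 2>&1; then
    echo "pdftotext found: $(command -v pdftotext)"
  else
    echo "pdftotext not found on your PATH" >&2
    return 1
  fi
}
```

Paste the function into the Terminal and run check_pdftotext. If it reports that the command is missing, the install step probably did not finish, or the install location is not on your PATH.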
Once Xpdf is installed, put a PDF file anywhere in your home directory (I usually keep a single folder for this), navigate to that directory in the Terminal using “cd /location of file”, and then type:
pdftotext -layout pdfname.pdf
Depending on the size of the PDF file, your output text file (with the same name as the original, but ending in .txt) will be in the same folder in a matter of seconds.
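If you keep all of your PDFs in one folder, as I do, the same command can be run over every file at once. The following is a rough sketch rather than a polished tool; the function name convert_folder is hypothetical, and it assumes pdftotext is already installed and on your PATH.

```shell
# convert_folder is a hypothetical helper name used for illustration.
# It runs "pdftotext -layout" on every .pdf file in the given folder
# (or in the current folder if none is given).
convert_folder() {
  dir="${1:-.}"
  count=0
  for pdf in "$dir"/*.pdf; do
    # If no PDFs match, the pattern is left unexpanded; skip it.
    [ -e "$pdf" ] || continue
    # With no output name given, pdftotext writes a .txt file
    # alongside the PDF (report.pdf becomes report.txt).
    pdftotext -layout "$pdf" && count=$((count + 1))
  done
  echo "Converted $count file(s) in $dir"
}
```

Something like convert_folder ~/pdfs would then leave one text file next to each PDF in that folder.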
Let’s go through the command line syntax. First, the command “pdftotext” is required for this process, and “pdf2text” won’t work. The “-layout” option tells Xpdf that you want to preserve the layout that the PDF file uses, which keeps the text in those nice, clean tables. And you need to give the full name of the file, extension included (I recommend a single-word name, even though OS X supports filenames with spaces). That’s it.
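If you are stuck with a filename that contains spaces, or you want the text file to have a different name, pdftotext accepts an optional second filename for the output, and quoting the names keeps the shell from splitting them. Here is a small sketch; pdf_to_text is a made-up wrapper name, not an Xpdf command.

```shell
# pdf_to_text is a hypothetical wrapper name used for illustration.
# Usage: pdf_to_text INPUT.pdf [OUTPUT.txt]
pdf_to_text() {
  input="$1"
  # Default output name: the input name with .pdf replaced by .txt
  output="${2:-${input%.pdf}.txt}"
  # The quotes keep a name such as "county budget.pdf" in one piece.
  pdftotext -layout "$input" "$output"
}
```

For example, pdf_to_text "county budget.pdf" would write “county budget.txt”, while pdf_to_text budget.pdf tables.txt would write tables.txt.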
The resulting text file will be the entire text of the PDF, meaning that you may have to wade through pages of text in order to get to your tables. The preservation of the PDF’s layout means that if a page contained two tables side-by-side, that’s the way they will look in the text file, too.
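Two small tricks can cut down on the wading. pdftotext has -f and -l options for the first and last page to convert, so you can pull out only the pages that hold your tables. And because the layout is held together with runs of spaces, turning those runs into tabs often gives you something Excel will split into columns. The sketch below shows both; the function names are my own, and the tab conversion assumes columns are separated by at least two spaces, which will not hold for every document.

```shell
# Both function names are hypothetical, chosen for illustration.

# Convert only pages FIRST through LAST of a PDF.
# Usage: pdf_pages_to_text FILE.pdf FIRST LAST
pdf_pages_to_text() {
  pdftotext -layout -f "$2" -l "$3" "$1" "${1%.pdf}-p$2-$3.txt"
}

# Trim each line, then turn every run of two or more spaces into one tab.
# Single spaces (as in "Smith J.") are left alone.
# Usage: layout_to_tsv FILE.txt > FILE.tsv
layout_to_tsv() {
  awk '{ sub(/^ +/, ""); sub(/ +$/, ""); gsub(/  +/, "\t"); print }' "$@"
}
```

For instance, pdf_pages_to_text report.pdf 3 5 would write report-p3-5.txt, and layout_to_tsv report-p3-5.txt > report.tsv would produce a tab-delimited file to open in Excel. Side-by-side tables will still land in the same rows, so expect to do some cleanup by hand.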
Xpdf doesn’t work in all instances; specifically, it won’t convert PDFs that have been locked by their creators. Don’t bother asking the author of Xpdf, either, as he has posted a message on his Web site indicating that he will not add that ability.
But for most government documents, Xpdf can be a huge time-saver and allow you to spend more time actually analyzing data rather than trying to free it from the confines of the PDF.