JISCMail - CCP4BB Archives

Depends on what you mean by "meaningful". Obviously, you can take a screenshot of the PDF page and import that into Word as one big photo, but I imagine that is not what you want to do?

If you mean to reverse the "convert to PDF" process and get back the original Word document you used to make the PDF, then I'm afraid that is basically impossible. A lot of information is lost in the "printing" process. PDF is actually a type-setting language, so it contains things like "there is a letter 'a' at coordinates x,y on page 1", with no record of paragraphs, styles or tables. Or, worse, some PDFs are actually just scanned images with no text in them at all. For these, all you can do is print out the PDF and then scan it back in with an OCR-capable scanner. Sounds awful, but modern OCR is actually surprisingly smart.

I generally use the poppler-utils programs (pdftotext, pdfimages, pdftops, pdftohtml, etc.), plus the separately packaged pstotext, to extract computer-readable meaning from PDF documents. I find they tend to do a pretty good job at figuring out the formatting. There are various options and flags to choose from, and sometimes you get better results using an intermediate format (pdftops followed by pstotext), but your mileage may vary. Remember, since PDF is a type-setting format, converting it into something else is essentially an image-recognition problem. Usually, you will get a word here or there that is split into two, or sometimes two words get stuck together. So, be sure to spell check.
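
For what it's worth, here is a minimal sketch of how one might script the pdftotext step from Python. It assumes poppler-utils is installed and on your PATH. The function names (pdftotext_command, extract_text, dehyphenate) are made up for illustration and are not part of poppler. The only pdftotext features used are the -layout flag and "-" as the output file name (meaning standard output). The dehyphenate helper is a crude assumption of mine: it glues back together words broken by a hyphen at the end of a line, and it will also wrongly join genuinely hyphenated compounds that happen to wrap, so it does not replace a spell check.

```python
import re
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for poppler's pdftotext, writing to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    cmd += [pdf_path, "-"]     # "-" as the output file means standard output
    return cmd


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; is poppler-utils installed?")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def dehyphenate(text):
    """Rejoin words split by a hyphen at a line break (crude heuristic)."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```

You would call it as something like dehyphenate(extract_text("paper.pdf")), where "paper.pdf" is a placeholder file name. If the result is empty or only whitespace, the PDF is probably one of the scanned-image kind and you are back to OCR.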

But yes, perhaps the easiest thing to do is load up the PDF in Acrobat, copy-and-paste all the text into Word, and then copy-and-paste each figure. Then spend some time re-formatting, etc. Or, alternatively, you could spend some time trying to find the original Word document. The latter is somewhat easier to automate.

Sorry!

-James Holton
MAD Scientist

On 9/1/2012 3:48 AM, Rex Palmer wrote: