Extracting text with formatting from PDF


nekokami
01-24-2007, 09:20 PM
Hi folks,

I have a PDF file that I'd like to get the text out of while retaining the formatting. The file is too large to simply select all text and copy/paste. (I get a memory error when I try to do this.) Besides, I'd like to not take the page numbers, since they won't be relevant on the device I'll be reading on (eBw 1150). The ABC PDF converter gets the text, but loses the formatting. I can't afford a full copy of Acrobat. Other extractors I've tried seem to assume one has Word installed (I don't).

I usually use a Mac, but I do have a PC available. Can anyone suggest a good, preferably low-cost program to convert PDF to something more portable, e.g. HTML or RTF? (I guess I could use the trial of Acrobat Professional for now, but I'd like a more long-term solution.)

Thanks!

PS - I've also tried TextLightning and Trapeze on the Mac. Neither worked, possibly because they didn't like the font. TextLightning kept crashing, and the limited output it did manage to provide didn't parse. It looked like raw PDF code. Trapeze just produced junk.

jęd
01-25-2007, 03:10 AM
I would try the various command-line converters for this, or write a Perl/Java/PHP program...

Alexander Turcic
01-25-2007, 05:42 AM
Abbyy Transformer (http://www.pdftransformer.com/) works well too, but it's payware. They have a demo you can try.

nekokami
01-25-2007, 08:57 AM
Thanks, I'll try Abbyy Transformer, but $99 is too steep for me to use it once I'm past the demo.

It turns out that there is an additional wrinkle. Text formatting (italics and some other changes) was implemented using different fonts, rather than font styles. Copy and paste doesn't seem to preserve these different fonts, so I lose formatting even in the copy-paste-to-Word method.
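If an extractor can at least report which font each run of text uses, the font-as-style problem could in principle be undone with a lookup table. This is only a minimal sketch: the font names and the `(font_name, text)` run format are made up for illustration and don't come from any particular tool.

```python
from html import escape

# Hypothetical mapping from embedded font names to the style each one stands for.
# These names are invented for the example; a real file needs its own table.
FONT_STYLES = {
    "BodyFont": None,
    "BodyFont-Ital": "i",
    "BodyFont-Bold": "b",
}

def runs_to_html(runs):
    """Turn (font_name, text) runs into HTML, wrapping styled runs in tags."""
    parts = []
    for font, text in runs:
        tag = FONT_STYLES.get(font)  # unknown fonts fall back to plain text
        safe = escape(text)
        parts.append(f"<{tag}>{safe}</{tag}>" if tag else safe)
    return "".join(parts)

runs = [("BodyFont", "She read "), ("BodyFont-Ital", "Emma"), ("BodyFont", " twice.")]
print(runs_to_html(runs))
```

The table would have to be built by hand, by checking which font each italic or bold passage actually uses.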

@jęd, do you recommend any particular command-line converter? I write in Perl and (to a lesser extent) PHP, but I really don't have time to write code right now.

nekokami
01-25-2007, 06:42 PM
The plot thickens further: I have a copy of Readiris OCR, so I tried pulling this PDF file in to see if I could just OCR it. All I see in Readiris is boxes instead of letters. I tried a different PDF file and it worked fine (well, mostly fine--usual OCR type errors). Note that in the "thumbnail preview" mode on the Mac in the Finder, I also see boxes instead of text. Also, in the "Preview" application on the Mac I see boxes. (This isn't surprising, as I strongly suspect these two bits of software use the same code.)

Does anyone here know enough about PDF to guess what's happening? Again, when I look at the fonts (in Document Properties in Acrobat Reader) I see pretty weird names, e.g. "TTE1D974C0t00 (Embedded Subset)". It's a TrueType font, but the encoding is listed as "Custom." In files that behave more normally I see recognizable font names (variations on Arial or Times New Roman) and encoding of "Ansi". Does anyone know how to work around this problem? Maybe I'm going to need the full version of Acrobat after all....

pclewis
01-25-2007, 10:35 PM
Here is a cheap program ($12.95) that works OK, though some rework is always required. PDF is an output format, so it is what it is.

http://www.thebeatlesforever.com/processtext/abcpdf.html

nekokami
01-26-2007, 07:37 AM
@pclewis, thanks. As I mentioned above, I've tried that one already. It does work, it just loses formatting because the text is formatted using different fonts, rather than text styles. I think I'm probably going to have to find someone with a copy of Acrobat Standard (or Pro) that I can use to change the fonts.

Thanks anyway

nekokami
01-28-2007, 09:59 AM
In case anyone is curious, I did eventually determine that the problem was the custom encoding in the PDF, and I solved the problem by finding an alternate (HTML) source of the file. :/ Yet another good example of why PDF is not a good format for source files. (I'm planning to spend today learning LaTeX.)

wallcraft
01-28-2007, 10:18 PM
Foxit (http://www.foxitsoftware.com/products/) Reader Pro Pack might be a possibility. I only have the free Reader, which seems to maintain formatting in its text view. The Pro Pack ($39) is needed for full-file text conversion.

nekokami
01-29-2007, 09:18 AM
Looks to me like Foxit's output is plain text, but at least the editor might help if I have to sort out a problem like this in the future.

Meanwhile, the HTML file I have doesn't seem to want to import using the eBw Librarian software. It spits out warnings about unknown fonts, then hits a fatal error. Oddly enough, two very similar files imported just fine. Fortunately, HTML converts to RTF fairly easily, so I can try that next.
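For simple markup, the HTML-to-RTF step can be sketched in a few lines of Python. This is a rough illustration, not a full converter: it assumes the HTML uses only paragraphs plus italic and bold tags, and the `SimpleRtfWriter` class and `html_to_rtf` function are names invented for the example.

```python
from html.parser import HTMLParser

class SimpleRtfWriter(HTMLParser):
    """Convert a small HTML subset (p, i/em, b/strong) to RTF control words."""
    ON = {"i": "\\i ", "em": "\\i ", "b": "\\b ", "strong": "\\b "}
    OFF = {"i": "\\i0 ", "em": "\\i0 ", "b": "\\b0 ", "strong": "\\b0 "}

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Tags outside the supported subset are silently dropped.
        self.out.append(self.ON.get(tag, ""))

    def handle_endtag(self, tag):
        if tag == "p":
            self.out.append("\\par\n")
        else:
            self.out.append(self.OFF.get(tag, ""))

    def handle_data(self, data):
        for ch in data:
            if ch in "\\{}":
                # These three characters are special in RTF and need escaping.
                self.out.append("\\" + ch)
            elif ord(ch) < 128:
                self.out.append(ch)
            else:
                # Non-ASCII: emit each UTF-16 unit as a signed \uN escape,
                # followed by '?' as the fallback character.
                units = ch.encode("utf-16-le")
                for k in range(0, len(units), 2):
                    n = int.from_bytes(units[k:k + 2], "little", signed=True)
                    self.out.append(f"\\u{n}?")

def html_to_rtf(html):
    """Return a complete RTF document string for the given HTML fragment."""
    writer = SimpleRtfWriter()
    writer.feed(html)
    writer.close()
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n"
    return header + "".join(writer.out) + "}"

print(html_to_rtf("<p>A <i>caf\u00e9</i> scene.</p>"))
```

Anything beyond this subset (tables, images, stylesheets) would need a real conversion tool.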

jęd
01-29-2007, 10:27 AM
Pdftohtml...?
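A minimal sketch of driving pdftohtml from a script, assuming it is installed and on the PATH. The file names are placeholders. `-noframes` asks for a single HTML file and `-i` skips images.

```python
import shutil
import subprocess

def build_pdftohtml_command(pdf_path, out_base):
    """Build a pdftohtml command line: one HTML file, images skipped."""
    return ["pdftohtml", "-noframes", "-i", pdf_path, out_base]

def convert(pdf_path, out_base):
    """Run pdftohtml if it is on PATH; return True when it exits cleanly."""
    if shutil.which("pdftohtml") is None:
        return False  # tool not installed, nothing was converted
    result = subprocess.run(build_pdftohtml_command(pdf_path, out_base))
    return result.returncode == 0

# "book.pdf" and "book" are placeholder names for the example.
print(" ".join(build_pdftohtml_command("book.pdf", "book")))
```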

nekokami
01-29-2007, 12:36 PM
Thanks. I have come across so many links that I'd given up checking them all. I'm especially happy to see that it's available for OS X. :)

Update: pdftohtml also does not successfully convert this file. It is confused by the custom encoding, presumably, and outputs strings of punctuation instead of text.
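One plausible explanation, sketched as a toy model: a subset font with a custom encoding can assign its own codes to glyphs, and if the file carries no map from those codes back to characters, an extractor can only read the codes as ordinary ASCII. The numbering scheme below (codes handed out from 0x21 in order of first use) is an assumption for illustration, not something confirmed about this file.

```python
def subset_encode(text, first_code=0x21):
    """Assign each distinct character the next free code, as a subset font might."""
    table = {}
    codes = []
    for ch in text:
        if ch not in table:
            table[ch] = first_code + len(table)
        codes.append(table[ch])
    # Also return the reverse map, which the problem file apparently lacks.
    return codes, {code: ch for ch, code in table.items()}

def naive_extract(codes):
    """What an extractor sees if it treats the codes as ordinary ASCII."""
    return "".join(chr(c) for c in codes)

def mapped_extract(codes, to_unicode):
    """What it could recover if the file carried a code-to-character map."""
    return "".join(to_unicode[c] for c in codes)

codes, to_unicode = subset_encode("hello world")
print(naive_extract(codes))               # punctuation, not words
print(mapped_extract(codes, to_unicode))  # the original text
```

The page still displays correctly in a viewer because the glyph shapes are intact. Only the route from code to character is missing.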

sartori
02-27-2007, 07:04 PM
Nekokami -

I have one way that saves most of the formatting: if you have a Gmail account, email the PDF to yourself as an attachment. Then view the message and click "View as HTML" on the attachment. Gmail does split the PDF into pages, but it does keep italics etc.

I figure it should be easy to remove the page headers and breaks from the html source.
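A rough sketch of that cleanup in Python. I haven't shown what the page headers and breaks look like in Gmail's HTML, so both patterns below are placeholders to adapt after looking at the actual source.

```python
import re

# Assumed markers: a "Page N" div at the top of each page and an <hr> between
# pages. Both are guesses to replace once the real markup has been inspected.
PAGE_HEADER = re.compile(r"<div[^>]*>\s*Page\s+\d+\s*</div>\s*", re.IGNORECASE)
PAGE_BREAK = re.compile(r"<hr[^>]*>\s*", re.IGNORECASE)

def strip_page_furniture(html):
    """Remove per-page header blocks and page-break rules, keeping inline tags."""
    html = PAGE_HEADER.sub("", html)
    return PAGE_BREAK.sub("", html)

sample = (
    "<div>Page 1</div>\n<p>First <i>part</i></p>\n"
    "<hr>\n<div>Page 2</div>\n<p>second part.</p>\n"
)
print(strip_page_furniture(sample))
```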

Hope it helps,

Sartori

nekokami
02-27-2007, 07:16 PM
Thanks, that's a good trick. I'll have to try it.

I eventually found a version of the file that I could reformat, and I have done so, but I'll have to create a Gmail account and try this to see if it would work on files with custom encoding.

eimert
02-28-2007, 05:17 AM
Hi,
you could also use pdaConverter (http://www.jakewalk.de/wiki/pmwiki.php?n=MySoftware.PdaConverter). It is freeware (Windows). I used it quite often for converting PDF files for reading on my Palm - ages ago :-)
Cheers,
Klaus

nekokami
02-28-2007, 05:38 PM
Thanks... though I don't usually use Windows and wasn't planning on reading this on a Palm or other PDA....

eimert
03-01-2007, 05:27 AM
No, you misunderstood (or I didn't explain it well enough - English is not my first language). I should have said that I used it for converting pdf files to txt or html (for reading on my Palm), NOT to prc or pdb (for some reason I never liked Palm's eReader). So, you could use it for your purpose, too. Only Windows seems to be a problem there.

Cheers,
Klaus

NatCh
03-01-2007, 09:43 AM
Oh. I had looked at the web page and decided that it only outputs PDB/PRC formats. So that's not correct, then? What formats does it output besides TXT and HTML (those would probably do well enough, but I'm just checking if there are others). :)

nekokami
03-01-2007, 12:00 PM
Sorry Klaus, your English is fine! But when I looked at the website, all I saw were Palm formats as output. That was the source of my confusion. Thanks for the pointer!

micro
03-04-2007, 06:25 AM
For the Mac, have you tried the obvious: Preview?

I'd try cropping the PDF using the select tool and then Crop (under Tools). Do this to remove headers, footers, page numbers, etc. You should be able to reduce this to just your content.

Then change tools and use the text tool, do a select all, and bingo, formatted text is in your clipboard. Paste into whatever app you want and it should work.

From here you can then massage it to however you want it to look.

This is how I do it using Mac OS X 10.4.8, anyhow.

good luck and if you need help, just ask,

micro

nekokami
03-04-2007, 08:45 PM
micro, thanks for the suggestions. See posts #1 and #4 for why the copy/paste solution didn't work. See #5 for my experience with Preview.

Recently I realized that we have Acrobat Professional at my new job. I've already "solved" my problem by obtaining a differently formatted copy of this file, but I may try loading the original up into Acrobat Professional to see if I can get around the custom encoding that way.

yvanleterrible
03-05-2007, 08:27 AM
Did you try the 'save as text' feature in Acrobat Reader? (I haven't read all the posts)

nekokami
03-05-2007, 09:18 AM
"Save as text" produces junk output. :( It's that *^&)#% custom encoding.