01-24-2007, 09:20 PM | #1 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Extracting text with formatting from PDF
Hi folks,
I have a PDF file that I'd like to get the text out of while retaining the formatting. The file is too large to simply select all text and copy/paste. (I get a memory error when I try to do this.) Besides, I'd like to not take the page numbers, since they won't be relevant on the device I'll be reading on (eBw 1150). The ABC PDF converter gets the text, but loses the formatting. I can't afford a full copy of Acrobat. Other extractors I've tried seem to assume one has Word installed (I don't). I usually use a Mac, but I do have a PC available.
Can anyone suggest a good, preferably low-cost program to convert PDF to something more portable, e.g. HTML or RTF? (I guess I could use the trial of Acrobat Professional for now, but I'd like a more long-term solution.)
Thanks!
PS - I've also tried TextLightning and Trapeze on the Mac. Neither worked, possibly because they didn't like the font. TextLightning kept crashing, and the limited output it did manage to provide didn't parse. It looked like raw PDF code. Trapeze just produced junk.
Last edited by nekokami; 01-24-2007 at 09:31 PM. Reason: TextLightning and Trapeze |
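For the page-number part of the question, a small post-processing pass over the extracted text may be enough. This is a minimal sketch, not a tested tool: it assumes each page number sits on a line of its own, either bare ("12") or in a simple "Page 12" or "- 12 -" form. The function name and the pattern are illustrative choices, and a real file may need a different pattern.

```python
import re

# Assumption: a page number is alone on its line, e.g. "12", "Page 12", "- 12 -".
PAGE_NUMBER_LINE = re.compile(r"^\s*(?:page\s+)?-?\s*\d{1,4}\s*-?\s*$", re.IGNORECASE)

def strip_page_numbers(text: str) -> str:
    """Drop lines that contain nothing but a page number."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept)

sample = "Chapter One\nIt was a dark night.\n12\nThe rain fell.\nPage 13\nThe end."
print(strip_page_numbers(sample))
```

Note that any line consisting only of a short number (a year used as a heading, say) would be dropped too, so the output is worth a quick skim.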
01-25-2007, 03:10 AM | #2 |
Evangelist
Posts: 458
Karma: 293
Join Date: May 2006
|
I would try the various command line converters for this, or write a perl/java/php program...
|
01-25-2007, 05:42 AM | #3 |
Fully Converged
Posts: 18,170
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
|
Abbyy Transformer works well too, but it's payware. They have a demo you can try.
|
01-25-2007, 08:57 AM | #4 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Thanks, I'll try Abbyy Transformer, but $99 is too steep for me to use it once I'm past the demo.
It turns out that there is an additional wrinkle. Text formatting (italics and some other changes) was implemented using different fonts, rather than font styles. Copy and paste doesn't seem to preserve these different fonts, so I lose formatting even in the copy-paste-to-Word method.
@jęd, do you recommend any particular command-line converter? I write in perl and (to a lesser extent) php, but I really don't have time to write code right now. |
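If an extractor can report which font each run of text uses, formatting that was done with separate fonts can in principle be recovered by mapping font names back to styles. The sketch below only shows that mapping step. It assumes the text is already available as `(font_name, text)` pairs, and the font names in `FONT_STYLES` are made up for illustration; a real file would need its own table, built by checking which font the italic or bold passages use.

```python
from html import escape

# Hypothetical font names; replace with the names found in the actual PDF.
FONT_STYLES = {
    "TTE_BODY": None,    # regular body text, no tag
    "TTE_ITALIC": "i",
    "TTE_BOLD": "b",
}

def runs_to_html(runs):
    """Turn (font_name, text) runs into HTML, wrapping styled runs in tags."""
    parts = []
    for font, text in runs:
        tag = FONT_STYLES.get(font)  # unknown fonts fall back to plain text
        safe = escape(text)
        parts.append(f"<{tag}>{safe}</{tag}>" if tag else safe)
    return "".join(parts)

runs = [("TTE_BODY", "She read "), ("TTE_ITALIC", "Dune"), ("TTE_BODY", " twice.")]
print(runs_to_html(runs))
```

Getting the runs out of the PDF in the first place is the hard part and is not covered here.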
01-25-2007, 06:42 PM | #5 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
The plot thickens further: I have a copy of Readiris OCR, so I tried pulling this PDF file in to see if I could just OCR it. All I see in Readiris is boxes instead of letters. I tried a different PDF file and it worked fine (well, mostly fine--usual OCR type errors). Note that in the "thumbnail preview" mode on the Mac in the Finder, I also see boxes instead of text. Also, in the "Preview" application on the Mac I see boxes. (This isn't surprising, as I strongly suspect these two bits of software use the same code.)
Does anyone here know enough about PDF to guess what's happening? Again, when I look at the fonts (in Document Properties in Acrobat Reader) I see pretty weird names, e.g. "TTE1D974C0t00 (Embedded Subset)". It's a TrueType font, but the encoding is listed as "Custom." In files that behave more normally I see recognizable font names (variations on Arial or Times New Roman) and encoding of "Ansi". Does anyone know how to work around this problem? Maybe I'm going to need the full version of Acrobat after all.... |
01-25-2007, 10:35 PM | #6 |
Enthusiast
Posts: 32
Karma: 14644
Join Date: Jan 2007
Location: Texas
Device: Kobo Aura One
|
PDF Converter
Here is a cheap program ($12.95) that works OK. Some rework is always required. PDF is an output format, so it is what it is.
http://www.thebeatlesforever.com/pro...xt/abcpdf.html |
01-26-2007, 07:37 AM | #7 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
@pclewis, thanks. As I mentioned above, I've tried that one already. It does work, it just loses formatting because the text is formatted using different fonts, rather than text styles. I think I'm probably going to have to find someone with a copy of Acrobat Standard (or Pro) that I can use to change the fonts.
Thanks anyway |
01-28-2007, 09:59 AM | #8 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
In case anyone is curious, I did eventually determine that the problem was the custom encoding in the PDF, and I solved the problem by finding an alternate (HTML) source of the file. :/ Yet another good example of why PDF is not a good format for source files. (I'm planning to spend today learning LaTeX.)
|
01-28-2007, 10:18 PM | #9 |
reader
Posts: 6,975
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
|
Foxit Reader Pro Pack might be a possibility. I only have the free Reader, which seems to maintain formatting in its text view. The Pro Pack ($39) is needed for full-file text conversion.
|
01-29-2007, 09:18 AM | #10 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Looks to me like Foxit's output is plain text, but at least the editor might help if I have to sort out a problem like this in the future.
Meanwhile, the HTML file I have doesn't seem to want to import using the eBw Librarian software. It spits out warnings about unknown fonts, then hits a fatal error. Oddly enough, two very similar files imported just fine. Fortunately, HTML converts to RTF fairly easily, so I can try that next. |
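As a rough illustration of the HTML-to-RTF step, here is a minimal sketch using only Python's standard `html.parser`. It handles paragraphs, line breaks, bold and italic, and ignores everything else (fonts, tables, images, styles). The class and function names are illustrative, the header uses a single assumed font, and whether the resulting RTF imports cleanly into the eBw Librarian software is untested.

```python
import re
from html.parser import HTMLParser

RTF_HEADER = r"{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}"

def rtf_escape(text: str) -> str:
    """Escape RTF control characters and encode non-ASCII as \\uN escapes."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "\\{}":
            out.append("\\" + ch)
        elif code < 128:
            out.append(ch)
        elif code <= 0xFFFF:
            # \uN takes a signed 16-bit value; "?" is the fallback character.
            out.append("\\u%d?" % (code - 65536 if code > 32767 else code))
        else:
            out.append("?")  # characters outside the BMP are not handled here
    return "".join(out)

class HtmlToRtf(HTMLParser):
    """Very small converter: paragraphs, line breaks, bold and italic only."""

    OPEN = {"i": r"{\i ", "em": r"{\i ", "b": r"{\b ", "strong": r"{\b "}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.OPEN:
            self.out.append(self.OPEN[tag])
        elif tag == "br":
            self.out.append(r"\line ")

    def handle_endtag(self, tag):
        if tag in self.OPEN:
            self.out.append("}")
        elif tag == "p":
            self.out.append(r"\par" + "\n")

    def handle_data(self, data):
        # Collapse runs of whitespace the way a browser would.
        self.out.append(rtf_escape(re.sub(r"\s+", " ", data)))

def html_to_rtf(html: str) -> str:
    parser = HtmlToRtf()
    parser.feed(html)
    parser.close()
    return RTF_HEADER + "\n" + "".join(parser.out) + "}"

print(html_to_rtf("<p>She read <i>Dune</i> twice.</p>"))
```

Unclosed or badly nested tags in the source HTML would produce unbalanced braces, so messy input should be tidied first.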
01-29-2007, 10:27 AM | #11 | |
Evangelist
Posts: 458
Karma: 293
Join Date: May 2006
|
Quote:
|
|
01-29-2007, 12:36 PM | #12 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Thanks. I have come across so many links that I'd given up checking them all. I'm especially happy to see that it's available for OSX.
Update: pdftohtml also does not successfully convert this file. It is confused by the custom encoding, presumably, and outputs strings of punctuation instead of text.
Last edited by nekokami; 01-29-2007 at 08:14 PM. Reason: didn't work |
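One plausible explanation for the strings of punctuation, offered as an assumption rather than a diagnosis of this particular file: an embedded subset font with a custom encoding can assign its own codes to glyphs, and if the PDF carries no table mapping those codes back to real characters, an extractor that reads the codes as ordinary ASCII gets whatever happens to live at those positions. The sketch below simulates that. The starting code `0x21` and the order-of-first-use numbering are illustrative choices, not something read from the file.

```python
def build_custom_encoding(text: str, first_code: int = 0x21) -> dict:
    """Assign codes to characters in order of first use, as a subset font might."""
    encoding = {}
    for ch in text:
        if ch not in encoding:
            encoding[ch] = first_code + len(encoding)
    return encoding

def encode(text: str, encoding: dict) -> bytes:
    """Store the text as the font's own codes rather than as ASCII."""
    return bytes(encoding[ch] for ch in text)

original = "the cat sat"
encoding = build_custom_encoding(original)
stored = encode(original, encoding)

# What a naive extractor sees: the stored codes read as plain Latin-1/ASCII.
naive = stored.decode("latin-1")

# What an extractor recovers when it has the reverse table.
reverse = {code: ch for ch, code in encoding.items()}
recovered = "".join(reverse[b] for b in stored)

print(naive)
print(recovered)
```

Without the reverse table the text can still be displayed by a viewer that uses the embedded glyphs, but it cannot be copied or converted, which matches the symptoms described above.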
02-27-2007, 07:04 PM | #13 |
Connoisseur
Posts: 54
Karma: 29
Join Date: Oct 2006
|
Nekokami -
I have one way that saves most of the formatting: if you have a Gmail account, email the PDF to yourself as an attachment. Then view the message and click "View as HTML" on the attachment. Gmail does split the PDF into pages, but it does keep italics etc. I figure it should be easy to remove the page headers and breaks from the HTML source.
Hope it helps,
Sartori |
02-27-2007, 07:16 PM | #14 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Thanks, that's a good trick. I'll have to try it.
I eventually found a version of the file that I could reformat, and I have done so, but I'll have to create a Gmail account and try this to see if it would work on files with custom encoding. |
02-28-2007, 05:17 AM | #15 | |
Connoisseur
Posts: 58
Karma: 140
Join Date: Jan 2007
Location: Germany
Device: Dell Axim X50v
|
Quote:
You could also use pdaConverter (http://www.jakewalk.de/wiki/pmwiki.p...e.PdaConverter). It is freeware (Windows). I used it quite often for converting PDF files for reading on my Palm, ages ago :-)
Cheers, Klaus |
|
|