MobileRead Forums > E-Book Formats > PDF

01-24-2007, 09:20 PM   #1
nekokami
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
Extracting text with formatting from PDF

Hi folks,

I have a PDF file that I'd like to get the text out of while retaining the formatting. The file is too large to simply select all text and copy/paste. (I get a memory error when I try to do this.) Besides, I'd like to not take the page numbers, since they won't be relevant on the device I'll be reading on (eBw 1150). The ABC PDF converter gets the text, but loses the formatting. I can't afford a full copy of Acrobat. Other extractors I've tried seem to assume one has Word installed (I don't).

I usually use a Mac, but I do have a PC available. Can anyone suggest a good, preferably low-cost program to convert PDF to something more portable, e.g. HTML or RTF? (I guess I could use the trial of Acrobat Professional for now, but I'd like a more long-term solution.)

Thanks!

PS - I've also tried TextLightning and Trapeze on the Mac. Neither worked, possibly because they didn't like the font. TextLightning kept crashing, and the limited output it did manage to provide didn't parse. It looked like raw PDF code. Trapeze just produced junk.

Last edited by nekokami; 01-24-2007 at 09:31 PM. Reason: TextLightning and Trapeze

01-25-2007, 03:10 AM   #2
jęd
Evangelist
Posts: 458
Karma: 293
Join Date: May 2006
I would try the various command line converters for this, or write a perl/java/php program...

01-25-2007, 05:42 AM   #3
Alexander Turcic
Fully Converged
Posts: 18,163
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
Abbyy Transformer works well too, but it's payware. They have a demo you can try.

01-25-2007, 08:57 AM   #4
nekokami
Thanks, I'll try Abbyy Transformer, but $99 is too steep for me to use it once I'm past the demo.

It turns out that there is an additional wrinkle. Text formatting (italics and some other changes) was implemented using different fonts, rather than font styles. Copy and paste doesn't seem to preserve these different fonts, so I lose formatting even in the copy-paste-to-Word method.

@jęd, do you recommend any particular command-line converter? I write in perl and (to a lesser extent) php, but I really don't have time to write code right now.
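
For anyone facing the same fonts-instead-of-styles wrinkle, here is a minimal Python sketch of one way around it. It assumes some extractor can already report the text as (font name, text) runs, and that you have worked out by eye which embedded font plays the role of italics. The function `runs_to_html`, the `FONT_STYLES` table, and the second font name are hypothetical illustrations, not part of any tool mentioned in this thread.

```python
import html

def runs_to_html(runs, font_styles):
    """Wrap each (font_name, text) run in the HTML tag mapped to its font."""
    parts = []
    for font_name, text in runs:
        tag = font_styles.get(font_name, "")  # unknown fonts stay unstyled
        escaped = html.escape(text)
        parts.append(f"<{tag}>{escaped}</{tag}>" if tag else escaped)
    return "".join(parts)

# Hand-built mapping (assumption): subset font names carry no style hint,
# so the table has to be filled in after looking at the document.
FONT_STYLES = {"TTE1D974C0t00": "", "TTE1D98A10t00": "i"}

runs = [
    ("TTE1D974C0t00", "She read "),
    ("TTE1D98A10t00", "Dune"),
    ("TTE1D974C0t00", " twice."),
]
print(runs_to_html(runs, FONT_STYLES))  # She read <i>Dune</i> twice.
```

The mapping is deliberately explicit: with names like these there is nothing to guess the style from, so a lookup table is probably the most honest approach.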

01-25-2007, 06:42 PM   #5
nekokami
The plot thickens further: I have a copy of Readiris OCR, so I tried pulling this PDF file in to see if I could just OCR it. All I see in Readiris is boxes instead of letters. I tried a different PDF file and it worked fine (well, mostly fine--usual OCR type errors). Note that in the "thumbnail preview" mode on the Mac in the Finder, I also see boxes instead of text. Also, in the "Preview" application on the Mac I see boxes. (This isn't surprising, as I strongly suspect these two bits of software use the same code.)

Does anyone here know enough about PDF to guess what's happening? Again, when I look at the fonts (in Document Properties in Acrobat Reader) I see pretty weird names, e.g. "TTE1D974C0t00 (Embedded Subset)". It's a TrueType font, but the encoding is listed as "Custom." In files that behave more normally I see recognizable font names (variations on Arial or Times New Roman) and an encoding of "Ansi". Does anyone know how to work around this problem? Maybe I'm going to need the full version of Acrobat after all....
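
One plausible explanation for these symptoms, sketched in Python: with a subset font and a "Custom" encoding, the character codes stored on the page may be private to that font and not standard text codes. Any tool that reads them as ordinary text then gets nonsense unless it also has the font's own code-to-character table. The table below is entirely made up for illustration. It is not taken from the file discussed here.

```python
# Hypothetical subset font: glyphs numbered from 0x21 in order of first use.
CUSTOM_CODES = {0x21: "H", 0x22: "e", 0x23: "l", 0x24: "o"}

# What the page content might store for the word "Hello".
page_codes = bytes([0x21, 0x22, 0x23, 0x23, 0x24])

# Naive extraction: treat the codes as ordinary Latin-1 text.
naive = page_codes.decode("latin-1")

# Extraction with the font's private table: the text comes back.
decoded = "".join(CUSTOM_CODES.get(code, "\ufffd") for code in page_codes)

print(naive)    # !"##$
print(decoded)  # Hello
```

If that is what is going on, no amount of copy and paste will help. The extractor needs the mapping, or the text has to be recovered some other way, such as OCR on rendered pages.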

01-25-2007, 10:35 PM   #6
pclewis
Enthusiast
Posts: 32
Karma: 14644
Join Date: Jan 2007
Location: Texas
Device: Kobo Aura One
PDF Converter

Here is a cheap program ($12.95) that works OK, though some rework is always required. PDF is an output format, so it is what it is.

http://www.thebeatlesforever.com/pro...xt/abcpdf.html

01-26-2007, 07:37 AM   #7
nekokami
@pclewis, thanks. As I mentioned above, I've tried that one already. It does work, it just loses formatting because the text is formatted using different fonts, rather than text styles. I think I'm probably going to have to find someone with a copy of Acrobat Standard (or Pro) that I can use to change the fonts.

Thanks anyway

01-28-2007, 09:59 AM   #8
nekokami
In case anyone is curious, I did eventually determine that the problem was the custom encoding in the PDF, and I solved the problem by finding an alternate (HTML) source of the file. :/ Yet another good example of why PDF is not a good format for source files. (I'm planning to spend today learning LaTeX.)

01-28-2007, 10:18 PM   #9
wallcraft
reader
Posts: 6,975
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
Foxit Reader Pro Pack might be a possibility. I only have the free Reader, which seems to maintain formatting in its text view. The Pro Pack ($39) is needed for full-file text conversion.

01-29-2007, 09:18 AM   #10
nekokami
Looks to me like Foxit's output is plain text, but at least the editor might help if I have to sort out a problem like this in the future.

Meanwhile, the HTML file I have doesn't seem to want to import using the eBw Librarian software. It spits out warnings about unknown fonts, then hits a fatal error. Oddly enough, two very similar files imported just fine. Fortunately, HTML converts to RTF fairly easily, so I can try that next.
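
As a rough illustration of the HTML-to-RTF step, here is a minimal Python sketch using only the standard library. It handles paragraphs, italics, and bold and nothing else, and the class and function names are made up for this example. Whether the eBw Librarian software accepts RTF this bare is an open question, so treat it as a starting point and not a finished converter.

```python
from html.parser import HTMLParser

class RtfBuilder(HTMLParser):
    """Translate <i>/<em>, <b>/<strong>, and <p> into RTF control words."""
    OPEN = {"i": r"{\i ", "em": r"{\i ", "b": r"{\b ", "strong": r"{\b "}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.OPEN:
            self.parts.append(self.OPEN[tag])

    def handle_endtag(self, tag):
        if tag in self.OPEN:
            self.parts.append("}")        # close the styled group
        elif tag == "p":
            self.parts.append("\\par\n")  # end of paragraph

    def handle_data(self, data):
        for ch in data:
            code = ord(ch)
            if ch in "\\{}":
                self.parts.append("\\" + ch)  # escape RTF syntax characters
            elif code < 128:
                self.parts.append(ch)
            elif code > 0xFFFF:
                self.parts.append("?")        # outside the BMP: not handled here
            else:
                if code > 32767:
                    code -= 65536             # \u takes a signed 16-bit value
                self.parts.append(f"\\u{code}?")

def html_to_rtf(html_text):
    builder = RtfBuilder()
    builder.feed(html_text)
    builder.close()
    return "{\\rtf1\\ansi " + "".join(builder.parts) + "}"

print(html_to_rtf("<p>Plain and <i>italic</i> text.</p>"))
```

Font tags are ignored on purpose, which may also sidestep the unknown-font warnings, though that is a guess.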

01-29-2007, 10:27 AM   #11
jęd
Quote:
Originally Posted by nekokami
@jęd, do you recommend any particular command-line converter? I write in perl and (to a lesser extent) php, but I really don't have time to write code right now.
Pdftohtml...?
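
For those who would rather script it, a minimal Python sketch of calling pdftohtml follows. It assumes the poppler-style build, where `-noframes` produces a single HTML file and `-i` skips images. The file names are placeholders and the helper names are invented for this example.

```python
import shutil
import subprocess

def pdftohtml_command(pdf_path, out_base):
    """Build a pdftohtml call: one HTML file, no frames, images skipped."""
    return ["pdftohtml", "-noframes", "-i", pdf_path, out_base]

def convert(pdf_path, out_base):
    """Run pdftohtml, failing early if the tool is not installed."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml is not on PATH")
    subprocess.run(pdftohtml_command(pdf_path, out_base), check=True)

# Example call (placeholder file names):
# convert("book.pdf", "book")
```

Other builds may spell the options differently, so check `pdftohtml -h` on your own machine first.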

01-29-2007, 12:36 PM   #12
nekokami
Thanks. I have come across so many links that I'd given up checking them all. I'm especially happy to see that it's available for OSX.

Update: pdftohtml also does not successfully convert this file. It is confused by the custom encoding, presumably, and outputs strings of punctuation instead of text.

Last edited by nekokami; 01-29-2007 at 08:14 PM. Reason: didn't work
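
When trying several converters in a row, a quick automated check for this kind of "strings of punctuation" output can save some time. Below is a minimal Python sketch of one such heuristic. The function name and the 0.5 threshold are assumptions chosen for illustration, and text that is mostly numbers or symbols would trip it.

```python
def looks_garbled(text, min_letter_ratio=0.5):
    """Return True when too few of the non-space characters are letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # empty output counts as a failed extraction
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) < min_letter_ratio

print(looks_garbled('!"##$ %&\'()'))      # True
print(looks_garbled("Call me Ishmael."))  # False
```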

02-27-2007, 07:04 PM   #13
sartori
Connoisseur
Posts: 54
Karma: 29
Join Date: Oct 2006
Nekokami -

I have one way that saves most of the formatting: if you have a Gmail account, email the PDF to yourself as an attachment. Then view the message and click "View as HTML" on the attachment. Gmail does split the PDF into pages, but it keeps italics etc.

I figure it should be easy to remove the page headers and breaks from the html source.

Hope it helps,

Sartori
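
Removing the page headers and breaks from the HTML source could look something like the Python sketch below. The markers it looks for (a lone `<hr>` for a page break, a line holding only "Page N" or a bare number) are assumptions. The real Gmail output may use different markup, so the patterns would need adjusting after a look at the actual file.

```python
import re

# Assumed page furniture: a lone <hr>, or a line with only a page number.
PAGE_BREAK = re.compile(r"^\s*<hr\s*/?>\s*$", re.IGNORECASE)
PAGE_NUMBER = re.compile(
    r"^\s*(?:<p>)?\s*(?:Page\s+)?\d+\s*(?:</p>)?\s*$", re.IGNORECASE
)

def strip_page_furniture(html_text):
    """Drop lines that look like page breaks or page numbers."""
    kept = [
        line
        for line in html_text.splitlines()
        if not PAGE_BREAK.match(line) and not PAGE_NUMBER.match(line)
    ]
    return "\n".join(kept)

sample = (
    "<p>It was a dark night.</p>\n"
    "<hr>\n"
    "<p>Page 12</p>\n"
    "<p>The <i>rain</i> fell.</p>"
)
print(strip_page_furniture(sample))
```

One caveat: a legitimate line that contains nothing but a number would also be dropped, so it is worth comparing the result against the original.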

02-27-2007, 07:16 PM   #14
nekokami
Thanks, that's a good trick. I'll have to try it.

I eventually found a version of the file that I could reformat, and I have done so, but I'll have to create a gmail account and try this to see if it would work on files with custom encoding.

02-28-2007, 05:17 AM   #15
eimert
Connoisseur
Posts: 58
Karma: 140
Join Date: Jan 2007
Location: Germany
Device: Dell Axim X50v
Quote:
Originally Posted by nekokami
Thanks. I have come across so many links that I'd given up checking them all. I'm especially happy to see that it's available for OSX.

Update: pdftohtml also does not successfully convert this file. It is confused by the custom encoding, presumably, and outputs strings of punctuation instead of text.
Hi,
you could also use pdaConverter (http://www.jakewalk.de/wiki/pmwiki.p...e.PdaConverter). It is freeware (Windows). I used it quite often for converting pdf files for reading on my palm - ages ago :-)
Cheers,
Klaus
All times are GMT -4.