10-07-2011, 06:18 AM   #1
avantman42
Converting multi-column PDFs on Linux

I have some RPG PDFs that I'd like to be able to read on my Kindle. Converting them is a real pain, because the text is in two columns. After some experimentation, I've found the following set of commands, which appear to produce a plain text file with the text in the correct order:

Code:
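# -c: complex layout, -s: single output document, -i: ignore images, -xml: write XML
pdftohtml -c -s -i -xml INPUT_FILE.pdf
# Strip every XML tag, leaving only the text
sed -e 's/<[^>]*>//g' INPUT_FILE.xml > OUTPUT_FILE.txt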
I've only tested this on a few files, but it worked quite well on those.
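One likely bit of cleanup: stripping the tags leaves any XML entities (such as &amp;amp; for an ampersand) in the text, along with blank lines where tag-only lines used to be. The sketch below is one way to handle both, not part of the commands I tested above. The function name strip_xml is made up for illustration, it only decodes the five predefined XML entities (numeric references like &amp;#160; are left alone), and like the sed line above it assumes no tag spans more than one line.

```shell
# Hypothetical helper: reads pdftohtml XML on stdin, writes plain text on stdout.
strip_xml() {
    # 1. remove every tag
    # 2. decode the five predefined XML entities (&amp; last, so that
    #    something like "&amp;lt;" is not decoded twice)
    # 3. drop lines that are empty or whitespace-only after stripping
    sed -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g' \
        -e '/^[[:space:]]*$/d'
}
```

It would be used in place of the sed line, for example: strip_xml < INPUT_FILE.xml > OUTPUT_FILE.txt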

The text will probably need some cleaning up, and of course will contain no formatting, but I found that Calibre was quite good at working out where the headings were when converting the text file to a .mobi.