MobileRead Forums - View Single Post - Converting multi-column PDFs on Linux

avantman42 · 10-07-2011, 07:18 AM

I have some RPG PDFs that I'd like to be able to read on my Kindle. Converting them is a real pain, because the text is in two columns. After some experimentation, I've found the following set of commands, which appear to produce a plain text file with the text in the correct order:

Code:

pdftohtml -c -s -i -xml INPUT_FILE.pdf
sed -e 's/<[^>]*>//g' INPUT_FILE.xml > OUTPUT_FILE.txt

I've only tested this on a few files, but it worked quite well on those.
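One thing to watch for: stripping the tags leaves XML entities such as &amp;amp; and &amp;lt; in the text, because the XML output escapes those characters. Below is a minimal sketch of a filter that removes the tags and then decodes the most common entities. The function name strip_xml is my own invention (it is not part of pdftohtml or sed), and I am assuming only the four entities shown need handling.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdftohtml or sed): reads XML on stdin,
# writes plain text on stdout.
strip_xml() {
    # 1. Remove anything that looks like a tag.
    # 2. Decode common entities; &amp; goes last so that text like
    #    "&amp;lt;" is not decoded twice.
    sed -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g'
}

# Usage, assuming pdftohtml has already written INPUT_FILE.xml:
#   pdftohtml -c -s -i -xml INPUT_FILE.pdf
#   strip_xml < INPUT_FILE.xml > OUTPUT_FILE.txt
```

The tag removal runs before the decoding, so a decoded "<" or ">" in the text is not mistaken for a tag.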

The text will probably need some cleaning up, and of course will contain no formatting, but I found that Calibre was quite intelligent at working out where headings were when converting the text file to a .mobi.
