MobileRead Forums - View Single Post - k2pdfopt: optimizes PDFs for viewing on e-readers

willus · 11-04-2018, 08:05 PM

Quote:

Originally Posted by polarisrising

I'm having some trouble converting a PDF and I was hoping I could get some advice. My goal is to turn a PDF with varying 2-column and 1-column text blocks into a single-column .epub. My thought process was to first run the PDF through k2pdfopt to generate the OCR correctly, in a single column, then run it through calibre.

I'm using k2pdfopt in a terminal on Arch Linux, and I have Tesseract set up correctly. Here are my arguments:

Code:

-m 0.1in,0.8in,0.1in,0.2in -ocr t -ocrhmax .4 -ocrvis t -n- -wrap- -ws -.5 inmemoriarichar00kirk.pdf

Attached is the original pdf and the output that I'm getting.

Basically, the OCR font looks very squished and distorted, and when I run it through calibre, it's treating the word gaps as new <p>.

You don't need to use Tesseract OCR. Your PDF already has an OCR layer. I'd do something like this:

k2pdfopt -m 0.1in,0.34in,0.05in,0.25in -mode 2col inmemoriamrichar00kirk.pdf

Because the page content isn't always in the same place from one page to the next, it's difficult to pick -m margins that crop off the page numbers and horizontal lines. You might add -ehl to erase horizontal lines.
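For anyone who wants to script the two steps, here is a rough sketch of how the command above (with -ehl added) could be chained with calibre's ebook-convert. Several things here are assumptions rather than something tested on this PDF: the value 1 passed to -ehl, the _k2opt.pdf output name (k2pdfopt's default naming, as far as I know), and the helper names run and convert_pdf, which are made up for this example. It only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: k2pdfopt reflow, then calibre conversion to EPUB.
# By default the commands are printed, not executed; set RUN=1 to run them.

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"                      # execute the command as given
    else
        printf '%s\n' "$*"        # dry run: just show the command line
    fi
}

convert_pdf() {
    in_pdf=$1
    base=${in_pdf%.pdf}
    # Assumption: k2pdfopt writes <name>_k2opt.pdf unless told otherwise.
    out_pdf=${base}_k2opt.pdf

    # Margins and mode are taken from the reply above.
    # Assumption: "-ehl 1" is a reasonable setting for erasing horizontal lines.
    run k2pdfopt -m 0.1in,0.34in,0.05in,0.25in -mode 2col -ehl 1 "$in_pdf"

    # ebook-convert is calibre's command-line converter.
    run ebook-convert "$out_pdf" "${base}.epub"
}
```

Running convert_pdf inmemoriamrichar00kirk.pdf without RUN=1 just shows the two command lines, so the margins can be adjusted before anything is actually converted.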