Quote:
Originally Posted by polarisrising
I'm having some trouble converting a PDF and I was hoping I could get some advice. My goal is to turn a PDF with varying 2-column and 1-column text blocks into a single-column .epub. My plan was to first run the PDF through k2pdfopt to generate correct OCR text in a single column, then run the result through calibre.
I'm using k2pdfopt in a terminal on Arch Linux, and I have Tesseract set up correctly. Here are my arguments:
Code:
-m 0.1in,0.8in,0.1in,0.2in -ocr t -ocrhmax .4 -ocrvis t -n- -wrap- -ws -.5 inmemoriarichar00kirk.pdf
Attached is the original pdf and the output that I'm getting.
Basically, the OCR font looks very squished and distorted, and when I run it through calibre, it's treating the word gaps as new <p>.
You don't need to use Tesseract OCR. Your PDF already has an OCR layer. I'd do something like this:
Code:
k2pdfopt -m 0.1in,0.34in,0.05in,0.25in -mode 2col inmemoriamrichar00kirk.pdf
Because the printed page isn't in the same place on every scan, it's hard to pick -m values that crop off the page numbers and horizontal lines on all pages. You might add -ehl to erase horizontal lines.
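For illustration, here is a rough sketch of that command with -ehl added. The margins and file name are the ones from the command above. The value 1 after -ehl is an assumption: as far as I know the option takes a numeric argument, so check k2pdfopt's usage text before relying on it. The RUN variable is just a hypothetical switch for this sketch, so the script only prints the command unless you ask it to run.

```shell
#!/bin/sh
# Sketch only: prints the k2pdfopt command by default; set RUN=1 to execute it.
pdf="inmemoriamrichar00kirk.pdf"
margins="0.1in,0.34in,0.05in,0.25in"   # -m values from the reply above

# Assumption: -ehl takes a numeric argument; 1 is a starting guess, not a tested value.
set -- k2pdfopt -m "$margins" -mode 2col -ehl 1 "$pdf"

if [ "${RUN:-0}" = "1" ]; then
    "$@"          # run k2pdfopt with the arguments built above
else
    echo "$*"     # dry run: show the command that would be executed
fi
```

If the lines still show up in the output, try other -ehl values or adjust the -m margins and compare a few pages.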