Old 11-04-2018, 08:05 PM   #1603
willus
Fuzzball, the purple cat
Posts: 1,303
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by polarisrising View Post
I'm having some trouble converting a PDF and was hoping to get some advice. My goal is to turn a PDF with a mix of 2-column and 1-column text blocks into a single-column .epub. My plan was to first run the PDF through k2pdfopt to generate the OCR correctly in a single column, then run the result through calibre.

I'm using k2pdfopt in a terminal on Arch Linux, and I have Tesseract set up correctly. Here are my arguments:

Code:
-m 0.1in,0.8in,0.1in,0.2in -ocr t -ocrhmax .4 -ocrvis t -n- -wrap- -ws -.5 inmemoriarichar00kirk.pdf
Attached is the original pdf and the output that I'm getting.

Basically, the OCR font looks very squished and distorted, and when I run the output through calibre, it treats the word gaps as new <p> elements.

You don't need to use Tesseract OCR. Your PDF already has an OCR layer. I'd do something like this:

k2pdfopt -m 0.1in,0.34in,0.05in,0.25in -mode 2col inmemoriamrichar00kirk.pdf

Because the page isn't always in the same place within each scan, it is hard to pick -m margins that reliably crop off the page numbers and horizontal lines. You might add -ehl to erase horizontal lines.
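
As a rough sketch of how the two steps could fit together, the script below only prints the commands so you can review them before running anything. The function names (k2_cmd, calibre_cmd) are made up for illustration. It assumes k2pdfopt writes its output under the default name (input name plus _k2opt.pdf) and that calibre's ebook-convert command-line tool is installed. The -ehl flag is appended exactly as written above; check the k2pdfopt usage text for the form your version expects.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them.

# Margins copied from the command suggested above.
K2_MARGINS="0.1in,0.34in,0.05in,0.25in"

# Print the k2pdfopt command for one input PDF.
# A non-empty second argument appends -ehl (erase horizontal lines).
k2_cmd() {
  pdf="$1"
  erase_lines="${2:-}"
  cmd="k2pdfopt -m $K2_MARGINS -mode 2col"
  if [ -n "$erase_lines" ]; then
    cmd="$cmd -ehl"
  fi
  printf '%s %s\n' "$cmd" "$pdf"
}

# Print the calibre step.
# Assumption: k2pdfopt used its default output name, <input>_k2opt.pdf.
calibre_cmd() {
  pdf="$1"
  base="${pdf%.pdf}"
  printf 'ebook-convert %s_k2opt.pdf %s.epub\n' "$base" "$base"
}

k2_cmd inmemoriamrichar00kirk.pdf
calibre_cmd inmemoriamrichar00kirk.pdf
```

Once the printed commands look right, paste them into the terminal one at a time. Calling k2_cmd with any second argument, for example k2_cmd inmemoriamrichar00kirk.pdf yes, adds -ehl to the printed k2pdfopt command.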