11-04-2018, 01:15 PM   #1602
polarisrising
Junior Member
Posts: 5
Karma: 42206
Join Date: Nov 2018
Device: Kindle Paperwhite 3
Not sure what I'm missing

I'm having some trouble converting a PDF and was hoping to get some advice. My goal is to turn a PDF with a mix of 2-column and 1-column text blocks into a single-column .epub. My plan was to first run the PDF through k2pdfopt to get a correctly OCR'd, single-column PDF, then run that through calibre.

I'm running k2pdfopt in a terminal on Arch Linux, and I have Tesseract set up correctly. Here are my arguments:

Code:
-m 0.1in,0.8in,0.1in,0.2in -ocr t -ocrhmax .4 -ocrvis t -n- -wrap- -ws -.5 inmemoriamrichar00kirk.pdf
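For reference, the whole two-step workflow could be sketched as a small shell script. This is only a rough sketch under a few assumptions: the function names (`k2opt_name`, `epub_name`, `convert_pdf`) are made up for illustration, the `_k2opt.pdf` output name is inferred from the attached output file, and the second step assumes calibre's `ebook-convert` command-line tool is on the PATH. The k2pdfopt arguments are the same ones listed above, unchanged.

```shell
# Name k2pdfopt is assumed to give its output: <name>_k2opt.pdf
# (inferred from the attached output file).
k2opt_name() {
    printf '%s_k2opt.pdf\n' "${1%.pdf}"
}

# Name of the final EPUB: <name>.epub
epub_name() {
    printf '%s.epub\n' "${1%.pdf}"
}

# Hypothetical wrapper: OCR and reflow with k2pdfopt, then hand the
# result to calibre's ebook-convert. Nothing runs until it is called.
convert_pdf() {
    src=$1
    k2pdfopt -m 0.1in,0.8in,0.1in,0.2in -ocr t -ocrhmax .4 -ocrvis t \
        -n- -wrap- -ws -.5 "$src" &&
    ebook-convert "$(k2opt_name "$src")" "$(epub_name "$src")"
}
```

It would be called as `convert_pdf inmemoriamrichar00kirk.pdf`. k2pdfopt may still stop and wait for input at its own menu, so this is not guaranteed to run unattended.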
Attached are the original PDF and the output that I'm getting.

Basically, the OCR font looks very squished and distorted, and when I run the output through calibre, it treats the word gaps as new <p> elements.

Thanks!
Attached Files
File Type: pdf inmemoriamrichar00kirk.pdf (6.61 MB, 474 views)
File Type: pdf inmemoriamrichar00kirk_k2opt.pdf (122.0 KB, 232 views)