Old 02-14-2013, 09:06 AM   #324
willus
Quote:
Originally Posted by gadd
i am having difficulty to convert a scanned two page in a row file into k2opt format. i have tried -ocr command but the result was ineffective. which command lines would you suggest to get a solid version of this pdf for kindle device?

p.s. i attach those files to the post.

This is a tough one because the scan isn't great quality. Run the two commands below in succession (I've renamed your file to be more manageable on the command line):

k2pdfopt -mode copy -n -grid 2x1x1.5 -w 1t -h 1t century20.pdf -o temp.pdf

k2pdfopt -dev kpw -as temp.pdf -m 0.4,0.2,0.4,0.2 -de 1.5 -gtr .015 -o century20_k2pdfopt.pdf


The first command splits each two-page scan into separate pages and writes them to a temp file (one book page per PDF page) so that each page can be auto-straightened on its own. The second command processes that temp file to create the final output. If you want OCR and have Tesseract installed, it does a decent job on the temp file; just add -ocr to the second command. Here's what some of the other options do (a complete list of command-line options is here):

-dev kpw sets for paperwhite. You can leave that off if you have an older kindle.

-as will auto-straighten each page, which will help k2pdfopt break up the rows correctly.

-m 0.4,0.2,0.4,0.2 will ignore 0.4 inches on the sides and 0.2 inches on the top and bottom of the temp file. This will chop off some of the unwanted marks in the margins that are keeping k2pdfopt from re-flowing the document correctly.

-de 1.5 sets the defect size to 1.5 points so that little marks that are up to 1.5 points in size will be ignored (the scan quality is poor and you have lots of these).

-gtr 0.015 will make k2pdfopt a little more aggressive than normal in breaking lines of text since some of them are pretty close together.
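
If you end up re-running the two passes a lot while tuning, a small shell wrapper can save some typing. This is only a rough sketch: the function name convert_two_up, the K2PDFOPT variable, and the "_temp.pdf" naming are my own placeholders, not anything built into k2pdfopt. The options are exactly the ones from the two commands above, and any extra arguments (such as -ocr) are appended to the second pass only. Depending on your setup, k2pdfopt may still stop at its interactive menu for each pass.

```shell
#!/bin/sh
# Two-pass sketch: split the two-up scan, then reflow the split pages.
# K2PDFOPT and convert_two_up are illustrative names, not part of k2pdfopt.

K2PDFOPT="${K2PDFOPT:-k2pdfopt}"   # path to the k2pdfopt executable

convert_two_up() {
    if [ "$#" -lt 1 ]; then
        echo "usage: convert_two_up input.pdf [extra pass-2 options]" >&2
        return 2
    fi
    input="$1"                     # e.g. century20.pdf
    shift                          # remaining args go to pass 2 only
    base="${input%.pdf}"
    temp="${base}_temp.pdf"        # assumed temp-file name

    # Pass 1: copy mode, split into a 2x1 grid so each output page is one book page.
    "$K2PDFOPT" -mode copy -n -grid 2x1x1.5 -w 1t -h 1t "$input" -o "$temp" || return 1

    # Pass 2: auto-straighten, crop margins, ignore small defects, break lines
    # a bit more aggressively. Extra options (e.g. -ocr) are appended at the end.
    "$K2PDFOPT" -dev kpw -as "$temp" -m 0.4,0.2,0.4,0.2 -de 1.5 -gtr .015 \
        -o "${base}_k2pdfopt.pdf" "$@"
}
```

After sourcing the file, "convert_two_up century20.pdf" runs both passes, and "convert_two_up century20.pdf -ocr" does the same with OCR turned on for the second pass. To try different values for -m, -de or -gtr, edit them in the pass 2 line.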

Of course, you can try adjusting any or all of the options above. I only had four pages of your book to work with, so they might not be tuned quite correctly. I wasn't able to get a perfect result, but I think it's an improvement over what you got.

Last edited by willus; 02-14-2013 at 09:09 AM.