02-28-2020, 01:22 AM   #14
Tex2002ans
Quote:
Originally Posted by willus
The k2pdfopt app fits most of what you want (e.g. command-line, linux). It has a thread here in the PDF forum on MR.

Fantastic work as always. Yes, if you wanted to keep it in PDF form... your tool is always best. :P

Quote:
Originally Posted by ctop
An example of the PDFs I am looking at is this:

https://archive.org/details/smtliche...ge/n8/mode/2up

(This is the item page; the download link is here:)

https://archive.org/download/smtlich...r16goet_bw.pdf

Any help appreciated

But if you want to take steps toward making the PDF a proper ebook:

I grabbed this book and ran it through Scan Tailor Advanced + Finereader 12.

1. Finereader 12 did a MUCH better job with the color PDF's "yellowed pages", and had no issues creating a B&W version.

I attached it as [Finereader][BW].

(You can see how much better Finereader 12 converts compared to Finereader 8.)

Side Note: I manually erased markings from the first few pages, so they look pure white... just ignore that in your comparisons.

Note: Alternatively, you could feed the color images into Scan Tailor directly (it has 3 different methods to convert to B&W/Grayscale, and you can adjust the thresholds).
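To illustrate what "messing with the thresholds" means, here is a minimal pure-Python sketch of global thresholding, with the cutoff picked by Otsu's method. This is a generic textbook technique and an assumption on my part; the post does not say which three methods Scan Tailor offers, and the function names and the tiny sample page are hypothetical.

```python
def otsu_threshold(gray):
    """Pick a global cutoff for a grayscale page (rows of 0-255 ints).

    Returns the value t that maximizes the between-class variance when
    pixels <= t are treated as ink and pixels > t as paper.
    """
    hist = [0] * 256
    for row in gray:
        for px in row:
            hist[px] += 1

    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[255 if px > threshold else 0 for px in row] for row in gray]


# Hypothetical 2x3 "page": yellowed paper around 200, ink around 40.
page = [[200, 205, 40],
        [198, 35, 202]]
bw = binarize(page, otsu_threshold(page))
```

Raising or lowering the threshold by hand is the knob to turn when faint print drops out or paper discoloration bleeds through as black.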

2. I exported the B&W PDF into PNGs, then ran those through Scan Tailor Advanced.

I spent about an hour going through the various stages, and Scan Tailor did a FANTASTIC job of automatically picking the correct boxes. The page edges are nearly all gone.

I would say I didn't have to touch 95%+ of the pages at all.

Side Note: Despeckling + Outputting have gotten so much faster/better compared to how they used to be. And I only had to use Despeckling on a handful of pages to remove the occasional stray dots. (Being able to see the before/afters marked with red is an enormous help. This is one step where the GUI beats the pants off of pure command line.)
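For the PDF-to-PNG export at the start of step 2, the post doesn't say which tool was used. One way to script it, assuming Poppler's pdftoppm is installed and on your PATH, is sketched below. The file names, the "page" prefix, and the 300 DPI default are all placeholders, not values from the post.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_dir, dpi=300):
    """Build a pdftoppm call that writes one numbered PNG per PDF page.

    Output files are named <out_dir>/page-<number>.png by pdftoppm.
    """
    prefix = Path(out_dir) / "page"
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(prefix)]


def export_pages(pdf_path, out_dir, dpi=300):
    """Run the export. Assumes pdftoppm (Poppler) is available on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, out_dir, dpi), check=True)


# Example (hypothetical file names):
# export_pages("book_bw.pdf", "pages")
```

The resulting folder of PNGs can then be loaded into Scan Tailor Advanced as a new project.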

3. I took the Scan Tailor images, and reimported them into Finereader 12, ran OCR, and output as:

PDF = [ScanTailor][Finereader][BW].pdf. (30 MB is too large to attach, so here's a download.)
EPUB = [Finereader].epub.

You can compare the text, and see how much more accurate Finereader 12 is compared to Archive.org's "EPUB". (Most importantly, the headers and page numbers are nearly all automatically removed and not clogging the text.)

4. I took Finereader's EPUB and ran it through my usual "Finereader cleanup Regex":

Attached it as [Finereader][CodeCleanup].epub.
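The actual regexes aren't listed in this post, so the following is only a guess at the kind of cleanup involved, not the real "Finereader cleanup Regex". It is a minimal sketch: every pattern and the function name are hypothetical, chosen because soft hyphens, empty spans, doubled spaces, and empty paragraphs are common leftovers in OCR-generated HTML.

```python
import re


def cleanup_html(html):
    """Apply a few generic cleanup passes to OCR-exported HTML.

    These rules are illustrative assumptions, not the author's real set.
    """
    html = html.replace("\u00ad", "")                    # drop soft hyphens
    html = re.sub(r"<span[^>]*>\s*</span>", "", html)    # drop empty spans
    html = re.sub(r"[ \t]{2,}", " ", html)               # collapse runs of spaces/tabs
    html = re.sub(r"<p[^>]*>\s*</p>\n?", "", html)       # drop empty paragraphs
    return html


sample = '<p class="a">Goe\u00adthe  Werke<span class="x"></span></p>\n<p> </p>\n'
cleaned = cleanup_html(sample)
```

Regexes like these are safest when run on a copy of the EPUB and spot-checked, since a pattern that is too greedy can silently eat real text.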

Comparison Images

Archive.org Color PDF + Finereader B&W + Scan Tailor Cleanup:

[Image: smtlichewer16goet.-.p16-17[Original.Color].jpg]
[Image: smtlichewer16goet.-.p16-17[Finereader].png]
[Image: smtlichewer16goet.-.p16-17[ScanTailor].png]

Last edited by Tex2002ans; 02-28-2020 at 03:14 AM.