Quote:
Originally Posted by willus
The k2pdfopt app fits most of what you want (e.g. command-line, linux). It has a thread here in the PDF forum on MR.
Fantastic work as always. Yes, if you wanted to keep it in PDF form... your tool is always best. :P
Quote:
Originally Posted by ctop
But if you want to take steps in making the PDF a proper ebook:
I grabbed this book and ran it through Scan Tailor Advanced + Finereader 12.
1. Finereader 12 did a MUCH better job with the color PDF's "yellowed pages", and had no issues creating a B&W version.
I attached it as [Finereader][BW].
(You can see how much better 12 converts compared to 8.)
Side Note: I manually erased markings from the first few pages, so they look pure white... just ignore that in your comparisons.
Note: Alternatively, you could've fed color images into Scan Tailor directly (it has 3 different methods to convert to B&W/Grayscale + you can mess with the thresholds).
2. I exported the B&W PDF into PNGs, then ran that through Scan Tailor Advanced.
I spent about an hour going through the various stages, and Scan Tailor did a FANTASTIC job at automatically picking the correct content boxes. The page edges are nearly all gone.
I would say I didn't have to touch 95%+ of the pages at all.
Side Note: Despeckling + Outputting has gotten so much faster/better compared to how it used to be. And I only had to use Despeckling on a handful of pages to remove the occasional stray dots. (Being able to see the before/afters marked with red is an enormous help. This is one step where a GUI beats the pants off of pure command line.)
3. I took the Scan Tailor images, and reimported them into Finereader 12, ran OCR, and output as:
PDF = [ScanTailor][Finereader][BW].pdf.
(30 MB is too large to attach, so here's a download.)
EPUB = [Finereader].epub.
You can compare the text, and see how much more accurate 12 is compared to Archive.org's "EPUB". (Most importantly, the headers+page numbers are nearly all automatically removed and not clogging the text.)
4. I took Finereader's EPUB and ran it through my usual "Finereader cleanup Regex":
Attached it as [Finereader][CodeCleanup].epub.
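Side Note on the B&W conversion from step 1: if you're wondering what "messing with the thresholds" actually means, here is a rough Python sketch. It uses Otsu's method, one common way to pick a global black/white cutoff from the page's gray values. I'm not claiming this is what Scan Tailor or Finereader do internally; the function names and the `offset` knob are made up for illustration, and the sample pixel values are invented.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for a sequence of 8-bit gray values."""
    # Histogram of gray levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue            # nothing on the dark side yet
        w_light = total - w_dark
        if w_light == 0:
            break               # everything is on the dark side; stop
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: large when ink and paper separate well.
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold, offset=0):
    """Map gray values to 0 (black) or 255 (white).

    offset is a hypothetical manual adjustment: positive values turn
    more pixels black (thicker text), negative values thin it out.
    """
    cut = threshold + offset
    return [0 if p <= cut else 255 for p in pixels]


if __name__ == "__main__":
    # Invented sample: dark ink (30-40) on yellowed paper (200-220).
    page = [30] * 10 + [40] * 10 + [200] * 30 + [220] * 30
    t = otsu_threshold(page)
    print("threshold:", t)
    print("black pixels:", binarize(page, t).count(0))
```

On a yellowed scan the paper values sit well below pure white, which is why a fixed cutoff can go wrong and why being able to nudge the threshold per book (or per page) matters.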
Comparison Images
Archive.org Color PDF + Finereader B&W + Scan Tailor Cleanup:
![smtlichewer16goet.-.p16-17[Original.Color].jpg](https://www.mobileread.com/forums/attachment.php?attachmentid=177414&thumb=1&d=1582877574)
![smtlichewer16goet.-.p16-17[Finereader].png](https://www.mobileread.com/forums/attachment.php?attachmentid=177416&thumb=1&d=1582877574)