MobileRead Forums - View Single Post - From print to ePub

Tex2002ans · 07-19-2023, 03:41 AM

Quote:

Originally Posted by Karellen

In your 2020: "OCRing + EPUBing my first book: Tips?" link, you mention Scan Tailor Advanced.

The only release I could find that has an install file is v0.9.11.1 from 2014.
https://github.com/scantailor/scantailor/releases

The exact version of Scan Tailor Advanced I use is by 4lex4:

https://github.com/4lex4/scantailor-advanced

v1.0.16 was the latest (in 2018).

- - -

Side Note: In September 2019 there was an "Early Access" version, and then it seems like there hasn't been much activity since.

I think, since the 2019 stall, some other person created another fork of it here:

https://github.com/ScanTailor-Advanc...ailor-advanced

but I have no idea about that fork or what sorts of bugs/fixes have been done since.

- - -

Side Note #2: Looks like you linked to the original "Scan Tailor".

"Scan Tailor Advanced" took all the forks, pulled out all the best features, and combined them all into one super version.

The biggest features for me were:

- multi-core support (so it runs MUCH faster than the original)
- image formats besides TIFF

+ lots of other helpful things, all listed on their GitHub.

- - -

Quote:

Originally Posted by Karellen

It seems quite old, [...]

Doesn't matter. It's only used as a middle, pre-processing stage where you are cleaning up the raw images.

I don't foresee too much changing on that front any time soon.

- You feed it the raw photos/scans.
- It helps crop + fix the warping + normalize the B&W/grayscale/colors.
- Then you shove those images into OCR.
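To give a feel for what "normalize the B&W" means, here is a minimal sketch of turning grayscale pixels into pure black/white. It uses Otsu's method as a stand-in; I'm not claiming this is what Scan Tailor Advanced does internally, and the function names `otsu_threshold` and `binarize` are made up for illustration. Pixels are assumed to be plain integers from 0 (black) to 255 (white).

```python
def otsu_threshold(pixels):
    """Pick the cutoff (0-255) that best separates dark from light pixels.

    Otsu's method: try every cutoff and keep the one that maximizes the
    between-class variance of the two resulting groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the cutoff
    sum_dark = 0    # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue        # nothing on the dark side yet
        if w_light == 0:
            break           # everything is on the dark side; stop
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels):
    """Map each pixel to 0 (ink) or 255 (paper) using the Otsu cutoff."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]


# Three "ink" pixels and three "paper" pixels from a hypothetical gray scan.
print(binarize([10, 12, 11, 240, 250, 245]))
```

A real page would be a 2D image loaded through an imaging library, and real tools also handle uneven lighting, which a single global cutoff like this does not.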

You can see me apply it in:

2020: "Optimize PDFs from archive.org for E-Ink devices"

where I quickly:

- took an Archive.org PDF
- ran it through Scan Tailor Advanced
- OCRed it in FineReader
- exported it as EPUB + ran my regex on it.
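The regex step isn't spelled out in this post, so the following is only a rough sketch of the kind of cleanup one might run on OCR output. The patterns and the function name `clean_ocr_text` are assumptions for illustration, not the actual regex used here. It rejoins words that were hyphenated across line breaks, merges hard-wrapped lines back into paragraphs, and collapses stray spaces.

```python
import re


def clean_ocr_text(text):
    """Apply a few common (assumed) cleanup passes to raw OCR text."""
    # Rejoin words split across a line break: "jum-\nped" -> "jumped".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines (hard wraps) into spaces, but leave blank
    # lines alone so paragraph breaks survive.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces/tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


raw = "The quick brown fox jum-\nped over the\nlazy  dog.\n\nNew para-\ngraph here.\n"
print(clean_ocr_text(raw))
```

One caveat: the first pattern also joins genuinely hyphenated words that happen to fall at a line end ("well-\nknown" becomes "wellknown"), so passes like this still need a manual check afterwards.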

You can compare my quickly-generated EPUB vs. the auto-generated Archive.org "EPUB".

Still, nowhere near as good as a manually corrected version, but WAY better quality than just spitting out raw text right out of the PDF.

Quote:

Originally Posted by Karellen

[...] and you mention generational leaps to the OP, so I wonder if the same applies to this software.

Yes, generational leaps in the OCR.

Even on the free/open-source front, there's been a lot of action, but I haven't been following that too closely, because those tools tended to:

- focus on generating only the raw plaintext, ditching all the important formatting!
- be command-line only.
  - (Or have really crappy GUIs.)
  - Okay if you are working on a small amount... but when you have to manually tweak/correct/mark pages, a great GUI is key. It will save you so much pain further down the line.