Old 07-19-2023, 03:41 AM   #5
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Karellen
In your 2020: "OCRing + EPUBing my first book: Tips?" link, you mention Scan Tailor Advanced.

The only release I could find that has an install file is v0.9.11.1 from 2014.
https://github.com/scantailor/scantailor/releases
The exact version of Scan Tailor Advanced I use is by 4lex4:

v1.0.16 was the latest (in 2018).

- - -

Side Note: In September 2019 there was an "Early Access" version, and then it seems like there hasn't been much activity since.

I think, since the 2019 stall, some other person created another fork of it here:

but I have no idea about that fork or what sorts of bugs/fixes have been done since.

- - -

Side Note #2: Looks like you linked to the original "Scan Tailor".

"Scan Tailor Advanced" took all the forks, pulled out all the best features, and combined them all into one super version.

The biggest features for me were:
  • multi-core support (so it runs MUCH faster than the original)
  • image formats besides TIFF

+ lots of other helpful things all listed on their Github.

- - -

Quote:
Originally Posted by Karellen
It seems quite old, [...]
Doesn't matter. It's only used as a middle, pre-processing stage where you are cleaning up the raw images.

I don't foresee too much changing on that front any time soon.
  • You feed it the raw photos/scans.
  • It helps crop + fix the warping + normalize the B&W/grayscale/colors.
  • Then you shove those images into OCR.
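
To give a feel for what "normalize the B&W" can mean, here is a minimal sketch of one common approach: Otsu thresholding, which picks the dark/light cutoff automatically and turns a grayscale page into pure black and white. This is only an illustration, not necessarily what Scan Tailor Advanced does internally. The function names and the flat list of 0-255 pixel values are assumptions made for the example.

```python
def otsu_threshold(pixels):
    """Return the 0-255 cutoff that best separates dark pixels from light ones."""
    # Histogram of grayscale values.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate cutoff
    sum_dark = 0  # sum of their values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # nothing on the dark side yet
        w_light = total - w_dark
        if w_light == 0:
            break     # everything is on the dark side, no split left to test
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (unnormalized): larger means a cleaner split.
        score = w_dark * w_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Hypothetical tiny "page": three dark ink pixels, three light paper pixels.
page = [30, 40, 35, 200, 210, 220]
cutoff = otsu_threshold(page)
clean = binarize(page, cutoff)
```

A real tool works on full 2D images and usually offers adaptive (per-region) methods too, which cope better with uneven lighting in photos.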

You can see me apply it in:

where I quickly:
  • took an Archive.org PDF
  • ran it through Scan Tailor Advanced
  • OCRed it in Finereader
  • exported as EPUB + ran my regex on it.

You can compare my quickly-generated EPUB vs. the auto-generated Archive.org "EPUB".

Still, nowhere near as good as a manually corrected version, but WAY better quality than just spitting out raw text right out of the PDF.
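
As a rough idea of the kind of regex cleanup that helps after OCR, here is a minimal sketch. These are not my actual regexes, just hypothetical examples of typical fixes: rejoining words hyphenated across line breaks, collapsing extra spaces, and unwrapping hard line breaks inside paragraphs. The function name `clean_ocr_text` is made up for illustration.

```python
import re


def clean_ocr_text(text):
    """Apply a few typical (illustrative) cleanups to raw OCR text."""
    # Rejoin words split by a hyphen at a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Unwrap single line breaks inside a paragraph,
    # while keeping blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text.strip()


raw = "The quick brown fox jum-\nped over  the\nlazy dog.\n\nNext para."
cleaned = clean_ocr_text(raw)
```

Caveat: the hyphen rule will also wrongly merge real compounds that happened to break at a line end (like "well-\nknown"), which is one reason a manual correction pass still beats anything automatic.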

Quote:
Originally Posted by Karellen
[...] and you mention generational leaps to the OP, so I wonder if the same applies to this software.
Yes, generational leaps in the OCR.

Even on the free/open-source front, there's been a lot of action, but I haven't been following that too closely... because those tools tended to:
  • focus on generating only the raw plaintext, ditching all the important formatting!
  • be commandline only.
    • (Or have really crappy GUIs.)
    • Okay if you are working on a small amount... but when you have to manually tweak/correct/mark pages, a great GUI is key. It will save you so much pain further down the line.

Last edited by Tex2002ans; 07-19-2023 at 04:06 AM.