Quote:
Originally Posted by Shohreh
Hello,
I'd like to turn an out-of-print paper book I have into an EPUB.
I just tried taking pictures of a few pages using my smartphone, and fed them to gImageReader (a GUI to Tesseract).
The text only has a few errors, and I'll have to manually remove mid-line carriage returns, but it's pretty good.
Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?
Thank you.
I also use gImageReader (the qt5 build) on Arch Linux. Mine looks slightly different.
See screenshot
I only process .tif images produced by Scan Tailor.
I recognize the text in hOCR format, in blocks of 70 pages at most.
I save each block as an HTML file (see red arrow).
I open the block file in LibreOffice and save it as .odt.
Each block is 3 MB at most.
I remove all bookmarks and sections, block by block.
The result is a reasonably clean .odt file that I later convert with ODTImport (a Sigil plugin).
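
In case it helps with planning the 250 pages, here is a rough Python sketch of how the page images could be grouped into blocks of 70 before feeding them to gImageReader. The function names and the folder argument are made up for illustration, and it assumes the .tif files have zero-padded names so that sorting by name gives page order. The OCR itself is still done by hand in the GUI.

```python
from pathlib import Path


def list_tif_pages(folder):
    """Return the .tif files in a folder, sorted by file name.

    Assumption: names are zero-padded (page_001.tif, page_002.tif, ...),
    so alphabetical order matches page order.
    """
    return sorted(Path(folder).glob("*.tif"))


def page_blocks(pages, max_pages=70):
    """Split an ordered list of pages into blocks of at most max_pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]


# Example with 250 placeholder page names instead of real files.
example_pages = [f"page_{n:03d}.tif" for n in range(1, 251)]
for number, block in enumerate(page_blocks(example_pages), start=1):
    print(f"block {number}: {len(block)} pages, {block[0]} to {block[-1]}")
```

With 250 pages this gives four blocks: three of 70 pages and a last one of 40.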