Old 06-05-2020, 08:51 PM   #4
retiredbiker
Evangelist
Posts: 451
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
Rather than making 1000 files, you might consider doing it page by page, and putting the text of each page into Writer or Word. I use OCRFeeder, a GUI front-end to tesseract, which makes this easy. It also does a very good job of unwrapping lines, which is a nice leg up. You can load one or many images at a time, and recognise them one by one or all together, then just copy the text over to your book document. I tend to do about 20 pages per session.

I know it sounds dreary, but you have to proof it all anyway. I find that doing most of the proofing page by page, while I have the scanned image right in front of me, is much less daunting than attacking the whole book later.

Then you can do some styling in the word processor as you build the book, like heading styles for chapters and basic styles to format the text. The result, as an .odt or .docx file, will convert to something a lot prettier than all that bare text, and most of the proofing and styling will be done.

And unless you have some other tools, how else will you get all those bare text lines unwrapped and enclosed in html tags?
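
If you do decide to script that step instead, here is a rough sketch in Python of what "unwrapped and enclosed in html tags" could look like. It is not what OCRFeeder does internally, just an illustration. The function name unwrap_to_html is made up, and it assumes that paragraphs in the OCR text are separated by blank lines and that a hyphen at the end of a line means a word was split across lines (which will be wrong for genuinely hyphenated words, so you would still have to proof the result).

```python
import html
import re


def unwrap_to_html(text: str) -> str:
    """Join hard-wrapped lines into paragraphs and wrap each one in <p> tags.

    Hypothetical helper. Assumes paragraphs are separated by blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    out = []
    for para in paragraphs:
        # Drop empty lines and surrounding whitespace inside the paragraph.
        lines = [ln.strip() for ln in para.splitlines() if ln.strip()]
        joined = ""
        for ln in lines:
            if joined.endswith("-"):
                # Assumption: a trailing hyphen marks a word split across lines.
                joined = joined[:-1] + ln
            elif joined:
                joined += " " + ln
            else:
                joined = ln
        if joined:
            # Escape &, <, > and quotes so the OCR text is safe inside HTML.
            out.append("<p>" + html.escape(joined) + "</p>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "It was a dark and\nstormy night; the rain\nfell in tor-\nrents.\n\nSecond para\nhere."
    print(unwrap_to_html(sample))
```

Even with something like this, you still end up with bare paragraphs and no chapter headings or styling, which is why I prefer building the book in the word processor as I go.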

Last edited by retiredbiker; 06-05-2020 at 09:11 PM.