MobileRead Forums - View Single Post - a project to scan and convert a printed book to EPUB format

retiredbiker · 06-05-2020, 08:51 PM

Rather than making 1000 files, you might consider doing it page by page, and putting the text of each page into Writer or Word. I use OCRFeeder, a GUI front-end to Tesseract, which makes this easy. It also does a very good job of unwrapping lines, which is a nice leg up. You can load one or many images at a time, and recognise them one by one or all together, then just copy the text over to your book document. I tend to do about 20 pages per session.
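
If you would rather script the same page-by-page routine than click through a GUI, here is a rough sketch. It is not how OCRFeeder works internally; it just calls the tesseract command-line program directly, which has to be installed and on your PATH. The helper names (batches, tesseract_command, ocr_page), the "scans" folder and the PNG file pattern are all made up for illustration, and the batch size of 20 only mirrors the session size mentioned above.

```python
import subprocess
from pathlib import Path


def batches(items, size=20):
    """Split a sequence of page images into sessions of `size` pages."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


def tesseract_command(image_path, lang="eng"):
    """Build the command line; the output base 'stdout' makes tesseract print the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_page(image_path, lang="eng"):
    """Run tesseract on one page image and return the recognised text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract reports an error
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical layout: scans/page_0001.png, scans/page_0002.png, ...
    pages = sorted(Path("scans").glob("*.png"))
    sessions = batches(pages, size=20)
    if sessions:
        for page in sessions[0]:  # one sitting's worth of pages
            print(f"===== {page.name} =====")
            print(ocr_page(page))
```

Unlike OCRFeeder, this leaves the lines wrapped exactly as they appear on the page, so you would still need to unwrap them before pasting into Writer or Word.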

I know it sounds dreary, but you have to proof it all anyway. I find doing most of the proofing page by page, with the scanned image right in front of me, much less daunting than attacking the whole book later.

Then you can do some styling in the word processor as you build the book, like heading styles for chapters and basic styles to format the text. The result, as an .odt or .docx file, will convert to something a lot prettier than all that bare text, and most of the proofing and styling will be done.

And unless you have some other tools, how else will you get all those bare text lines unwrapped and enclosed in HTML tags?
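
For anyone who does want to go the script route, unwrapping and tagging could look something like the sketch below. It assumes the OCR text separates paragraphs with a blank line and that every other line break is just wrapping; real scans often break that assumption, so treat it as a starting point. The function names are my own, not part of any tool mentioned here.

```python
import html
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            # Rejoin a word split across a line break ("daunt-" + "ing").
            # Caveat: this also merges real hyphenated words such as
            # "well-" + "known", so the result still needs proofing.
            if joined.endswith("-") and line[:1].islower():
                joined = joined[:-1] + line
            else:
                joined += " " + line
        paragraphs.append(joined)
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in <p> tags, escaping &, < and > along the way."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


if __name__ == "__main__":
    sample = "It was a daunt-\ning task & more.\n\nSecond para\ncontinues."
    print(to_html(unwrap_paragraphs(sample)))
```

This only produces bare paragraphs. Chapter headings and other styling would still have to be added by hand, which is where building the book in a word processor earns its keep.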

Last edited by retiredbiker; 06-05-2020 at 09:11 PM.