MobileRead Forums - View Single Post - Converting scanned images for use in epubs.

retiredbiker · 09-02-2023, 05:11 PM

I've done a lot of OCR on old magazines from the Internet Archive. I use OCRFeeder, a front end for Tesseract, for the OCR. It lets me select blocks of text around images, skip adverts, and deal with the dreaded "continued on page nn" right up front, so I don't have a mess in the text file to fix afterwards.

I use GIMP to edit individual images, but I'm no expert. I size full-page images for the resulting epubs with the longest dimension around 1200 px, at around 150 px/in resolution. This gives PLENTY of quality for zooming in on my Kobo, if needed. It keeps the file size reasonable, too. Of course, if an image in the magazine is small, I just leave it that way and eyeball how to make it look good on the reader; see below.
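
As a rough sketch of the sizing rule above, the arithmetic can be written in a few lines of Python. The function name `fit_longest_side` and the default `target=1200` are only illustrative (the post does the resizing by hand in the image editor), and images already smaller than the target are returned unchanged, matching the "just leave it that way" advice.

```python
def fit_longest_side(width, height, target=1200):
    """Return (new_width, new_height) with the longest side scaled down to `target` px.

    Hypothetical helper: images whose longest side is already at or below
    the target are left at their original size.
    """
    longest = max(width, height)
    if longest <= target:
        return width, height
    # Scale both sides by the same factor so the aspect ratio is preserved.
    new_width = max(1, round(width * target / longest))
    new_height = max(1, round(height * target / longest))
    return new_width, new_height


print(fit_longest_side(2400, 3600))  # (800, 1200)
print(fit_longest_side(600, 400))    # (600, 400), small image left alone
```

For scale: 1200 px at 150 px/in works out to 8 inches on the long side.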

I do some cleanup; how much depends on the image. Anything muddy in the original will be just terrible on e-ink, so going to grayscale and playing with contrast are common options. Color images are completely different, and frankly I struggle there if the original is bad. But again, getting higher contrast is good for e-ink. Turning spotty backgrounds to white is definitely worth it for many old/yellow/brown images.
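
To make the cleanup idea concrete, here is a minimal pure-Python sketch of a levels-style adjustment: convert to grayscale, push anything lighter than a chosen white point to pure white, anything darker than a black point to pure black, and stretch what is left in between. This is an assumption about one way to do it, not the GIMP steps used in the post; the names `to_gray` and `clean_for_eink` and the thresholds 200 and 40 are made up for illustration and would need tuning per image.

```python
def to_gray(rgb):
    """Convert an (r, g, b) tuple to one 0-255 luminance value (Rec. 601 weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def clean_for_eink(pixels, white_point=200, black_point=40):
    """Grayscale plus contrast stretch on rows of (r, g, b) tuples.

    Values at or above `white_point` become 255 (yellowed paper turns white),
    values at or below `black_point` become 0, and everything in between is
    stretched linearly. Both thresholds are illustrative guesses.
    """
    span = white_point - black_point
    cleaned = []
    for row in pixels:
        new_row = []
        for rgb in row:
            gray = to_gray(rgb)
            if gray >= white_point:
                value = 255
            elif gray <= black_point:
                value = 0
            else:
                value = round((gray - black_point) * 255 / span)
            new_row.append(value)
        cleaned.append(new_row)
    return cleaned


# One row: yellowed paper, dark ink, and a mid-tone pixel.
sample = [[(230, 215, 170), (30, 30, 30), (110, 100, 90)]]
print(clean_for_eink(sample))
```

In a real workflow an image library or the editor's levels/curves tool would do this on the whole image; the sketch only shows what happens to each pixel.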

Those two-page title spreads I always stitch together into one image and take out any text, so the result is a plain rectangle for a title image. Never try to fit text into some odd-shaped image; e-readers just won't do it.

Put the images into the epub with a CSS class that sets either the height or the width as a percentage, and the other to "auto". Never code them in with hard pixel dimensions.
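
As an illustration of that rule, the small Python helper below writes out such a CSS rule: one dimension as a percentage, the other as `auto`. The helper `image_css` and the class names `full-page` and `small-figure` are hypothetical, and the percentages are placeholders to eyeball on the actual reader.

```python
def image_css(class_name, percent, dimension="width"):
    """Return a CSS rule giving `dimension` as a percentage and the other as auto."""
    if dimension not in ("width", "height"):
        raise ValueError("dimension must be 'width' or 'height'")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    other = "height" if dimension == "width" else "width"
    return (
        f".{class_name} {{\n"
        f"  {dimension}: {percent}%;\n"
        f"  {other}: auto;\n"
        f"}}\n"
    )


# Hypothetical classes: a full-page image and a small in-text figure.
print(image_css("full-page", 100))
print(image_css("small-figure", 40))
```

In the XHTML the image would then carry only the class, for example `<img class="full-page" src="..." alt="..."/>`, with no `width` or `height` attributes in pixels.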

And if you are doing books for general consumption, have a heart for us old, nearly blind folks: test your book on e-ink at really huge text sizes, like 24 or 36 points on the reader. That works out to something like 3 or 4 words per line. A lot of fancy stuff that looks good at small text sizes just falls apart when you do that.
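
A back-of-the-envelope check on the words-per-line figure, under loudly stated assumptions: a text column about 4 inches wide, an average character about half an em wide, and an average word of 5 letters plus a space. None of these numbers come from the post, and reader font sizes are not always true points, so treat the result as an order-of-magnitude estimate only.

```python
POINTS_PER_INCH = 72


def words_per_line(column_width_in, font_size_pt,
                   char_width_em=0.5, chars_per_word=5):
    """Estimate words per line; every default here is an assumption."""
    line_width_pt = column_width_in * POINTS_PER_INCH
    char_width_pt = font_size_pt * char_width_em
    chars_per_line = line_width_pt / char_width_pt
    # Each word takes its letters plus one trailing space.
    return chars_per_line / (chars_per_word + 1)


for size in (24, 36):
    estimate = words_per_line(4.0, size)
    print(f"{size} pt: about {estimate:.1f} words per line")
```

With these assumed values the estimate comes out at about 4 words per line at 24 pt and under 3 at 36 pt, which is in the same range as the figure quoted above.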

Last edited by retiredbiker; 09-02-2023 at 05:14 PM.