MobileRead Forums - View Single Post - How to handle images in books while doing OCR of books?

Tex2002ans · 03-10-2020, 12:19 PM

Quote:

Originally Posted by Remomama

I am now in the process of converting this book (about South Indian Villages) from Archive.org [...]

Fantastic. Yet another book from the Public Domain that can be digitized.

But first, I would say work from a different PDF.

There are 2 others available on Archive.org, and this looks to be the better version:

https://archive.org/details/somesout...age/2/mode/2up

Quote:

Originally Posted by Remomama

Will you cut the image part and during proofreading add the image in the correct place or is there any other way to do it?

Depends.

Usually it's better to export from the original source, then clean up the images from there.

If there's only a handful of photos, you can easily open the PDF in GIMP (or any image editor that can import PDF pages), then crop each photo manually.

If working in B&W, this is easier.

If working in Color/Grayscale, things could get a little more complicated (see below).

* * *

For this specific book though, I would recommend working directly with the JPEG2000 source files.

1. On the book's Archive.org main page, on the right side, below all the filetypes (PDF, EPUB, Kindle, [...]), you should see a "Show All" button.

Click on that.

2. In the "Show All" page, you'll see even more formats:

https://archive.org/download/somesouthindianv00slatiala

What you want is the one labeled "raw_jp2.zip". These are the JPEG2000 source images (much higher quality).
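Once the zip is downloaded, pulling the page images out of it can be scripted. Here is a minimal Python sketch using only the standard library. The function name extract_jp2 and the file/folder names in the usage note are placeholders I made up, and I'm assuming the pages sit inside the zip as ".jp2" files (possibly in a subfolder):

```python
import zipfile
from pathlib import Path


def extract_jp2(zip_path, out_dir):
    """Copy every .jp2 file found in zip_path into out_dir.

    Returns the extracted paths, sorted by name (which is usually page order).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            # Skip folders and anything that is not a JPEG 2000 page image.
            if member.is_dir() or not member.filename.lower().endswith(".jp2"):
                continue
            # Flatten any folder structure inside the zip.
            target = out_dir / Path(member.filename).name
            target.write_bytes(archive.read(member))
            extracted.append(target)
    return sorted(extracted)
```

Usage would be something like extract_jp2("book_raw_jp2.zip", "pages_jp2"), with the first argument replaced by whatever the downloaded file is actually called.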

Compare the "Color PDF" vs. the "JPEG 2000" image on page 6:

[Attached image: Some.South.Indian.-.p6[ColorPDF].jpg]

[Attached image: Some.South.Indian.-.p6[JPEG2000].jpg]

If you zoom in, you can see the detail was completely mangled in the PDF version.

* * *

From there, you may want to do further image cleanup. (Removing the yellowing, restoring to grayscale, etc.)

Here are a few topics where that was discussed:

GrannyGrump's fantastic thread "What’s your “image rehab” routine?"
SBT's "Image cleanup tips"

(Everyone has different ways/methods/tools.)
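As one possible starting point (not the routine from either of those threads), here is a rough Python sketch of "restore to grayscale, then push the paper toward white". It assumes the third-party Pillow library is installed. The names levels_table and clean_page, and the black/white thresholds of 40 and 215, are my own placeholders; the right thresholds depend entirely on the scan:

```python
def levels_table(black, white):
    """Build a 256-entry lookup table for a simple levels stretch.

    Pixel values <= black become 0, values >= white become 255,
    and everything in between is scaled linearly.
    """
    span = max(white - black, 1)
    return [min(255, max(0, round((v - black) * 255 / span))) for v in range(256)]


def clean_page(src, dst, black=40, white=215):
    """Convert one scanned page to grayscale and stretch its levels."""
    from PIL import Image  # Pillow (third-party), imported lazily

    with Image.open(src) as img:
        gray = img.convert("L")  # dropping color also drops the yellow cast
        gray.point(levels_table(black, white)).save(dst)
```

Raising "black" darkens the ink and lowering "white" bleaches the paper; go too far in either direction and you start losing detail in the photos, so test on one page before running a whole book.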

* * *

If you want quicker/easier cropping, you could also use Scan Tailor Advanced. I wrote about it just a few weeks ago in:

Optimize PDFs from archive.org for E-Ink devices

but you would have to convert those JPEG2000 files into PNG or TIFF.
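That conversion can be batch-scripted. A minimal Python sketch, assuming the third-party Pillow library is installed and that your Pillow build includes JPEG 2000 support (it relies on OpenJPEG, which not every build has). The function names and folder names here are placeholders of mine:

```python
from pathlib import Path


def png_target(src, out_dir):
    """Return the output path for src: same base name, .png extension."""
    return Path(out_dir) / (Path(src).stem + ".png")


def convert_jp2_folder(in_dir, out_dir):
    """Convert every .jp2 file in in_dir to a PNG in out_dir."""
    from PIL import Image  # Pillow (third-party), imported lazily

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob("*.jp2")):
        dst = png_target(src, out_dir)
        with Image.open(src) as img:
            img.save(dst)  # Pillow picks the PNG format from the extension
        written.append(dst)
    return written
```

Calling convert_jp2_folder("pages_jp2", "pages_png") would leave one PNG per page. PNG is lossless, so nothing is thrown away at this step; swap ".png" for ".tif" in png_target if you'd rather feed TIFFs to Scan Tailor Advanced.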

Using Scan Tailor Advanced, here's the image I was able to crop within a few minutes:

[Attached image: Some.South.Indian.-.p6[ScanTailorAdvanced].jpg]

Quote:

Originally Posted by Remomama

[...] there are images published in the book. OCR wont recognize it but I need to know how do you come around this problem. [...] I use Tesseract for OCR and then the proofread/edit the result text [...]

PS: ABBYY FineReader recognizes images and exports them right along with the text.

If you're working on more Non-Fiction (with Footnotes, Images, Tables, Italics, Smallcaps, [...]), it may save you a lot of work in the long run.

Recognizing formatting is just as important as the words themselves.

FineReader costs a pretty penny (get a slightly older version if budget is an issue), but you'll save yourself a ton of hours if you plan on converting more (complicated) books.