Old 03-08-2020, 03:10 PM   #1
Remomama
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2020
Location: Tamilnadu, India
Device: Kindle Touch 2
Question: How to handle images in books while doing OCR of books?

I am now in the process of converting this book (about South Indian villages) from Archive.org, and there are images printed in the book. OCR won't recognize them, so I need to know how you get around this problem. Do you cut out the image and add it back in the correct place during proofreading, or is there another way to do it?

Before this I had only converted fiction books, so there wasn't much proofreading involved. I use Tesseract for OCR, then proofread/edit the resulting text in LibreOffice Writer and use its bundled EPUB export. Then I convert the EPUB to AZW3 in Calibre to read on my Kindle Touch 2.

Thanks in advance.

Sorry if my English is bad.
Old 03-08-2020, 11:54 PM   #2
hobnail
Running with scissors
Posts: 1,081
Karma: 12203626
Join Date: Nov 2019
Device: none
Distributed Proofreaders might explain how they do it on their web site. This is their blog, the site I'm referring to is in one of the blue boxes on the right: https://blog.pgdp.net/ But I suspect they do their own scans. If I were doing it I'd do the cut and paste that you describe.
Old 03-09-2020, 01:37 AM   #3
Remomama
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2020
Location: Tamilnadu, India
Device: Kindle Touch 2
Quote:
Originally Posted by hobnail View Post
Distributed Proofreaders might explain how they do it on their web site. This is their blog, the site I'm referring to is in one of the blue boxes on the right: https://blog.pgdp.net/ But I suspect they do their own scans. If I were doing it I'd do the cut and paste that you describe.
Thanks
Old 03-10-2020, 11:19 AM   #4
Tex2002ans
Wizard
Posts: 1,831
Karma: 7935843
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Remomama View Post
I am now in the process of converting this book (about South Indian Villages) from Archive.org [...]
Fantastic. Yet another book from the Public Domain that can be digitized.

But first, I would say work from a different PDF.

There are 2 others available on Archive.org, and this looks to be the better version:

https://archive.org/details/somesout...age/2/mode/2up

Quote:
Originally Posted by Remomama View Post
Will you cut the image part and during proofreading add the image in the correct place or is there any other way to do it?
Depends.

Usually it's better to export from the original source, then clean up the images from there.

If there's only a handful of photos, you can easily open the PDF in GIMP (or any image editor), then crop manually.

If working in B&W, this is easier.

If working in Color/Grayscale, things could get a little more complicated (see below).
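
If you'd rather script the crop than do it by hand in an image editor, here is a minimal sketch that only builds an ImageMagick command line. It assumes ImageMagick 7 is installed (the binary is called `magick`; older installs use `convert` instead). The function name, file names, and pixel coordinates are all made up for illustration, and you would still have to find the rectangle yourself in an image viewer.

```python
from pathlib import Path


def crop_command(src, dst, left, top, width, height):
    """Build an ImageMagick command that cuts one rectangle out of a page scan.

    Coordinates are in pixels, measured from the top-left corner of the page.
    Every value passed in is the caller's own measurement (hypothetical here).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if left < 0 or top < 0:
        raise ValueError("left and top must not be negative")
    geometry = f"{width}x{height}+{left}+{top}"
    # "+repage" drops the leftover canvas offset so the output is just the crop.
    return ["magick", str(Path(src)), "-crop", geometry, "+repage", str(Path(dst))]


if __name__ == "__main__":
    # Hypothetical file names and coordinates, for illustration only.
    print(" ".join(crop_command("page_0006.png", "plate_01.png", 150, 300, 1200, 900)))
```

The list form can be handed straight to `subprocess.run(..., check=True)` once the numbers look right.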

* * *

For this specific book though, I would recommend working directly with the JPEG2000 source files.

1. On the book's Archive.org main page, on the right side, below all the filetypes (PDF, EPUB, Kindle, [...]), you should see a "Show All" button.

Click on that.

2. In the "Show All" page, you'll see even more formats:

https://archive.org/download/somesouthindianv00slatiala

What you want is the one labeled "raw_jp2.zip". These are the JPEG2000 source images (much higher quality).
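
Once that ZIP is downloaded, unpacking it can be scripted. This is a minimal sketch using only Python's standard library. The function name is my own, and the sort assumes the page files carry zero-padded numbers in their names, which I have not verified for this particular book.

```python
import zipfile
from pathlib import Path


def extract_jp2_pages(zip_path, out_dir):
    """Extract every .jp2 file from the archive and return the paths, sorted by name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Skip anything that is not a JPEG2000 page (metadata, thumbnails, ...).
        names = [n for n in archive.namelist() if n.lower().endswith(".jp2")]
        archive.extractall(out_dir, members=names)
    # Sorting by name only gives page order if the numbers are zero-padded.
    return sorted(out_dir / n for n in names)
```

Call it with the path of the downloaded ZIP and an empty output folder; the returned list is what you would feed into the next cleanup step.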

Compare the "Color PDF" vs. the "JPEG 2000" image on page 6:

[Attached image: Some.South.Indian.-.p6[ColorPDF].jpg (727.9 KB)]
[Attached image: Some.South.Indian.-.p6[JPEG2000].jpg (1.49 MB)]

If you zoom in, you can see the detail was completely mangled in the PDF version.

* * *

From there, you may want to do further image cleanup. (Removing the yellowing, restoring to grayscale, etc.)

Here are a few topics where that was discussed:

GrannyGrump's fantastic thread "What’s your “image rehab” routine?"
SBT's "Image cleanup tips"

(Everyone has different ways/methods/tools.)
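
To give a rough idea of what "removing the yellowing" boils down to, here is a small per-pixel sketch. It is not the method from either thread above, just one common approach: convert to gray with the usual Rec. 601 luma weights, then stretch the levels so paper goes to white and ink goes to black. The black/white points (40 and 215) are guesses you would tune per scan, and a real tool would apply this to whole images rather than one pixel at a time.

```python
def to_gray(rgb):
    """Convert one (R, G, B) pixel to a 0-255 gray value (Rec. 601 luma weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def stretch_levels(gray, black_point, white_point):
    """Map black_point..white_point onto 0..255, clipping values outside that range."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    scaled = (gray - black_point) * 255 / (white_point - black_point)
    return max(0, min(255, round(scaled)))


def clean_pixel(rgb, black_point=40, white_point=215):
    """Drop the colour cast, then push paper towards white and ink towards black."""
    # The default points are illustrative assumptions, not measured values.
    return stretch_levels(to_gray(rgb), black_point, white_point)
```

With these defaults, a yellowish paper pixel such as (235, 222, 180) ends up pure white, while dark ink stays close to black.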

* * *

If you want quicker/easier cropping, you could also use Scan Tailor Advanced. I wrote about it just a few weeks ago in the thread "Optimize PDFs from archive.org for E-Ink devices".

You would first have to convert those JPEG2000 files into PNG or TIFF, though.
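
For the JPEG2000 conversion, one option is OpenJPEG's `opj_decompress` command-line tool, which picks the output format from the file extension. The sketch below assumes that tool is installed and was built with PNG/TIFF support (not every build is). The function names and folder layout are my own, and other converters would work just as well.

```python
import shutil
import subprocess
from pathlib import Path


def conversion_commands(jp2_dir, out_dir, extension=".png"):
    """Return one opj_decompress command per .jp2 file, in sorted file-name order."""
    jp2_dir, out_dir = Path(jp2_dir), Path(out_dir)
    commands = []
    for src in sorted(jp2_dir.glob("*.jp2")):
        dst = out_dir / (src.stem + extension)
        # -i is the input file, -o the output file; the extension selects the format.
        commands.append(["opj_decompress", "-i", str(src), "-o", str(dst)])
    return commands


def convert_all(jp2_dir, out_dir, extension=".png"):
    """Run the conversions; raises if opj_decompress is not on PATH."""
    if shutil.which("opj_decompress") is None:
        raise RuntimeError("opj_decompress not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in conversion_commands(jp2_dir, out_dir, extension):
        subprocess.run(command, check=True)
```

Pass `extension=".tif"` if you prefer TIFF; the resulting folder can then be loaded into Scan Tailor Advanced.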

Using Scan Tailor Advanced, here's the image I was able to crop within a few minutes:

[Attached image: Some.South.Indian.-.p6[ScanTailorAdvanced].jpg (1.38 MB)]

Quote:
Originally Posted by Remomama View Post
[...] there are images published in the book. OCR wont recognize it but I need to know how do you come around this problem. [...] I use Tesseract for OCR and then the proofread/edit the result text [...]
PS: ABBYY Finereader recognizes images and exports them right along with text.

If you're working on more Non-Fiction (with Footnotes, Images, Tables, Italics, Smallcaps, [...]), it may save you a lot of work in the long run.

Recognizing formatting is just as important as the words themselves.

Finereader costs a pretty penny (get a slightly older version if budget is an issue), but you'll definitely save yourself a ton of hours in the long run if you plan on converting more (complicated) books.

Last edited by Tex2002ans; 03-10-2020 at 12:08 PM.