03-08-2020, 03:10 PM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2020
Location: Tamilnadu, India
Device: Kindle Touch 2
|
How to handle images in books while doing OCR of books?
I am now in the process of converting this book (about South Indian Villages) from Archive.org, and the book contains images. OCR won't recognize them, so I need to know how you get around this problem. Do you cut out the image and add it back in the correct place during proofreading, or is there another way to do it?
Before this I converted only fiction books, so there wasn't much proofreading involved. I use Tesseract for OCR, then proofread/edit the resulting text in LibreOffice Writer and use its bundled EPUB export. Then I convert the EPUB to AZW3 in Calibre to read on my Kindle Touch 2. Thanks in advance. Sorry if my English is bad. |
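For anyone wanting to script the command-line parts of the workflow above, here is a minimal sketch in Python. It only builds and runs the `tesseract` and `ebook-convert` (Calibre) commands; the proofreading and EPUB export in LibreOffice Writer stay manual in between. The function names and the file names in the usage note are made up for illustration, and the sketch assumes both tools are installed and on the PATH.

```python
import subprocess
from pathlib import Path


def tesseract_cmd(page_image: Path, lang: str = "eng") -> list[str]:
    """Build the command that OCRs one page image.

    Tesseract takes an output *base* name and appends ".txt" itself,
    so page_001.png produces page_001.txt next to the image.
    """
    out_base = page_image.with_suffix("")
    return ["tesseract", str(page_image), str(out_base), "-l", lang]


def calibre_cmd(epub: Path) -> list[str]:
    """Build the Calibre command that converts an EPUB to AZW3.

    ebook-convert picks the output format from the file extension.
    """
    return ["ebook-convert", str(epub), str(epub.with_suffix(".azw3"))]


def run(cmd: list[str]) -> None:
    """Run one command and raise if the tool reports an error."""
    subprocess.run(cmd, check=True)
```

Usage would look like `run(tesseract_cmd(Path("page_001.png")))` for each scanned page, and later, once the proofread EPUB exists, `run(calibre_cmd(Path("book.epub")))`.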
03-08-2020, 11:54 PM | #2 |
Running with scissors
Posts: 1,552
Karma: 14325282
Join Date: Nov 2019
Device: none
|
Distributed Proofreaders might explain how they do it on their web site. This is their blog; the site I'm referring to is linked in one of the blue boxes on the right: https://blog.pgdp.net/ But I suspect they do their own scans. If I were doing it, I'd do the cut and paste that you describe.
|
03-09-2020, 01:37 AM | #3 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2020
Location: Tamilnadu, India
Device: Kindle Touch 2
|
Quote:
|
|
03-10-2020, 11:19 AM | #4 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
But first, I would say work from a different PDF. There are 2 others available on Archive.org, and this looks to be the better version: https://archive.org/details/somesout...age/2/mode/2up
Quote:
Usually it's better to export from the original source, then clean up the images from there. If there's only a handful of photos, you can easily open the PDF in GIMP (or any image editor), then crop manually. If working in B&W, this is easier. If working in Color/Grayscale, things could get a little more complicated (see below).
* * *
For this specific book though, I would recommend working directly with the JPEG2000 source files.
1. On the book's Archive.org main page, on the right side, below all the filetypes (PDF, EPUB, Kindle, [...]), you should see a "Show All" button. Click on that.
2. In the "Show All" page, you'll see even more formats: https://archive.org/download/somesouthindianv00slatiala
What you want is the one labeled "raw_jp2.zip". These are the JPEG2000 source images (much higher quality). Compare the "Color PDF" vs. the "JPEG 2000" image on page 6: If you zoom in, you can see the detail was completely mangled in the PDF version.
* * *
From there, you may want to do further image cleanup. (Removing the yellowing, restoring to grayscale, etc.) Here are a few topics where that was discussed:
GrannyGrump's fantastic thread "What’s your “image rehab” routine?"
SBT's "Image cleanup tips"
(Everyone has different ways/methods/tools.)
* * *
If you want quicker/easier cropping, you could also use Scan Tailor Advanced. I wrote about it just a few weeks ago in: Optimize PDFs from archive.org for E-Ink devices
but you would have to convert those JPEG2000 files into PNG or TIFF first. Using Scan Tailor Advanced, here's the image I was able to crop within a few minutes:
Quote:
If working on more Non-Fiction (with Footnotes, Images, Tables, Italics, Smallcaps, [...]), it may save you a lot of work in the long run. Recognizing formatting is just as important as the words themselves. FineReader costs a pretty penny (get a slightly older version if budget is an issue), but you'll definitely save yourself a ton of hours if you plan on converting more (complicated) books. Last edited by Tex2002ans; 03-10-2020 at 12:08 PM. |
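The JPEG2000 step mentioned above (download "raw_jp2.zip", then convert the pages to PNG so Scan Tailor Advanced can open them) could be sketched in Python roughly like this. It is only one possible approach, not necessarily how the poster did it. It assumes the page images inside the archive end in `.jp2`, and that Pillow is installed with JPEG 2000 (OpenJPEG) support; the function names are made up for illustration.

```python
import io
import zipfile
from pathlib import Path, PurePosixPath


def list_jp2_pages(zip_source) -> list[str]:
    """Return the .jp2 member names of the archive, sorted by name.

    zip_source may be a path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as zf:
        return sorted(n for n in zf.namelist() if n.lower().endswith(".jp2"))


def png_name(member: str) -> str:
    """Map an archive member like 'dir/page_0001.jp2' to 'page_0001.png'."""
    return PurePosixPath(member).with_suffix(".png").name


def convert_pages(zip_path, out_dir) -> list[Path]:
    """Decode every .jp2 page in the archive and save it as PNG in out_dir."""
    # Imported here so the helpers above work without Pillow installed.
    from PIL import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in list_jp2_pages(zip_path):
            # Read the whole page into memory so Pillow gets a seekable stream.
            with Image.open(io.BytesIO(zf.read(member))) as img:
                target = out_dir / png_name(member)
                img.save(target)  # format is chosen from the .png extension
            written.append(target)
    return written
```

Something like `convert_pages("raw_jp2.zip", "pages_png")` would then fill a folder of PNGs to load into Scan Tailor Advanced. PNG is lossless, so no detail is lost in this step.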
|||
|