Quote:
Originally Posted by Remomama
I am now in the process of converting this book (about South Indian Villages) from Archive.org [...]
|
Fantastic. Yet another book from the Public Domain that can be digitized.
But first, I'd suggest working from a different PDF.
There are 2 others available on Archive.org, and this looks to be the better version:
https://archive.org/details/somesout...age/2/mode/2up
Quote:
Originally Posted by Remomama
Will you cut the image part and during proofreading add the image in the correct place or is there any other way to do it? 
|
Depends.
Usually it's better to export from the original source, then clean up the images from there.
If there's only a handful of photos, you can easily open the PDF in GIMP (or any image editor), then crop manually.
If working in B&W, this is easier.
If working in Color/Grayscale, things could get a little more complicated (see below).
* * *
For this specific book though, I would recommend working directly with the JPEG2000 source files.
1. On the book's Archive.org main page, on the right side, below all the filetypes (PDF, EPUB, Kindle, [...]), you should see a "Show All" button.
Click on that.
2. In the "Show All" page, you'll see even more formats:
https://archive.org/download/somesouthindianv00slatiala
What you want is the one labeled "raw_jp2.zip". These are the JPEG2000 source images (much higher quality).
Compare the "Color PDF" vs. the "JPEG 2000" image on page 6:
If you zoom in, you can see the detail was completely mangled in the PDF version.
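Once the zip is downloaded, pulling out just the page images can be scripted. Here is a rough Python sketch using only the standard library. The function name `extract_jp2` and the folder names are made up for illustration, and it assumes the page images inside the zip end in ".jp2":

```python
import shutil
import zipfile
from pathlib import Path


def extract_jp2(zip_path, out_dir):
    """Copy every .jp2 member of zip_path into out_dir (flattened).

    Returns the extracted paths, sorted by file name.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Drop any folder prefix stored inside the zip.
            name = Path(info.filename).name
            if info.is_dir() or not name.lower().endswith(".jp2"):
                continue  # skip folders and non-image files
            target = out_dir / name
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return sorted(extracted)
```

Usage would be something like `extract_jp2("downloaded_raw_jp2.zip", "pages")`, where both names are placeholders for whatever you saved the zip as and wherever you want the pages to go.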
* * *
From there, you may want to do further image cleanup. (Removing the yellowing, restoring to grayscale, etc.)
Here are a few topics where that was discussed:
GrannyGrump's fantastic thread "What’s your “image rehab” routine?"
SBT's "Image cleanup tips"
(Everyone has different ways/methods/tools.)
* * *
If you want quicker/easier cropping, you could also use Scan Tailor Advanced. I wrote about it just a few weeks ago in "Optimize PDFs from archive.org for E-Ink devices", but you would first have to convert those JPEG2000 files into PNG or TIFF.
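The JPEG2000 to PNG/TIFF conversion can be batch-scripted. Below is a minimal Python sketch, not the only way to do it (any image tool that reads JPEG2000 would work). It assumes the Pillow library is installed and was built with OpenJPEG support, which is what lets `Image.open` read ".jp2" files. The function names `plan_conversions` and `convert_all` are hypothetical helpers for this example:

```python
from pathlib import Path


def plan_conversions(src_dir, dst_dir, ext=".png"):
    """List (source, destination) pairs for every .jp2 file in src_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [
        (src, dst_dir / (src.stem + ext))
        for src in sorted(src_dir.iterdir())
        if src.suffix.lower() == ".jp2"
    ]


def convert_all(src_dir, dst_dir, ext=".png"):
    """Convert every .jp2 in src_dir to the format implied by ext."""
    # Imported here so the planning step works even without Pillow.
    # Assumption: Pillow was built with OpenJPEG (JPEG 2000) support.
    from PIL import Image

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_conversions(src_dir, dst_dir, ext):
        with Image.open(src) as img:
            img.save(dst)  # Pillow picks the output format from the extension
```

Calling `convert_all("pages", "pages_png")` would write one PNG per page; pass `ext=".tif"` for TIFF instead. The folder names are placeholders.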
Using Scan Tailor Advanced, here's the image I was able to crop within a few minutes:
Quote:
Originally Posted by Remomama
[...] there are images published in the book. OCR wont recognize it but I need to know how do you come around this problem. [...] I use Tesseract for OCR and then the proofread/edit the result text [...]
|
PS: ABBYY FineReader recognizes images and exports them right along with the text.
If you work on more Non-Fiction (with Footnotes, Images, Tables, Italics, Smallcaps, [...]), it may save you a lot of work in the long run.
Recognizing formatting is just as important as the words themselves.
FineReader costs a pretty penny (get a slightly older version if budget is an issue), but you'll definitely save yourself a ton of hours if you plan on converting more (complicated) books.