MobileRead Forums


Shohreh 07-08-2020 05:51 PM

OCRing + EPUBing my first book: Tips?
 
1 Attachment(s)
Hello,

I'd like to turn an out-of-print paper book I have into an EPUB.

I just tried taking pictures of a few pages using my smartphone, and fed them to gImageReader (a GUI to Tesseract).

The text only has a few errors, and I'll have to manually remove mid-line carriage returns, but it's pretty good.

Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?

Thank you.

hobnail 07-08-2020 06:10 PM

I've converted some books that are available as PDF + TXT from archive.org. I use SumatraPDF for opening the PDFs since it knows about the invisible/hidden text layer; maybe all PDF readers do.

I use Sigil and don't need to manually remove the mid-line carriage returns. Use search and replace to turn the blank lines between paragraphs into an end-paragraph tag followed by a begin-paragraph tag: </p><p>. Then jump to the top of the book and add the missing opening <p> tag, and to the bottom of the book and add the missing closing </p> tag. Then use Sigil's Mend and Prettify to make it look good in Sigil. The hyphens that were at the ends of lines can be found by searching for a hyphen followed by a space; you can't remove them all automatically because some belong to words that are normally hyphenated.
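
For anyone who would rather script this step than do it in Sigil, here is a minimal Python sketch of the same idea. The function names are made up for illustration, and the HTML escaping is an extra not mentioned in the post.

```python
import re
from html import escape


def paragraphs_to_html(text: str) -> str:
    """Wrap blank-line-separated blocks of OCR text in <p>...</p> tags."""
    # Normalise Windows/Mac line endings so the blank-line test is reliable.
    text = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    # One or more blank (or whitespace-only) lines separate two paragraphs.
    blocks = re.split(r"\n\s*\n", text)
    # Line breaks left inside a block are harmless: HTML renders them as spaces.
    return "\n\n".join(f"<p>{escape(block, quote=False)}</p>" for block in blocks)


def hyphen_space_candidates(text: str) -> list[str]:
    """List every 'hyphen followed by whitespace' spot for manual review."""
    return re.findall(r"\w+-\s+\w+", text)
```

The second function only lists candidates. As noted above, some of them are genuinely hyphenated words, so each one still needs a human decision.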

The OCR that archive.org uses often reads an exclamation mark (!) or a lowercase L (l) as the digit 1, so search for digits. There are threads here about this and other common errors, with regexps to search for them.
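
As a rough illustration of "search for digits", the Python sketch below flags tokens that mix letters and digits. The pattern is an assumption about what is worth reviewing, not one of the regexps from the threads mentioned above.

```python
import re

# A digit glued to letters is suspicious: "wi1l", "to1d", "Stop1".
# Assumption: genuine mixed tokens such as "2nd" or "A4" are rare enough
# to be skipped by eye during review.
SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def suspicious_digit_tokens(text):
    """Return every token that contains at least one letter and one digit."""
    return SUSPECT.findall(text)
```

Pure numbers such as years are left alone, so a misread "1" standing on its own still has to be caught by proofreading.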

BetterRed 07-08-2020 08:58 PM

If you have access to MS Word 2007 or later on Windows, there are a couple of very useful add-ins you should have a look at:

Toxaris' ePub Tools was created specifically for the issue you're dealing with; see ==>> Index of Useful Links for Book Creators

The TransTools add-in (not free) has some overlap with Toxaris' add-in, but it also has some unique tools for fixing scanned texts, such as its Unbreaker tool; see ==>> Translator Tools.

I use both.

BR

elibrarian 07-09-2020 04:31 AM

Quote:

Originally Posted by Shohreh (Post 4009571)
... I'll have to manually remove mid-line carriage returns, but it's pretty good.

gImageReader has its own tool for that (second button from the left in the output pane), which may (or may not) work for you. You have to select the text before running it (CTRL+A works), and on longer texts it takes a few seconds before the result shows.

Regards,

Kim

Shohreh 07-09-2020 04:53 AM

Thanks much!

pdurrant 07-09-2020 06:23 AM

You will need to do a lot of proofreading to catch OCR errors, such as "rn" confused with "m", etc.
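
A script can pre-flag some of these confusions before the proofreading starts. The Python sketch below is only an illustration: the confusion list is an assumed starting point, and `known_words` stands for whatever word list you have at hand.

```python
import re

# Pairs that OCR engines commonly confuse; an assumed list, extend to taste.
CONFUSIONS = [("rn", "m"), ("m", "rn"), ("cl", "d"), ("vv", "w")]


def suggest_fixes(text, known_words):
    """Return {unknown_word: [known alternatives]} for likely OCR confusions."""
    suggestions = {}
    for word in set(re.findall(r"[A-Za-z]+", text)):
        lower = word.lower()
        if lower in known_words:
            continue
        candidates = set()
        for wrong, right in CONFUSIONS:
            # Try the swap at every position where the confusable string occurs.
            start = lower.find(wrong)
            while start != -1:
                candidate = lower[:start] + right + lower[start + len(wrong):]
                if candidate in known_words:
                    candidates.add(candidate)
                start = lower.find(wrong, start + 1)
        if candidates:
            suggestions[lower] = sorted(candidates)
    return suggestions
```

This only helps when the misread word is not itself a real word. "modem" for "modern" passes any dictionary check, which is why the proofreading is still needed.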

Quoth 07-09-2020 09:52 AM

Use a proper scanner. An archival scanner allows the book to sit thus \/ and uses far better cameras and lenses than in any phone. If it's a common book and the scanner has an ADF, the spine is usually cut off. An expert copy typist (maybe none left?) can probably beat an inexperienced person with a camera phone and need much less proof reading/editing.

Do make sure the copyright has expired. That is now quite complicated.

You'll want to proof read it entirely several times, with a gap of at least a week. You'll not see most of the errors if you are not experienced at proofing.

Pirates do this with ARCs and simply upload a PDF with unproofed text for search to Google Books/Playstore. IMO, the piracy on that and also pirated books packaged as Apps on the Playstore, that Google's book sales/distribution and their scanning of books for search (they DO store entire copyright works on public servers, they misled during the court case).

phillipgessert 07-09-2020 12:08 PM

You might save a little time on the mid-paragraph carriage return thing if you treat that output as markdown. Markdown treats lines separated by a single carriage return as one continuous line/paragraph by default. Basically if you ran your example through pandoc or something, that first block will convert to one paragraph automatically.
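
If pandoc is not at hand, the same paragraph rule is easy to imitate. This is a minimal Python sketch of that one rule only. It does not parse Markdown, and the function name is made up.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; keep blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # str.split() with no arguments collapses every run of whitespace,
    # newlines included, so each paragraph ends up on a single line.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

One caveat with real Markdown converters: they also interpret characters such as asterisks or a leading "#", so raw OCR text can pick up formatting it was never meant to have.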

DaleDe 07-09-2020 03:29 PM

Check our wiki on OCR.

Dale

Turtle91 07-09-2020 07:02 PM

You can also look at diybookscanner.org. They have been helping people build book scanners using cameras for several years. They have quite the community over there as well as software suggestions that might save you tons of time.


Shohreh 07-10-2020 05:34 AM

Thanks again.

I tried Abbyy FineReader, and it worked much better than gImageReader (i.e. Tesseract).

Tex2002ans 07-12-2020 08:48 PM

Quote:

Originally Posted by Shohreh (Post 4009571)
I'd like to turn an out-of-print paper book I have into an EPUB.

[...]

Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?

I've written extensively about this over the years.

On cleaning up your images, I would recommend using Scan Tailor Advanced. This crops your images, fixes distortion due to curved pages, and can turn them B&W.

I wrote a tutorial + more details about this just a few months ago: "Optimize PDFs from archive.org for E-Ink devices" (especially Post #2+#14).

On OCRing and all other errors/situations that may crop up, I recommend my detailed posts in the 2014 topic, "Delicate text digitalizing + scanning issues".

Not too much has changed since then... most of the steps and issues are still exactly the same in 2020.

Quote:

Originally Posted by Shohreh (Post 4010136)
I tried Abbyy FineReader, and it worked much better than gImageReader (i.e. Tesseract).

:thumbsup:

Back in 2014, I wrote another post discussing all the ins-and-outs of free vs. proprietary OCR:

"Can you OCR the images inside of .pdf files?"

Most of the free tools get you the straight text, but then do a poorer job of carrying over the actual formatting (italics/bold, footnotes, superscript, tables, etc.).

With fiction, you would probably be okay... but the more complicated the book, the more time you're going to be spending trying to correct/re-add all the formatting.

roger64 07-13-2020 05:40 AM

1 Attachment(s)
Quote:

Originally Posted by Shohreh (Post 4009571)
Hello,

I'd like to turn an out-of-print paper book I have into an EPUB.

I just tried taking pictures of a few pages using my smartphone, and fed them to gImageReader (a GUI to Tesseract).

The text only has a few errors, and I'll have to manually remove mid-line carriage returns, but it's pretty good.

Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?

Thank you.

I also use gImageReader-qt5, on Arch Linux. Mine looks slightly different. :)

See screenshot

I process only .tif images coming from Scan Tailor.
I recognize the text in hOCR format, in blocks of 70 pages max.
I save each block as an HTML file (see red arrow).
I insert the block file into LibreOffice and save it as ODT.
Each block is 3 MB max.
I remove all bookmarks and sections, block by block.

The result is a clean enough ODT file that will later be converted using ODTImport (a Sigil plugin).
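
Splitting the page images into blocks of at most 70 can be scripted before opening them in the OCR program. This is a small Python sketch; the folder layout and the zero-padded file names are assumptions.

```python
from pathlib import Path


def batches(pages, size=70):
    """Split an ordered list of pages into blocks of at most `size` pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def page_batches(folder, pattern="*.tif", size=70):
    """Collect the page images in `folder` and split them into blocks."""
    # Sorting by name assumes zero-padded file names (0001.tif, 0002.tif, ...).
    return batches(sorted(Path(folder).glob(pattern)), size)
```

For the 250-page book from the first post, this gives three blocks of 70 pages and a final block of 40.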

patrik 07-13-2020 09:37 AM

Quote:

Originally Posted by Tex2002ans (Post 4011098)
I've written extensively about this over the years.

You have no idea how many notes I have due to your posts. :)

Quote:

On cleaning up your images, I would recommend using Scan Tailor Advanced. This crops your images, fixes distortion due to curved pages, and can turn them B&W.

Do you still use Scan Tailor if you are going to use Finereader afterwards?

BTW, I recently got a new scanner with fairly good software (which uses Finereader for OCR). The best output from it is docx. But I miss the "verify text" step. Have you, or anyone else, found a better, or at least equal, way to go through the text to find errors?

Tex2002ans 07-13-2020 06:18 PM

2 Attachment(s)
Quote:

Originally Posted by patrik (Post 4011223)
Do you still use Scan Tailor if you are going to use Finereader afterwards?

Scan Tailor Advanced is best used as a pre-OCR step.

It's really only needed if you have ugly input that needs serious cleaning.

You mentioned taking pictures with your smartphone, so that would cause issues like:
  • Rotation
  • Spine showing
    • Can easily cut Left/Right pages, or crop the spine out of the image.
  • Bent pages (thus bent/wavy lines of text)
    • It can dewarp them to become straight.
  • Uneven Lighting/Color (Yellowed Pages)
    • When trying to grayscale/B&W, you could get a "ring" or tons of black speckles.

So Scan Tailor would take you from something like this:

Attachment 180573 Attachment 180574

to this:

Attachment 177415

(Those images were from the book in the "Optimize PDFs" thread.)
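
The uneven-lighting problem listed above can be shown with a toy example. The Python sketch below is not Scan Tailor's algorithm, and all pixel values are made up. It only compares a single global cutoff with a threshold taken relative to each pixel's neighbourhood.

```python
def global_threshold(image, cutoff=128):
    """Mark a pixel as ink (1) when it is darker than one fixed cutoff."""
    return [[1 if px < cutoff else 0 for px in row] for row in image]


def local_threshold(image, radius=1, offset=10):
    """Mark a pixel as ink (1) when it is clearly darker than its neighbourhood."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the pixels in a small window around (x, y).
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            row.append(1 if image[y][x] < mean - offset else 0)
        result.append(row)
    return result


# One row of a page photo: bright paper on the left, shaded paper on the
# right, with one ink pixel (60 and 20) in each half.
row = [[200, 200, 60, 200, 180, 160, 140, 120, 110, 20, 110, 110]]
```

With these numbers, `global_threshold(row)` turns the whole shaded side black, while `local_threshold(row)` keeps only the two ink pixels. That is roughly why a phone photo with a shadow across the page needs more than a simple B&W conversion.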

Related Side Note: I also gave an example of handling OCR + images in "How to handle images in books while doing OCR of books?".

Quote:

Originally Posted by patrik (Post 4011223)
BTW, I recently got a new scanner with a fairly good software (which uses Finereader for ocr).

Is it the full Finereader? Or just some instant scan -> PDF/DOCX thing?

If it's the full Finereader, you should be able to open it up and have an Original+OCR split in the Left/Right windows.

See my posts in 2013, "Best way to copy text from a PDF or MOBI?". This lets you easily see a magnified version of the exact location in the book, and make sure the text is correct.

That's exactly how I squash most errors... right at the source!

Quote:

Originally Posted by patrik (Post 4011223)
The best output of it is docx.

That's one thing I've changed within the past few years... now I trust Toxaris's EPUB Tools to clean up Finereader's cruft.

When you export, change Finereader to "Formatted Text" and DOCX. Toxaris's EPUB Tools will then clean up the rest.

From there, you could do further cleaning in DOCX (if that's what you're comfortable with), or get it into EPUB as soon as possible + do your cleaning there (that's what I prefer).

Quote:

Originally Posted by patrik (Post 4011223)
But I miss the "verify text" step. Have you, or anyone else, found a better, or at least equal, way to go through the text to find errors?

There are still multiple rounds of proofing that have to be done. Nothing gets rid of that. :D

As always, it's best to squash this stuff as close to the source as possible.

1. Clean Input Images = More Accurate OCR

The cleaner the input, the less time wasted fixing errors. :)

2. Mark/Proofread in Finereader

This is where you make sure "big picture" things are marked—Text, Images, Tables, Headers/Footers.

Then it's helpful to focus on all the "blue highlights" (unsure characters) and fix as many of those as you can.

Also making sure things like bold/italics/superscripts are carried over properly.

3. Export DOCX (or EPUB) out of Finereader

Do further cleanup.

Toxaris's EPUB Tools merges accidental split paragraphs together, etc.

You may have to re-correct "odd" line breaks that may have accidentally been merged, for example, poetry.
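
To show what such a merge can look like, and why verse is at risk, here is a rough Python heuristic. It is not the algorithm EPUB Tools uses; the rule and the function name are assumptions made for this sketch.

```python
def merge_split_paragraphs(paragraphs):
    """Rejoin paragraphs that look like one paragraph broken in two."""
    # Characters that usually close a sentence or a line of dialogue.
    sentence_end = (".", "!", "?", ":", ";", '"', "\u201d", "\u2019", ")")
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        previous_open = bool(merged) and not merged[-1].endswith(sentence_end)
        # Assumed rule: an unfinished paragraph followed by a lowercase start
        # is treated as an accidental split.
        if previous_open and para[0].islower():
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged
```

Verse whose lines start with capitals survives this rule, but lines of poetry that start in lowercase would be glued together, which is exactly the kind of "odd" line break that has to be corrected by hand afterwards.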

If you're comfortable with Word, you may want to add in some more Styles/formatting here (headings, blockquotes, captions, [...]).

4. Clean the EPUB

This is where you also make sure all the little things are correct:
  • Headings are <h1>-<h6>
  • Paragraphs are correct
  • Indentation is correct
  • Blockquotes are <blockquote>s
  • Footnotes are footnotes
  • Left/Center/Right alignment
  • [...]

And with Sigil/Calibre, you have access to more powerful tools/Regex.

For example, one of my favorite tricks is still to search for all hyphenated words in the Spellcheck Lists (I wrote about that all the way back in 2013!).
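
Outside Sigil/Calibre, a similar list can be approximated with a few lines of Python. This sketch is an assumption about one useful check, not a copy of the Spellcheck Lists feature: it reports hyphenated words whose joined form also occurs in the same text.

```python
import re
from collections import Counter


def hyphenation_report(text):
    """Return hyphenated words whose joined form also occurs in the text."""
    words = Counter(
        w.lower() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    )
    report = {}
    for word, count in words.items():
        if "-" not in word:
            continue
        joined = word.replace("-", "")
        if joined in words:
            # (times hyphenated, joined form, times joined)
            report[word] = (count, joined, words[joined])
    return report
```

A word that appears once as "book-case" and several times as "bookcase" is a likely leftover from a line-end hyphen. A word that only ever appears hyphenated is left out of the report.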

And now that "numbers are words", you can use a similar trick to find whole classes of OCR errors (0<->O, 1<->l). (See "Suggestion: Spellcheck Enhancement (Numbers)").

5. Run through a final Spellcheck/Grammarcheck pass

See my 2018 post in "Does Tool Exist to Spellcheck/Grammarcheck by Category?".

If you spellchecked in Sigil/Calibre, maybe try Word (different dictionaries may point out other misspellings).

If you grammarchecked in Word, maybe try LanguageTool or Antidote. Different tools might catch different errors.

And I definitely run EPUB Tools's Dialogue Check—it's the best damn thing since sliced bread, and it catches all the mismatching quotation marks + parentheses/brackets.
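
For readers without Word, the idea behind such a check can be sketched in Python. This is not how Dialogue Check is implemented. It is a simple per-paragraph stack check, and it only understands curly double quotes, parentheses and square brackets.

```python
# Opening characters mapped to their closing partners.
PAIRS = {"\u201c": "\u201d", "(": ")", "[": "]"}


def unbalanced_paragraphs(paragraphs):
    """Return (index, problem) for paragraphs with mismatched quotes/brackets."""
    closers = {close: open_ for open_, close in PAIRS.items()}
    problems = []
    for index, para in enumerate(paragraphs):
        stack = []
        problem = None
        for ch in para:
            if ch in PAIRS:
                stack.append(ch)
            elif ch in closers:
                if stack and stack[-1] == closers[ch]:
                    stack.pop()
                else:
                    problem = f"unexpected {ch}"
                    break
        if problem is None and stack:
            problem = f"unclosed {stack[-1]}"
        if problem:
            problems.append((index, problem))
    return problems
```

Expect some false alarms: when a speech runs over several paragraphs, the closing quote is conventionally omitted at the end of every paragraph but the last, so each hit still needs a look.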

Quote:

Originally Posted by patrik (Post 4011223)
You have no idea how many notes I have due to your posts. :)

:thumbsup:

I'd be interested in learning what sorts of things you marked down in your notes.

I've been trying to put together an "FAQ"-type series of posts for the blog... and I have no idea what sorts of things people found useful over the years.

PM me if you don't want to type about it here. (Wouldn't want to derail this thread.) :D


All times are GMT -4.