Old 07-08-2020, 04:51 PM   #1
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
Question OCRing + EPUBing my first book: Tips?

Hello,

I'd like to turn an out-of-print paper book I have into an EPUB.

I just tried taking pictures of a few pages using my smartphone, and fed them to gImageReader (a GUI to Tesseract).

The text only has a few errors, and I'll have to manually remove mid-line carriage returns, but it's pretty good.

Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?

Thank you.
[Attached image: BA6BD2F4-9BF1-4562-BEA4-CDE3A2BBFF82.png]
Old 07-08-2020, 05:10 PM   #2
hobnail
Running with scissors
Posts: 1,552
Karma: 14325282
Join Date: Nov 2019
Device: none
I've converted some books that are available as PDF + TXT from archive.org. I use SumatraPDF for opening the PDFs since it knows about the invisible/hidden text layer; maybe all PDF viewers do.

I use Sigil and don't need to manually remove the mid-line carriage returns; you can use search and replace to replace the blank lines between paragraphs with an end-paragraph tag followed by a begin-paragraph tag: </p><p>. Then jump to the top of the book and add the missing opening <p> tag, then to the bottom of the book and add the missing closing </p> tag. Then use Sigil's Mend and Prettify to make it look good in Sigil. The hyphens that were at the ends of lines can be found by searching for a hyphen followed by a space; you can't blindly remove them all because sometimes the word is normally hyphenated.
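For anyone who would rather script this than do it in an editor, here is a rough Python sketch of the same two steps: turning blank lines into </p><p> and listing the "hyphen followed by a space" cases for review. The function names and the sample text are made up for illustration, and it assumes paragraphs in the OCR output really are separated by blank lines.

```python
import html
import re


def wrap_paragraphs(text: str) -> str:
    """Turn blank-line-separated plain text into <p>...</p> markup."""
    # Escape &, < and > first so stray characters in the OCR text stay valid markup.
    body = html.escape(text.strip(), quote=False)
    # Each run of blank lines becomes an end tag followed by a start tag.
    body = re.sub(r"\n\s*\n", "</p>\n<p>", body)
    # Add the missing opening tag at the top and closing tag at the bottom.
    return "<p>" + body + "</p>"


def hyphen_space_hits(text: str) -> list:
    """List 'word- word' pairs left by end-of-line hyphenation, for manual review."""
    # Collapse every line break into one space, as an HTML renderer would.
    flat = re.sub(r"\s*\n\s*", " ", text)
    return re.findall(r"\w+- \w+", flat)


sample = "The old light-\nhouse stood on a well-\nknown cape.\n\nIt was dark."
print(wrap_paragraphs(sample))
print(hyphen_space_hits(sample))  # ['light- house', 'well- known']
```

The second list is only a set of candidates: "light- house" should probably lose its hyphen while "well- known" should keep it, so each hit still needs a human decision.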

The OCR that archive.org uses often reads an exclamation mark (!) or a lowercase L (l) as a 1, so search for digits. There are threads here about this and other common errors, with regexps for finding them.
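As one possible starting point for such a search, the sketch below flags tokens that mix ASCII letters and digits, which is where a misread "l" tends to show up inside a word. The pattern and the function name are my own assumptions, not taken from the threads mentioned above, and a lone "1" standing in for "!" or "I" will not be caught this way.

```python
import re

# A whole token that contains at least one ASCII letter and at least one digit.
SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def suspicious_tokens(text: str) -> list:
    """Return tokens that mix letters and digits, in order of appearance."""
    return SUSPECT.findall(text)


print(suspicious_tokens("He wa1ked 12 miles in 1920 and said he11o."))
# ['wa1ked', 'he11o']
```

Pure numbers such as "12" and "1920" are left alone. Legitimate tokens like "1st" or "A4" will be flagged too, so treat the result as a review list rather than something to replace automatically.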

Last edited by hobnail; 07-08-2020 at 05:15 PM.
Old 07-08-2020, 07:58 PM   #3
BetterRed
null operator (he/him)
Posts: 20,568
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
If you have access to MS Word 2007 or later on Windows, there are a couple of very useful add-ins you should have a look at:

Toxaris' ePub Tools, which was created specifically for the issue you're dealing with; see ==>> Index of Useful Links for Book Creators

TransTools (not free), which has some overlap with Toxaris' add-in but also has some unique tools for fixing scanned texts, such as its Unbreaker tool; see ==>> Translator Tools.

I use both.

BR
Old 07-09-2020, 03:31 AM   #4
elibrarian
Imperfect Perfectionist
Posts: 464
Karma: 724664
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
Quote:
Originally Posted by Shohreh View Post
... I'll have to manually remove mid-line carriage returns, but it's pretty good.
gImageReader has its own tool for that (the second button from the left in the output pane), which may or may not work for you. You have to select the text before running it (CTRL+A works), and on longer texts it will take a few seconds before the result shows.

Regards,

Kim
Old 07-09-2020, 03:53 AM   #5
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
Thanks much!
Old 07-09-2020, 05:23 AM   #6
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,506
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
You will need to do a lot of proofreading to catch OCR errors: rn read as m (or the reverse), etc.
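A script can pre-screen for the rn/m confusion, though it will not replace the proofreading. The sketch below is only an illustration: it swaps one "rn" for "m" (or one "m" for "rn") at a time and reports the word if the result is in a list of known words. The word list here is a tiny made-up one; in practice you would load a real dictionary file.

```python
def single_swaps(word, old, new):
    """Yield every variant of word with exactly one occurrence of old replaced by new."""
    start = word.find(old)
    while start != -1:
        yield word[:start] + new + word[start + len(old):]
        start = word.find(old, start + 1)


def rn_m_suspects(words, known_words):
    """Return (ocr_word, suggestion) pairs where one rn<->m swap gives a known word."""
    suspects = []
    for word in words:
        lower = word.lower()
        if lower in known_words:
            continue  # already a real word, nothing to suggest
        for old, new in (("rn", "m"), ("m", "rn")):
            for candidate in single_swaps(lower, old, new):
                if candidate in known_words:
                    suspects.append((word, candidate))
    return suspects


KNOWN = {"time", "government", "burn"}  # hypothetical mini word list
print(rn_m_suspects(["tirne", "govemment", "burn"], KNOWN))
# [('tirne', 'time'), ('govemment', 'government')]
```

Note the limit: when the misread happens to produce another real word, a dictionary check like this sees nothing wrong, which is exactly why the manual read-through is still needed.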
Old 07-09-2020, 08:52 AM   #7
Quoth
the rook, bossing Never.
Posts: 11,156
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
Use a proper scanner. An archival scanner allows the book to sit thus \/ and uses far better cameras and lenses than any phone. If it's a common book and the scanner has an ADF, the spine is usually cut off. An expert copy typist (maybe none left?) can probably beat an inexperienced person with a camera phone, and the result will need much less proofreading/editing.

Do make sure the copyright has expired. That is now quite complicated.

You'll want to proofread it entirely several times, with a gap of at least a week between passes. You'll not see most of the errors if you are not experienced at proofing.

Pirates do this with ARCs and simply upload a PDF with unproofed text for search to Google Books/Playstore. IMO, the piracy on that and also pirated books packaged as Apps on the Playstore, that Google's book sales/distribution and their scanning of books for search (they DO store entire copyright works on public servers, they misled during the court case).

Last edited by Quoth; 07-09-2020 at 08:55 AM.
Old 07-09-2020, 11:08 AM   #8
phillipgessert
Addict
Posts: 311
Karma: 3196258
Join Date: Oct 2015
Location: Madison, WI
Device: Kindle 5th Gen
You might save a little time on the mid-paragraph carriage return thing if you treat that output as markdown. Markdown treats lines separated by a single carriage return as one continuous line/paragraph by default. Basically if you ran your example through pandoc or something, that first block will convert to one paragraph automatically.
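To show what that Markdown behaviour amounts to, here is a minimal pure-Python imitation: split on blank lines, then join the lines inside each paragraph with spaces. It is a sketch of the paragraph rule only, not a Markdown parser; the function name is invented, and a real converter such as pandoc would also interpret things like a leading "#" or trailing double spaces, which this ignores.

```python
import re


def reflow(text: str) -> list:
    """Return one string per paragraph, with single line breaks turned into spaces."""
    # Blank lines (possibly containing spaces) separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, a single line break is just a soft break.
    return [" ".join(line.strip() for line in p.splitlines()) for p in paragraphs]


ocr_text = "It was a bright cold\nday in April, and the\nclocks were striking.\n\nNext paragraph."
for paragraph in reflow(ocr_text):
    print(paragraph)
```

This only works if the OCR output keeps a blank line between paragraphs; if it does not, everything will be merged into one block.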
Old 07-09-2020, 02:29 PM   #9
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Check our wiki on OCR.

Dale
Old 07-09-2020, 06:02 PM   #10
Turtle91
A Hairy Wizard
Posts: 3,094
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
You can also look at diybookscanner.org. They have been helping people build book scanners using cameras for several years. They have quite the community over there as well as software suggestions that might save you tons of time.

Old 07-10-2020, 04:34 AM   #11
Shohreh
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
Thanks again.

I tried ABBYY FineReader, and it worked much better than gImageReader (i.e. Tesseract).
Old 07-12-2020, 07:48 PM   #12
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Shohreh View Post
I'd like to turn an out-of-print paper book I have into an EPUB.

[...]

Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?
I've written extensively about this over the years.

On cleaning up your images, I would recommend using Scan Tailor Advanced. This crops your images, fixes distortion due to curved pages, and can turn them B&W.

I recently wrote a tutorial + more details about this just a few months ago: "Optimize PDFs from archive.org for E-Ink devices" (especially Post #2+#14).

On OCRing and all other errors/situations that may crop up, I recommend my detailed posts in the 2014 topic, "Delicate text digitalizing + scanning issues".

Not too much has changed since then... most of the steps and issues are still exactly the same in 2020.

Quote:
Originally Posted by Shohreh View Post
I tried Abbyy FineReader, and it worked much better than gImageReader (ie. Tesseract).


Back in 2014, I wrote another post discussing all the ins-and-outs of free vs. proprietary OCR:

"Can you OCR the images inside of .pdf files?"

Most of the free tools get you the straight text, but then do a poorer job of carrying over the actual formatting (italics/bold, footnotes, superscript, tables, etc.).

With fiction, you would probably be okay... but the more complicated the book, the more time you're going to spend trying to correct/re-add all the formatting.

Last edited by Tex2002ans; 07-12-2020 at 08:08 PM.
Old 07-13-2020, 04:40 AM   #13
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
I also use gImageReader-qt5, on Arch Linux. Mine looks slightly different.

See screenshot.

I process only .tif images coming from Scan Tailor.
I recognize text in hOCR format, in blocks of 70 pages max.
I save each block as an HTML file (see red arrow).
I insert the block file into LibreOffice and save it as ODT.
Each block is 3 MB max.
I remove all bookmarks and sections, block by block.

The result is a clean enough ODT file that will later be converted using ODTImport (a Sigil plugin).
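If you want to prepare those blocks ahead of time, the grouping itself is easy to script. The sketch below only splits an ordered list of page files into blocks of at most 70; the file names are hypothetical and the OCR step is still done by hand in gImageReader.

```python
def page_blocks(pages, max_pages=70):
    """Split an ordered list of page files into consecutive blocks of at most max_pages."""
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]


# Hypothetical file names for a 250-page book exported from Scan Tailor.
pages = [f"page_{n:03d}.tif" for n in range(1, 251)]
blocks = page_blocks(pages)
print([len(block) for block in blocks])  # [70, 70, 70, 40]
```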
[Attached image: ksnip.png]

Last edited by roger64; 07-13-2020 at 04:47 AM. Reason: image
Old 07-13-2020, 08:37 AM   #14
patrik
Guru
Posts: 657
Karma: 4568205
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
Quote:
Originally Posted by Tex2002ans View Post
I've written extensively about this over the years.
You have no idea how many notes I have due to your posts.

Quote:
On cleaning up your images, I would recommend using Scan Tailor Advanced. This crops your images, fixes distortion due to curved pages, and can turn them B&W.
Do you still use Scan Tailor if you are going to use Finereader afterwards?

BTW, I recently got a new scanner with fairly good software (which uses Finereader for OCR). The best output of it is docx. But I miss the "verify text" step. Have you, or anyone else, found a better, or at least equal, way to go through the text to find errors?
Old 07-13-2020, 05:18 PM   #15
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by patrik View Post
Do you still use Scan Tailor if you are going to use Finereader afterwards?
Scan Tailor Advanced is best used as a pre-OCR step.

Really only used if you have ugly input that needs serious cleaning.

You mentioned taking pictures with your smartphone, so that would cause issues like:
  • Rotation
  • Spine showing
    • Can easily cut Left/Right pages, or crop the spine out of the image.
  • Bent pages (thus bent/wavy lines of text)
    • It can dewarp them to become straight.
  • Uneven Lighting/Color (Yellowed Pages)
    • When trying to grayscale/B&W, you could get a "ring" or tons of black speckles.

So Scan Tailor would take you from something like this:

[Attached images: Page16.jpg, Page17.jpg]

to this:

Attachment 177415

(Those images were from the book in the "Optimize PDFs" thread.)

Related Side Note: I also gave an example of handling OCR + images in "How to handle images in books while doing OCR of books?".

Quote:
Originally Posted by patrik View Post
BTW, I recently got a new scanner with fairly good software (which uses Finereader for OCR).
Is it the full Finereader? Or just some instant scan -> PDF/DOCX thing?

If it's the full Finereader, you should be able to open it up and have an Original+OCR split in the Left/Right windows.

See my posts in 2013, "Best way to copy text from a PDF or MOBI?". This lets you easily see a magnified version of the exact location in the book, and make sure the text is correct.

That's exactly how I squash most errors... right at the source!

Quote:
Originally Posted by patrik View Post
The best output of it is docx.
That's one thing I've changed within the past few years... now I trust Toxaris's EPUB Tools to clean up Finereader's cruft.

When you export, change Finereader to "Formatted Text" and DOCX. Toxaris's EPUB Tools will then clean up the rest.

From there, you could do further cleaning in DOCX (if that's what you're comfortable with), or get it into EPUB as soon as possible + do your cleaning there (that's what I prefer).

Quote:
Originally Posted by patrik View Post
But I miss the "verify text" step. Have you, or anyone else, found a better, or at least equal, way to go through the text to find errors?
There are still multiple rounds of proofing that have to be done. Nothing gets rid of that.

As always, it's best to squash this stuff as close to the source as possible.

1. Clean Input Images = More Accurate OCR

The cleaner the input, the less time wasted fixing errors.

2. Mark/Proofread in Finereader

This is where you make sure "big picture" things are marked—Text, Images, Tables, Headers/Footers.

Then it's helpful to focus on all the "blue highlights" (unsure characters) and fix as many of those as you can.

Also making sure things like bold/italics/superscripts are carried over properly.

3. Export DOCX (or EPUB) out of Finereader

Do further cleanup.

Toxaris's EPUB Tools merges accidental split paragraphs together, etc.

You may have to re-correct "odd" line breaks that may have accidentally been merged, for example, poetry.

If you're comfortable with Word, you may want to add in some more Styles/formatting here (headings, blockquotes, captions, [...]).

4. Clean the EPUB

This is where you also make sure all the little things are correct:
  • Headings are <h1>-<h6>
  • Paragraphs are correct
  • Indentation is correct
  • Blockquotes are <blockquote>s
  • Footnotes are footnotes
  • Left/Center/Right alignment
  • [...]

And with Sigil/Calibre, you have access to more powerful tools/Regex.

For example, one of my favorite tricks is still to search for all hyphenated words in the Spellcheck Lists (I wrote about that all the way back in 2013!).
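Outside Sigil, a similar list can be approximated with a few lines of Python. This is a rough sketch and not how Sigil's Spellcheck Lists work internally: it counts every hyphenated word and shows how often the same word appears without the hyphen, which helps decide whether the hyphen is an end-of-line leftover. The function name and sample sentence are made up.

```python
import re
from collections import Counter


def hyphen_report(text: str) -> list:
    """Return (hyphenated_word, count, count_of_joined_form) tuples, sorted by word."""
    # Words made of ASCII letters, optionally joined by single hyphens.
    words = Counter(w.lower() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text))
    report = []
    for word, count in sorted(words.items()):
        if "-" in word:
            joined = word.replace("-", "")
            report.append((word, count, words.get(joined, 0)))
    return report


sample = "The light-house keeper saw the lighthouse lamp. A well-known, well-known fact."
for row in hyphen_report(sample):
    print(row)
# ('light-house', 1, 1)
# ('well-known', 2, 0)
```

A hyphenated form that also occurs joined elsewhere in the book ("light-house" vs. "lighthouse") is a good candidate for a stray hyphen; one that never occurs joined ("well-known") is more likely correct.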

And now that "numbers are words", you can use a similar trick to find whole classes of OCR errors (0<->O, 1<->l). (See "Suggestion: Spellcheck Enhancement (Numbers)").

5. Run through a final Spellcheck/Grammarcheck pass

See my 2018 post in, "Does Tool Exist to Spellcheck/Grammarcheck by Category?".

If you spellchecked in Sigil/Calibre, maybe try Word (different dictionaries may point out other misspellings).

If you grammarchecked in Word, maybe try LanguageTool or Antidote. Different tools might catch different errors.

And I definitely run EPUB Tools's Dialogue Check—it's the best damn thing since sliced bread, and it catches all the mismatching quotation marks + parentheses/brackets.
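To give an idea of what a check like this involves, here is a small stack-based sketch that flags unmatched curly double quotes, parentheses and brackets within a single paragraph. It is my own illustration and makes no claim about how EPUB Tools implements its Dialogue Check; straight quotes and apostrophes are deliberately ignored because they are ambiguous.

```python
PAIRS = {"“": "”", "(": ")", "[": "]"}


def unbalanced(paragraph: str) -> list:
    """Return descriptions of unmatched or mis-nested quotes and brackets."""
    closers = {close: open_ for open_, close in PAIRS.items()}
    stack, problems = [], []
    for pos, ch in enumerate(paragraph):
        if ch in PAIRS:
            stack.append((ch, pos))  # remember the opener and where it was
        elif ch in closers:
            if stack and stack[-1][0] == closers[ch]:
                stack.pop()  # closer matches the most recent opener
            else:
                problems.append(f"unexpected {ch!r} at {pos}")
    # Anything still on the stack was opened but never closed.
    problems.extend(f"unclosed {ch!r} at {pos}" for ch, pos in stack)
    return problems


print(unbalanced("“Hello (world),” she said."))
print(unbalanced("“Hello,” she said (quietly."))
```

Run it paragraph by paragraph. Dialogue that legitimately continues into the next paragraph without a closing quote will be reported as unclosed, so those hits have to be dismissed by eye.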

Quote:
Originally Posted by patrik View Post
You have no idea how many notes I have due to your posts.


I'd be interested in learning what sorts of things you marked down in your notes.

I've been trying to put together an "FAQ"-type series of posts for the blog... and I have no idea what sorts of things people found useful over the years.

PM me if you don't want to type about it here. (Wouldn't want to derail this thread.)

Last edited by Tex2002ans; 07-13-2020 at 05:28 PM.