Thread: OCR engine
Old 05-05-2014, 03:56 AM   #60
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by cadele View Post
Using Word is probably the worst way of doing things but at least I am on familiar ground with it, and at least now that I have Abbyy 12 and the ScanSnap the many hours have been cut down enormously.
It isn't the worst; lots of people on the boards still use Word somewhere in their workflow.

Sadly, I can't give any tips to speed up that section of the workflow, since I have zero experience with it.

Quote:
Originally Posted by cadele View Post
It actually doesn't take too long to proof in Word and I am keeping a list of common S&R - particularly annoying things like quotation marks and apostrophe's, and the dreaded 0 vs O, 1 instead of l etc.
Do you just have a list, and you manually copy/paste/search, copy/paste/search, copy/paste/search? Or is there some sort of method where you can mass run a bunch of searches?

For example, in Sigil, there is "Saved Searches": https://web.sigil.googlecode.com/git..._searches.html

And I hear that Calibre's Editor just recently added similar functionality as well.
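
To give a rough idea of what "mass running" a list of searches means, here is a minimal Python sketch. It is not how Sigil or Calibre implement Saved Searches; the three patterns and the names `SAVED_SEARCHES` and `run_saved_searches` are made up for illustration, and a real S&R list would need more careful patterns.

```python
import re

# Hypothetical saved-search list: (pattern, replacement) pairs.
# These three are illustrative only, not anyone's actual S&R list.
SAVED_SEARCHES = [
    (r"(?<=[A-Za-z])0(?=[A-Za-z])", "O"),  # digit 0 stuck between letters -> letter O
    (r"(?<=[a-z])1(?=[a-z])", "l"),        # digit 1 between lowercase letters -> letter l
    (r"(?<=\w)'(?=\w)", "\u2019"),         # straight apostrophe inside a word -> curly
]


def run_saved_searches(text, searches=SAVED_SEARCHES):
    """Apply every search/replace pair to the text, in order."""
    for pattern, replacement in searches:
        text = re.sub(pattern, replacement, text)
    return text


sample = "R0BERT didn't se1l we1l."
print(run_saved_searches(sample))
```

The lookbehind/lookahead pieces are what keep real numbers (like 2014 or 100) from being touched: a 0 or 1 is only replaced when it has letters on both sides.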

Quote:
Originally Posted by cadele View Post
I don't have Sigil - I didn't think I could master it very well since I can't code or figure out even basic regex (much as I would like to be able to do this).
Bah, stop being so negative about your skills! You can do both!

HTML can be a little scary in the beginning, but if you keep everything super clean/simple (as I do), it is easy!

Quote:
Chapter 1

This is a sample sentence with bold and italic words.
This is a sample of a sentence in a blockquote.
This is a sample of a second sentence in a blockquote.
Changes into:

Quote:
<h2>Chapter 1</h2>

<p>This is a sample sentence with <b>bold</b> and <i>italic</i> words.</p>

<blockquote>
<p>This is a sample of a sentence in a blockquote.</p>
<p>This is a sample of a second sentence in a blockquote.</p>
</blockquote>
Regex can be scary in the beginning, but I don't think the ones I posted above are TOO scary... and they are extremely helpful.

So you just start out with the super basic ones, and then you build up piece by piece from there. Got one that finds 5 or more numbers in a row?
  • How about you try to get 4 or more numbers in a row?
  • Or pointing out instances of ONLY 3 numbers in a row?
  • Or try to get 4 numbers in a row followed by a comma?
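
One possible set of answers to those exercises, sketched in Python's `re` module (Sigil's regex flavor may differ slightly). The sample sentence and variable names are invented for the demo, and "4 numbers in a row followed by a comma" is read here as exactly 4 digits, which is an assumption.

```python
import re

# Made-up sample text for trying the patterns on.
sample = "In 1984, page 123 cited 20145 and 7 of 1000000 items, 2014."

five_or_more = r"\d{5,}"               # 5 or more digits in a row
four_or_more = r"\d{4,}"               # 4 or more digits in a row
only_three = r"(?<!\d)\d{3}(?!\d)"     # exactly 3 digits, no digit on either side
four_then_comma = r"(?<!\d)\d{4},"     # exactly 4 digits followed by a comma

for name, pattern in [
    ("5 or more", five_or_more),
    ("4 or more", four_or_more),
    ("only 3", only_three),
    ("4 then comma", four_then_comma),
]:
    print(name, "->", re.findall(pattern, sample))
```

Notice how each pattern is just the previous one with one small piece changed: the number inside `{}`, or a lookaround added to stop longer runs from matching.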

Quote:
Originally Posted by cadele View Post
As far as the stats go, I am going to record the time taken and the word count, and probably whether there were any pictures etc which slow things up with the formatting.
These are the stats that I am planning on keeping for every book I convert:
  • Word Count
  • Hours to Convert
  • Words Per Hour (WPH)
    • Derived from Words/Hours
  • Hours spent on overhead (Email, Changelogs, etc. etc.)
  • # of Pictures/Figures
  • # of Footnotes
    • Endnotes/Footnotes?
    • Symboled Footnotes? (*, †, ‡, §, ‖, ¶)
    • (Endnotes are typically faster than footnotes at the bottom of each page, and Symboled Footnotes are SIGNIFICANTLY slower).
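
A minimal sketch of how a per-book record like this could be kept in Python, with WPH derived from Words/Hours as in the list above. The class name, field names, and the sample numbers are all hypothetical; a spreadsheet would do the same job.

```python
from dataclasses import dataclass


@dataclass
class BookStats:
    """One row of conversion stats (field names are illustrative)."""
    title: str
    word_count: int
    hours_to_convert: float
    overhead_hours: float = 0.0  # email, changelogs, etc.
    pictures: int = 0
    footnotes: int = 0

    @property
    def words_per_hour(self) -> float:
        # WPH is derived: Words / Hours
        if self.hours_to_convert <= 0:
            raise ValueError("hours_to_convert must be positive")
        return self.word_count / self.hours_to_convert


# Invented numbers, just to show the calculation.
book = BookStats("Sample Book", word_count=90_000, hours_to_convert=12.0,
                 overhead_hours=1.5, pictures=4, footnotes=37)
print(f"{book.title}: {book.words_per_hour:.0f} WPH")
```

Keeping overhead hours in a separate field means they don't drag down the WPH figure for the conversion work itself.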

I might think of a few more some time. If I ever get into actually scanning the physical books, I will probably keep track of those hours separately as well. And if I ever get more into vectorizing charts/graphs, I will keep track of those as well.

I should also keep track of how long it takes me to actually read books... it is always interesting to see those stats! I currently keep track of all of my hours spent playing Video Games, and that is extremely helpful/useful.

Quote:
Originally Posted by cadele View Post
Your samples look good!
Thank you, I have been fishing around those PDFs/sample images the past few weeks, and it DEFINITELY blows the pants off of many of the scans that are currently out there. The few companies I do EPUB work for were definitely impressed with the quality of the PDFs.... (But as I said, I still have A TON to learn).

This method of PDF creation might be very nice in cases where the condition of the original/older scans wasn't the greatest (there might be writing/markings in the book, yellowed pages, water stains, ink blots, margins cut off, etc.). Take a gander at many of the Archive.org PDFs.

And these PDFs will DEMOLISH the current reprinted junk that is out there (scan -> slap on front/backmatter -> reprint, or scan -> very minor speckle cleanup -> slap on front/backmatter -> reprint).

Also, for those who DO want to read the PDF over an EPUB (I don't know who would be crazy enough to do this), this LaTeX-generated PDF will destroy the crappy PDF scans.

For example, here are some comparison shots of the first PDF I tackled using this method:

[Attached images, before/after pairs: pg101Before.png / pg101LaTeX.png, pg261Before.png / pg261LaTeX.png, pg347Before.png / pg347LaTeX.png]

(And let me tell you, try not to start off with SUPER HARD books the first time. I keep falling into this trap; I did the exact same thing when I first started making EPUBs: tackling the hardest books under the sun first!)

Quote:
Originally Posted by cadele View Post
However, I have developed a deep aversion to PDF as a format after all the slaving I have done to convert from it.
Same. I DESPISE PDF (which is why I want ALL books to be digitized/reflowable, and the text to be in a very portable/searchable form)... but, there are still areas where the current ebook formats are lacking (kerning, equations, vector images (SVG, AI, EPS, ...), Indexes, footnotes, ...).

As long as you have a really clean source document, going backwards to print shouldn't take too long. For example, I was able to generate that fiction PDF in ~15 minutes (once I tackle more books and get more used to the workflow, hopefully I can get this even faster). Non-fiction (which is nearly all the books I work on) is a different beast though: MUCH more complex and more time-consuming.

Anyway, I have been carrying this conversation pretty far away from its original intent (discussion of OCR)... we should probably carry it on elsewhere. Perhaps we can discuss over PM. I would love to teach my methods; it would really help me refine my materials, and it might motivate me to get back into doing more Tutorials!

Last edited by Tex2002ans; 05-05-2014 at 04:03 AM.