OCR engine - Page 4

Hamlet53 · 04-08-2014, 07:23 PM

Quote:

Originally Posted by cadele

Oh, I'm very tempted by this. I am currently using the flatbed scanner at work during my lunch break but it's a bit of a pain.

Are you cutting off the spine of the books, and if so how "neat" do you have to be? I just wonder if the scanner can handle slightly ragged edges.

Thanks!

Yes, I cut the spine away to feed loose pages. For the first book I did this to, I tried just using a cutting board, a straight edge, and a utility knife. That's a slow, tedious process; I found that I could not get a good cut if I tried more than 10-15 pages at a time. I have a power table saw, so now I tightly clamp the book between two pieces of wood with about 1/4” of the book at the binding protruding. Then I just slice that off with the saw. It's not a perfectly smooth cut like HarryT's suggestion will produce, but it's good enough; the cut is straight and even, and the paper is left with only slightly rough edges. It does not have to be perfect, just good enough that the pages do not catch or stick together. However the binding is cut away, it is a good idea to separate the pages and then stack them into the pile to be fed to the scanner.

The scanning and OCR process to produce a text file is fast. I can get that done for a ~400 page book in less than an hour. It's the proofing that takes me time. Then I want everything to match the original, even quotation marks and apostrophes.

cadele · 04-08-2014, 10:46 PM

Thanks very much HarryT and Hamlet53. I might try giving a book 'the chop' and seeing how it goes before lashing out on the scanner. If it doesn't work out too well I can still use the flatbed scanner.

I don't have a nearby friendly printer with a guillotine unfortunately, but I do have a friend with a saw....

cadele · 04-08-2014, 10:53 PM

Quote:

Originally Posted by Hamlet53

The scanning and OCR process to produce a text file is fast. I can get that done for a ~400 page book in less than an hour. It's the proofing that takes me time. Then I want everything to match the original, even quotation marks and apostrophes.

I am the same. I like the book to be exactly as the print version.

Now that I have Abbyy to do the OCR it has cut down enormously on the proofing, but it still takes ages. I make a special point not to calculate how many hours this takes me.

What I really need (after a good duplex scanner) is a cheat sheet of regex to cut down the proofing. Unfortunately I struggle with that - my mind is Teflon when it comes to regex

cadele · 04-24-2014, 12:44 AM

Quote:

Originally Posted by Hamlet53

I purchased this scanner: Fujitsu ScanSnap S1300i Instant PDF Sheet-Fed Mobile Document Scanner It's worked very well for me.

Thank you for this recommendation. Unbelievably this was available through Officeworks in Australia and guess what the Easter Bunny brought me...

Scans wonderfully and the footprint is tiny. Only bugbear is the lack of documentation and subsequent learning curve.

I have tried a Stanley/utility knife with a metal rule for cutting off the spine and so far so good - and no injuries as yet, either

alg2468 · 04-28-2014, 02:09 PM

I find great results with Irisscan ReadIris Software, then importing it into Corel Wordperfect 6 for easy editing and saving of documents.

AJ Starr · 04-29-2014, 09:34 AM

Quote:

Originally Posted by alg2468

I find great results with Irisscan ReadIris Software, then importing it into Corel Wordperfect 6 for easy editing and saving of documents.

Nice to see someone else using WordPerfect. I've used it since 4.2 and prefer it over any other (i.e., Word).

By the by, I just received (yesterday) my Brother 720D mobile double page scanner. It says it scans to OCR (which my current all-in-one does not)

I'll let you know how it performs in a few days.

AJ

Tex2002ans · 04-29-2014, 09:06 PM

Quote:

Originally Posted by cadele

Now that I have Abbyy to do the OCR it has cut down enormously on the proofing, but it still takes ages. I make a special point not to calculate how many hours this takes me.

You should try to keep track of hours, it is quite interesting seeing how much faster/better you get at creating the ebooks.

As I mentioned, it used to take me two weeks of work to go from PDF -> finished EPUB, now I pump out the typical non-fiction economics book in ~8-15 hours.

Side Note: I have a bunch of stats I have been gathering, maybe when I get some more free time I will create a topic on MobileRead showing off the "research". Haven't touched the spreadsheets since March (and still have a ton more info to add to it).

Here is a preview of the Hours to convert + word count of books since I started keeping in-depth track of my hours (~October 2012):

[Image: HourstoConvert.png]

[Image: TotalWordsPerBook.png]

and here is the word count of all books I have converted to EPUB:

[Image: TotalWordsPerBook.(All.Encompassing).png]

Quote:

Originally Posted by cadele

What I really need (after a good duplex scanner) is a cheat sheet of regex to cut down the proofing. Unfortunately I struggle with that - my mind is Teflon when it comes to regex

What is your current process?

Are you just using Finereader to OCR and output to DOC, and then do your proofing there? If you use Microsoft Word, your best bet would probably be to use Toxaris's tools: https://www.mobileread.com/forums/sho...d.php?t=213372

Or are you fixing mistakes in Finereader beforehand (this is my method, since it is very easy to A/B compare). Then doing your more thorough checking elsewhere? (I personally export from Finereader -> EPUB -> Sigil, and then do all the regex/fixing + final spellchecking there).

Quote:

Originally Posted by AJ Starr

It says it scans to OCR (which my current all-in-one does not)

The disadvantage of using the OCR that comes with the device is that they will be using old/obsolete versions of the software.

For example, if you bought a scanner from Year ####, the scanner might come with Adobe Acrobat 7's OCR. (Since the scanner was made, versions 8+ have come out).

Same with the OCRed documents off of Archive.org, they OCR the book at the time of submission (so let's say the book was scanned in 2007, it would be using whatever version of Finereader was around in 2007).

Newer versions of the OCR software most likely have more accurate hyphenation/layout/page/table algorithms, larger dictionaries, more accurate recognition of font/accents/italics/bold/superscript/subscript, etc. etc.

If you wanted more accuracy, your best bet would just be rerunning the documents through whatever the newest version is of the software. So for Archive.org, downloading the source document and re-OCRing it using Finereader 11 or 12 will give you a much better starting point.

DebbyS · 05-02-2014, 11:39 PM

From time to time in my work I scan books for a local publisher who will eventually turn them into ebooks.

I dismantle the book, just carefully tearing it apart -- unless it's a signed copy, in which case I'll use a flatbed scanner because I respect signed copies too much! This has only happened once, though.

But for ordinary books I dismantle them (taking off cover, carefully tearing out pages, using scissors to take care of rough edges), and then I use an Epson GT-S80 (has a page feeder and can scan both sides of a page at once).

The scanner is talked to via a rather older version of a program called Paperport, which has a built-in OCR capability (that I think came from IBM? I don't have it on right now to check that -- Oh, TextBridge, I think). Naturally, depending on the book, the OCR can be quite good or... not.

The most recent book I'm doing has a very tiny font and while it is in English, it also has words in both Spanish and several Native Mexican languages, as well as even tinier endnote superscript numbers which often come out as quote marks. Whoever designed the book gave it a huge blank left margin, forcing the font to be small, I guess.

A few years ago I ran across an article on the net about making proofing easier, and they suggested using the free TrueType font "DPCustomMono". This font helps the proofer know whether the OCR program has mistaken an l ("el") for a 1 ("one"), an Oh for a Zero, and all that. It is also a larger font by nature, so easy to read. For the book I'm doing now, I read a paragraph or two, make sure the non-English words are italicized (like the book), that numbers are right and so forth, and that everything is spelled right (or I point out original typos) and formatted okay. I block the paragraph(s) and use a macro attached to an icon on my tool bar (I'm using Word 2007) to turn the DPCustomMono to Times New Roman 12. If I had used TNR to begin with, I probably would miss a lot, particularly when "l" (el) is used in a date, such as 196O or l96o rather than 1960 (the book has small zeros, which confuses the OCR).

So for anyone actually proofreading, I suggest using "DPCustomMono" to maybe speed things up.

Tex2002ans · 05-03-2014, 02:46 AM

Quote:

Originally Posted by DebbyS

So for anyone actually proofreading, I suggest using "DPCustomMono" to maybe speed things up.

That is a great tip for those who proofread with their eyes! (I do most of my fixing with regex + a quick pass with my eyes).

That font was recommended at Distributed Proofreaders. There is a page showing off this font compared to some others:

http://www.pgdp.net/c/faq/font_sample.php

Quote:

Originally Posted by DebbyS

If I had used TNR to begin with, I probably would miss a lot, particularly when "l" (el) is used in a date, such as 196O or l96o rather than 1960 (the book has small zeros, which confuses the OCR).

That is a very common error from OCR, and is pretty hard to spot with just your eyes in most fonts.

I use these four Regexes to catch those (I have these in my Saved Searches in Sigil, and then I just go through quickly one-by-one and decide on a case-by-case basis):

Search: [l]([0-9])
Replace: 1\1

Search: ([0-9])[l]
Replace: \11

Search: [oO]([0-9])
Replace: 0\1

Search: ([0-9])[oO]
Replace: \10

I believe Word uses a completely different Regex engine, but the spirit should be the same.
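For anyone who would rather try these outside Sigil first, here is a minimal Python sketch of the same four searches. It only lists candidates with a suggested fix and changes nothing, in keeping with the case-by-case review described above. The function name `suggest_fixes` and the sample sentence are made up for illustration. One difference from the Sigil replacements: in Python's `re` module a replacement such as `\11` is read as a reference to group 11, so the sketch writes `\g<1>1` instead.

```python
import re

# The four searches from the post, paired with their suggested replacements.
# \g<1> is used instead of \1 because Python would read "\11" or "\10"
# as a reference to group 11 or group 10.
CHECKS = [
    (re.compile(r"l([0-9])"), r"1\g<1>"),      # el before a digit
    (re.compile(r"([0-9])l"), r"\g<1>1"),      # el after a digit
    (re.compile(r"[oO]([0-9])"), r"0\g<1>"),   # oh before a digit
    (re.compile(r"([0-9])[oO]"), r"\g<1>0"),   # oh after a digit
]


def suggest_fixes(text):
    """Return (position, found, suggestion) tuples for manual review."""
    hits = []
    for pattern, replacement in CHECKS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0), match.expand(replacement)))
    return sorted(hits)


# Hypothetical sample line containing two typical OCR confusions.
sample = "Published in l962, reprinted in 196O."
for position, found, suggestion in suggest_fixes(sample):
    print(position, repr(found), "->", repr(suggestion))
```

Each hit still needs a human decision, since a word like "l0" in a part number or an "o" after a digit in a unit may be correct as printed.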

Quote:

Originally Posted by DebbyS

From time to time in my work I scan books for a local publisher who will eventually turn them into ebooks.

Fantastic, keep up the good work.

All the books must be digitized!

DebbyS · 05-03-2014, 06:12 PM

For my current project, I did a search for "any digit"o [any digit + oh] so I could see if the "o" should be "0" (zero). The OCR was also italicizing words it shouldn't have, but it was largely extending italicized words in the Huichol and Spanish languages to the next few English words, so in the end I'll search for [blank] italicized to see if I missed any, as well as searching for [blank] [DPCustomMono] and trade that for Times New Roman. Accented "o" (oh) tends to become a "6", too, but that's easy to see. I'm sure if the font in the book had been larger than 10point or so, the accuracy of the OCR would have been much better. I'm really glad to have that weird font to use, but will also check into "regex" to see what it is and if I can use it as well

Tex2002ans · 05-03-2014, 08:48 PM

Quote:

Originally Posted by DebbyS

[...] but will also check into "regex" to see what it is and if I can use it as well

"Regex" = shorthand for "Regular Expressions".

It is a way to do "variable searches". So you can do things like:

Search: ([0-9])-([0-9])
Replace: \1–\2

Which says "Look for a number 0 through 9 and 'capture it' in \1 + hyphen + a number 0 through 9 and 'capture it' in \2."

Replace it with "the number that was captured in \1 + EN DASH + whatever number was captured in \2".
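The same search and replace can be sketched in Python, assuming you have the text in a plain string. The function name `hyphen_to_en_dash` is hypothetical, and the en dash is written as the escape `\u2013` so it is easy to tell apart from a hyphen.

```python
import re


def hyphen_to_en_dash(text):
    """Replace a hyphen between two digits with an EN DASH (U+2013)."""
    # \1 and \2 are the digits captured on either side of the hyphen.
    return re.sub(r"([0-9])-([0-9])", r"\1" + "\u2013" + r"\2", text)


print(hyphen_to_en_dash("See pages 12-15 for details."))
```

Note that this replaces every digit-hyphen-digit match in one go, so it would also touch things like phone numbers or ISBNs. That is one reason to step through matches one at a time in an editor rather than replacing everything blindly.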

Or I also use:

Search: [ ][b-z][ ]

Which says "Look for a SPACE + a single lowercase letter 'b' through 'z' + SPACE".

Typically in English, the only lowercase letter that stands by itself is the word "a". Anything else is most likely an OCR error.
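A rough Python equivalent of that search, for illustration only. The name `find_stray_letters` and the sample sentence are invented. The sketch reports where each stray letter sits and leaves the fixing to you.

```python
import re

# SPACE + one lowercase letter from b to z + SPACE, as in the search above.
STRAY_LETTER = re.compile(r"[ ][b-z][ ]")


def find_stray_letters(text):
    """Return (position, letter) for each isolated lowercase letter b-z."""
    return [(m.start(), m.group(0).strip()) for m in STRAY_LETTER.finditer(text)]


# "t he" is a typical OCR split of "the"; the word "a" is left alone.
print(find_stray_letters("He gave t he book to a friend."))
```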

Or I also use this one:

Search: [0-9]{5,}

Which says "look for 5 or more numbers in a row".

Usually only Zip Codes are 5 digits or more, but in all the other cases, it is usually a missing punctuation mark in a large number due to the OCR. For example "20000" -> "20,000".
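As a sketch of how that search could be used outside an editor, the Python below lists every run of five or more digits and shows what the number would look like with thousands separators. The helper names are hypothetical, and the suggestion is only a hint: a Zip Code or an ID number should stay as it is.

```python
import re

# Five or more digits in a row, as in the search above.
LONG_NUMBER = re.compile(r"[0-9]{5,}")


def find_long_numbers(text):
    """Return every run of 5 or more digits found in the text."""
    return [m.group(0) for m in LONG_NUMBER.finditer(text)]


def with_commas(digits):
    """Format a digit string with thousands separators, e.g. for review."""
    return f"{int(digits):,}"


for number in find_long_numbers("It cost 20000 dollars in 1998."):
    print(number, "->", with_commas(number))
```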

With Regex, you typically want to be VERY careful, and never press "Replace All" (unless you know EXACTLY what you are doing). I always do single "Find/Replace", and undo/redo, just to double-check and make sure that it is doing what you want.

And with many of these Regex, I just use them to help point out places that have very common errors (like those single lowercase b-z).

This is what I mean when I say using Regex to proofread is a lot faster, and it helps cut down drastically the amount of errors you would have to find/fix on your own.

Here is a great resource to learn Regex: http://www.regular-expressions.info/tutorial.html

There is also this topic on the Sigil forum where they gathered a lot (although be aware, some of these are quite arcane): https://www.mobileread.com/forums/sho...d.php?t=167971

I am not too familiar with whatever Regex is used in Microsoft Word (I don't use Microsoft Word), but as I stated, the "idea" behind many of them is the same. For example, instead of using the symbol '^' for NOT, Word might use '!' instead.

Here is one of the first things that popped up when searching Microsoft Office Regex: https://office.microsoft.com/en-us/h...001087305.aspx

Quote:

Originally Posted by DebbyS

For my current project, I did a search for "any digit"o [any digit + oh] so I could see if the "o" should be "0" (zero).

Yep yep, it sounds like you are tackling something similar as well in Word already (just don't forget to take into account CAPITAL letter 'O' as well). Now you just have to step the complexity level one step up and save yourself more work!

Quote:

Originally Posted by DebbyS

The OCR was also italicizing words it shouldn't have, but it was largely extending italicized words in the Huichol and Spanish languages to the next few English words,

Hmmm... in this book, is it typically only ONE Huichol or Spanish word that is in italics, or is it a whole Huichol and Spanish phrase, followed by English words?

cadele · 05-04-2014, 04:15 AM

Quote:

Originally Posted by Tex2002ans

You should try to keep track of hours, it is quite interesting seeing how much faster/better you get at creating the ebooks.

. . .

What is your current process?

Are you just using Finereader to OCR and output to DOC, and then do your proofing there? If you use Microsoft Word, your best bet would probably be to use Toxaris's tools: https://www.mobileread.com/forums/sho...d.php?t=213372

Or are you fixing mistakes in Finereader beforehand (this is my method, since it is very easy to A/B compare). Then doing your more thorough checking elsewhere? (I personally export from Finereader -> EPUB -> Sigil, and then do all the regex/fixing + final spellchecking there).

The disadvantage of using the OCR that comes with the device is that they will be using old/obsolete versions of the software.

For example, if you bought a scanner from Year ####, the scanner might come with Adobe Acrobat 7's OCR. (Since the scanner was made, versions 8+ have come out).

Same with the OCRed documents off of Archive.org, they OCR the book at the time of submission (so let's say the book was scanned in 2007, it would be using whatever version of Finereader was around in 2007).

Newer versions of the OCR software most likely have more accurate hyphenation/layout/page/table algorithms, larger dictionaries, more accurate recognition of font/accents/italics/bold/superscript/subscript, etc. etc.

If you wanted more accuracy, your best bet would just be rerunning the documents through whatever the newest version is of the software. So for Archive.org, downloading the source document and re-OCRing it using Finereader 11 or 12 will give you a much better starting point.

I am going to start to keep some stats - you have inspired me!

My process has improved a bit. I now cut the spine off the book and run the pages through the scansnap (unless I want to preserve the book, in which case it is the dreaded flatbed scanner at work during my lunch break).

Then I open the file in Abbyy Finereader 12 and verify the text. This is slow but worth it. I then convert it to a Word document. Following that I set up my page size and layout. I usually try to match the book's general layout without being too OCD about it.

Then I start reading and correcting. I do run a list of search and replace for common OCR errors that I have come up against. Once I finish that I will use Word's spellcheck just to pick up what I have missed.

Then I add a TOC - the Stone Age way, by inserting bookmarks then hyperlinks. (I must learn how to do this in Calibre, it's getting ridiculous!)

Finally I add the book to Calibre, download the metadata and add the cover, then convert it to EPub and Mobi (both types of Mobi).

Oh, and then I back it up. Thud.

Tex2002ans · 05-04-2014, 06:39 PM

Quote:

Originally Posted by cadele

I am going to start to keep some stats - you have inspired me!

Glad to hear I have inspired someone else to start keeping stats. I love keeping stats on things that I do. You can get cool things like this:

I liberated 18,172,166 words from PDF -> EPUB since October 2012. (Although I haven't updated the stats in about a month).

And EPUBs that I read for pleasure + cleaned as I went along, 2,870,128 words.

I will have to go through and add in a Page Count to all of the books as well... that might also lead to some decent stats/graphs. (Although in my opinion, pages are a horrible way to measure. A page of non-fiction =/= a page of fiction =/= a page out of a journal/newspaper =/= a page in different font/font-size/margins). And how would you go about handling measuring "Pages" of text from an HTML source?

Quote:

Originally Posted by cadele

Then I open the file in Abbyy Finereader 12 and verify the text. This is slow but worth it. I then convert it to a Word document. Following that I set up my page size and layout. I usually try to match the book's general layout without being too OCD about it.

Sounds ok. I guess different workflows for different people.

I personally just do all the fixing in minimalist HTML (EPUB) AND THEN, can go back to other formats if needed.

DOC is really a horrible/bloated "source" format. Too much cruft and inconsistencies added in because of the WYSIWYG editing.

And speaking of trying to "match page size/layout"... Here is a sample of some of my latest ventures into working backwards from EPUB -> LaTeX -> PDF:

[Image: pg022Before.png]
[Image: pg022LaTeX.png]

[Image: pg093Before.png]
[Image: pg093LaTeX.png]

[Image: pg119Before.png]
[Image: pg119LaTeX.png]

[Image: pg209Before.png]
[Image: pg209LaTeX.png]

I still have to iron out a few kinks... but I have the basics of the workflow going... now I just have a lot more to learn/absorb/code.

Quote:

Originally Posted by cadele

Finally I add the book to Calibre, download the metadata and add the cover, then convert it to EPub and Mobi (both types of Mobi).

Hmmm... so a DOC -> Calibre -> EPUB/MOBI conversion? Does that give you the cleanest output?

I probably sound like a broken record, but why not use Toxaris's Word Macro?

cadele · 05-04-2014, 11:50 PM

I did try and download Toxaris's Word Macro but unfortunately my antivirus spat the dummy (pacifier) and wouldn't allow it. I haven't gone back so far to try again.

Using Word is probably the worst way of doing things but at least I am on familiar ground with it, and at least now that I have Abbyy 12 and the ScanSnap the many hours have been cut down enormously. It actually doesn't take too long to proof in Word and I am keeping a list of common S&R - particularly annoying things like quotation marks and apostrophes, and the dreaded 0 vs O, 1 instead of l etc.

At one point I was using Atlantis which can convert from a word document to an EPUB, but I found that if there were drop caps etc then Calibre did a better conversion from docx to EPUB and/or Mobi.

I don't have Sigil - I didn't think I could master it very well since I can't code or figure out even basic regex (much as I would like to be able to do this).

As far as the stats go, I am going to record the time taken and the word count, and probably whether there were any pictures etc which slow things up with the formatting.

Your samples look good! However, I have developed a deep aversion to PDF as a format after all the slaving I have done to convert from it.

Tex2002ans · 05-05-2014, 03:56 AM

Quote:

Originally Posted by cadele

Using Word is probably the worst way of doing things but at least I am on familiar ground with it, and at least now that I have Abbyy 12 and the ScanSnap the many hours have been cut down enormously.

It isn't the worst, lots of people on the boards still use Word somewhere in their workflow.

Sadly, I can't give any tips to speed that entire section of a workflow up since I have zero experience in it.

Quote:

Originally Posted by cadele

It actually doesn't take too long to proof in Word and I am keeping a list of common S&R - particularly annoying things like quotation marks and apostrophes, and the dreaded 0 vs O, 1 instead of l etc.

Do you just have a list, and you manually copy/paste/search, copy/paste/search, copy/paste/search? Or is there some sort of method where you can mass run a bunch of searches?

For example, in Sigil, there is "Saved Searches": https://web.sigil.googlecode.com/git..._searches.html

And I hear that Calibre's Editor just recently added similar functionality as well.

Quote:

Originally Posted by cadele

I don't have Sigil - I didn't think I could master it very well since I can't code or figure out even basic regex (much as I would like to be able to do this).

Bah, stop being so negative about your skills! You can do both!

HTML can be a little scary in the beginning, but if you keep everything super clean/simple (as I do), it is easy!

Quote:

Chapter 1

This is a sample sentence with bold and italic words.

This is a sample of a sentence in a blockquote.
This is a sample of a second sentence in a blockquote.

Changes into:

Quote:

<h2>Chapter 1</h2>

<p>This is a sample sentence with <b>bold</b> and <i>italic</i> words.</p>

<blockquote>
This is a sample of a sentence in a blockquote.
This is a sample of a second sentence in a blockquote.
</blockquote>

Regex can be scary in the beginning, but I don't think the ones I posted above are TOO scary... and they are extremely helpful.

So you just start out with the super basic ones, and then you build up piece by piece from there. 5 or more numbers in a row?

How about you try to get 4 or more numbers in a row?
Or pointing out instances of ONLY 3 numbers in a row?
Or try to get 4 numbers in a row followed by a comma?
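One possible set of answers to those three exercises, sketched in Python so they can be tried on a test string. These are only one way to write them, and the sample strings are invented. The "exactly 3" version uses lookarounds, `(?<![0-9])` and `(?![0-9])`, to make sure there is no digit directly before or after the three digits.

```python
import re

text = "7 42 123 1999 20000"

# 4 or more digits in a row: change the 5 in {5,} to a 4.
four_or_more = re.findall(r"[0-9]{4,}", text)

# ONLY 3 digits in a row: no digit allowed directly before or after.
exactly_three = re.findall(r"(?<![0-9])[0-9]{3}(?![0-9])", text)

# 4 digits in a row followed by a comma.
four_then_comma = re.findall(r"[0-9]{4},", "In 1999, sales rose.")

print(four_or_more)
print(exactly_three)
print(four_then_comma)
```

The last pattern will also match the final four digits of a longer number followed by a comma. Adding the same `(?<![0-9])` check in front would restrict it to numbers that are exactly four digits long.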

Quote:

Originally Posted by cadele

As far as the stats go, I am going to record the time taken and the word count, and probably whether there were any pictures etc which slow things up with the formatting.

These are the stats that I am planning on keeping for every book I convert:

Word Count
Hours to Convert
Words Per Hour (WPH)
- Derived from Words/Hours
Hours spent on overhead (Email, Changelogs, etc. etc.)
# of Pictures/Figures
# of Footnotes
- Endnotes/Footnotes?
- Symboled Footnotes? (*, †, ‡, §, ‖, ¶)
- (Endnotes are typically faster than footnotes at the bottom of each page, and Symboled Footnotes are SIGNIFICANTLY slower).
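If you keep these stats in something other than a spreadsheet, a per-book record could look like the Python sketch below. The class name, field names, and the numbers in the example are all hypothetical. The only thing taken from the list above is that Words Per Hour is derived from Words/Hours.

```python
from dataclasses import dataclass


@dataclass
class BookStats:
    """One row of conversion stats for a single book (illustrative fields)."""
    title: str
    word_count: int
    hours_to_convert: float
    overhead_hours: float = 0.0
    pictures: int = 0
    footnotes: int = 0

    @property
    def words_per_hour(self):
        # WPH is derived from Words / Hours, as in the list above.
        return self.word_count / self.hours_to_convert


# Made-up numbers, purely to show the calculation.
book = BookStats(title="Example Book", word_count=120000, hours_to_convert=10.0)
print(book.title, book.words_per_hour)
```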

I might think of a few more some time. If I ever get into actually scanning the physical books, I will probably keep track of those hours separately as well. And if I ever get more into vectorizing charts/graphs, I will keep track of those as well.

I should also keep track of how long it takes me to actually read books... it is always interesting to see those stats! I currently keep track of all of my hours spent playing Video Games, and that is extremely helpful/useful.

Quote:

Originally Posted by cadele

Your samples look good!

Thank you, I have been fishing around those PDFs/sample images the past few weeks, and they DEFINITELY blow the pants off of many of the scans that are currently out there. The few companies I do EPUB work for were definitely impressed with the quality of the PDFs....

(But as I said, I still have A TON to learn).

This method of PDF creation might be very nice in the cases where the condition of the original/older scans wasn't the greatest, (there might be writing/markings in the book, yellowed pages, water stains, ink blots, margins cut off, etc. etc.) (Take a gander at many of the Archive.org PDFs).

And these PDFs will DEMOLISH the current reprinted junk that is out there (scan -> slap on front/backmatter -> reprint, or scan -> very minor speckle cleanup -> slap on front/backmatter -> reprint).

Also, for those who DO want to read the PDF over an EPUB (I don't know who would be crazy enough to do this), this LaTeX-generated PDF will destroy the crappy PDF scans.

For example, here are some comparison shots of the first PDF I tackled using this method:

[Image: pg101Before.png]
[Image: pg101LaTeX.png]

[Image: pg261Before.png]
[Image: pg261LaTeX.png]

[Image: pg347Before.png]
[Image: pg347LaTeX.png]

(And let me tell you, try not to start off with SUPER HARD books the first time. I keep on falling into these traps; I did the same exact thing when I first started making EPUBs. Tackling the hardest books under the sun first!)

Quote:

Originally Posted by cadele

However, I have developed a deep aversion to PDF as a format after all the slaving I have done to convert from it.

Same. I DESPISE PDF (which is why I want ALL books to be digitized/reflowable, and the text to be in a very portable/searchable form)... but, there are still areas where the current ebook formats are lacking (kerning, equations, vector images (SVG, AI, EPS, ...), Indexes, footnotes, ...).

As long as you have a really clean source document, going backwards to print shouldn't take too long (for example, I was able to generate that fiction PDF in ~15 minutes (once I tackle more books and get more used to the workflow, hopefully I can get this even faster)... non-fiction (which is nearly all the books I work on) is a different beast though, MUCH more complex and more time consuming).

Anyway, I have been carrying this conversation pretty far away from its original intent (discussion of OCR).... should probably carry this conversation on elsewhere.

Perhaps we can discuss over PM. I would love to teach my methods, it would really help me refine my materials, and it might motivate me to get back into doing more Tutorials!


