Old 07-08-2016, 06:13 PM   #16
BetterRed
null operator (he/him)
Posts: 21,737
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@Doitsu - thanks for the tips re use of Grammar Checker, I was planning on trying it over the weekend.

Having cut my teeth on Algol and various assemblers, I find it 'interesting' that your sample combines the ultra-verbosity of XML with the maxi-terseness of regex. Not your fault of course, it's the way of the world as it is - more given to extremes.

FWIW - I use the calibre editor Reports to: a) eyeball the Words list filtered on '-' (this morning I found 'them-selves' and 'Wag-nails'), b) scan the Character list for 'odd-ball' characters. The ability to sort the various lists on frequency is helpful, as is the facility to save a list to a csv.

BR
Old 07-08-2016, 10:35 PM   #17
AlexBell
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
Thanks for all your responses - I've been so busy proofreading that I haven't been back on the forum. I obviously have a lot to read through and digest.

I may need to consider my method of doing ebooks. I don't use Word; I use LibreOffice. I don't use Sigil; I use the CoffeeCup HTML editor I used when I dabbled in website design years ago - though I often use Notepad++ first. CoffeeCup does have a spell checker.

My usual practice is to find a book I want to do on the Internet Archive or elsewhere, and download the pdf and ePub 'ebook' files. Then I open up the ePub ebook file and edit the HTML files within.

Another thing I've learned recently is to take more trouble to find the best possible original file. Some of the pdfs on the Internet Archive are just awful, making it much more likely that I'll read a comma as a period, and vice versa and so on. So where I can, I use the HathiTrust version to check against - their version is usually excellent, and they seem to have nearly everything I want. But of course one cannot download the files.

Thanks for the suggestions about fonts; I normally use Amasis when proofreading, but will experiment with the other font families and see if that helps.
Old 07-08-2016, 11:16 PM   #18
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Doitsu View Post
Unfortunately, neither the default LibreOffice grammar checker nor LanguageTool caught the error in the paragraph that you posted. Do you use custom settings or did you get the warning from the MS Word grammar checker?
That typo was caught by Microsoft Word's grammar check.

I must admit, I haven't touched LibreOffice in a while (I just use Notepad++ for all my writing). But the more types of tools you can throw at it, the better (certain tools might catch errors that others might miss).

Quote:
Originally Posted by Doitsu View Post
Shameless plug: In my quest to corner the Sigil wrapper plugin market, I released a LanguageTool (grammar check) validation plugin. (Validation plugin means that it'll display warnings like FlightCrew and not like LibreOffice or the standalone LanguageTool version.)
Hmmmm, very interesting. Well you convinced me to download the latest version of Sigil and test it out (I was holding out on 0.8.6 for a while while the dust settled).

The plugin felt quite rough:
  • If you doubleclick on the error, it jumps you to the paragraph (not necessarily the EXACT position the error is located)
    • Long paragraphs make it very hard to spot where the error actually occurred.
    • Is there any possible way for it to highlight the exact position in the text?
  • Is there a way to split the messages into more columns? It is quite hard to read these errors and figure out WHAT exactly it is complaining about.
    • Currently: File + Line + MessageJammedIntoOneGiantLine
    • Potential: File + Line + Reason + Sentence + Suggestion
  • Is there any possible way for it to run on the entire EPUB at once? Or am I just crazy? (Or didn't read your instructions properly). Currently I am just running it a chapter at a time.
  • Any stats/thoughts on adding the n-gram data?
    • Is there anywhere I could test this beforehand before trying to download the 8 GB beast? :P
    • About how many more errors will this point out? How many more false positives might I have to sift through, or does it do a pretty good job?

Quote:
Originally Posted by GrannyGrump View Post
@Doitsu --- I am looking forward to trying the plugin this weekend. Glad I will no longer have to schlepp my files to work and paste into MS Word for their grammar check.
PASTE into MS Word? You do know Toxaris's Tools now have "Import EPUB"? Or you could do what I did before Toxaris introduced that... Calibre convert EPUB -> RTF/HTMLZ/DOCX + Open it up in your Word Processor of choice.

Quote:
Originally Posted by BetterRed View Post
a) eyeball the Words list filtered on '-' (this morning I found 'them-selves' and 'Wag-nails'),
You are welcome.

Quote:
Originally Posted by BetterRed View Post
b) scan the Character list for 'odd-ball' characters.
Yes, this is one of the first steps I do after I OCR the book. Who knows what crazy characters might have snuck in (or accents on characters). I then go through the book and check every odd/accented character to doublecheck they are correct. Doing this pass also helps you potentially catch inconsistencies like "vis-à-vis" + "vis-a-vis" existing in the same book.

Side Note: Before Toxaris comes swooping in here, yes, his EPUB Tools also has "Check Accents".
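For anyone working outside the calibre editor, the "scan the character list" pass can be roughly approximated in a few lines of Python. This is only a sketch: the function name and the sample sentence are made up for illustration, and it simply counts every non-ASCII character so the odd ones can be eyeballed.

```python
import unicodedata
from collections import Counter

def odd_characters(text):
    """Count every non-ASCII character in text, most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [(ch, unicodedata.name(ch, "UNKNOWN"), n)
            for ch, n in counts.most_common()]

# Hypothetical sample with an accented and an unaccented spelling side by side.
sample = "vis-à-vis and vis-a-vis; a café, a cafe"
for ch, name, n in odd_characters(sample):
    print(f"{ch!r}  {name}  x{n}")
```

Seeing "à" listed with a count of 1 is the hint to go and search for the unaccented "vis-a-vis" as well.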

Quote:
Originally Posted by BetterRed View Post
The ability to sort the various lists on frequency is helpful, as is the facility to save a list to a csv.
Exporting to CSV has also been recently added to my repertoire (within the last few months). If anything of substance comes out of that research, I will also post that info on MobileRead. (Already caught a few typos that slipped by in my previous passes). :P

Again, just a different way to visualize the data might make discrepancies stand out like a sore thumb.

Last edited by Tex2002ans; 07-09-2016 at 12:38 AM.
Old 07-09-2016, 05:51 AM   #19
Jellby
frumious Bandersnatch
Posts: 7,549
Karma: 19500001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by GrannyGrump View Post
@Jellby --- I've taken a quick test drive with DP Custom Mono 2.
Is it supposed to look so rough?
Is it intended to look rough? I don't think so, but it could be, maybe the reasoning is that an ugly font will let you focus on the letters more and not on the meaning.

Is there something "wrong" in your system? I don't think so either. I haven't actually tried the font (maybe once long ago), but it takes a fair amount of work and knowledge to make a nice and smooth font, and/or sophisticated software. I guess the creators of the font didn't have either of them.
Old 07-09-2016, 06:27 AM   #20
Doitsu
Grand Sorcerer
Posts: 5,730
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by Tex2002ans View Post
It felt quite rough:
It is indeed a bit rough, but it was the best I could do with my very limited Python skills.
BTW, I found a Windows bug related to the ngram spellcheck feature that required a minor update. If you want to experiment with ngrams, you'll need to install the latest version.

As for your questions:

Quote:
Originally Posted by Tex2002ans View Post
Is there any possible way for it to highlight the exact position in the text?
Only if I hard-coded some kind of highlight style, that you'd have to remove from the many false positives.

This feature might be easier to implement in Calibre, because it's based on Python.
Maybe Kovid Goyal will implement it, if you ask him nicely.

I'll also ask KevinH, whether he could add some kind of Python-accessible highlight function, but since that would probably require a lot of work and not that many people are interested in this plugin, it's not very likely to happen.

Quote:
Originally Posted by Tex2002ans View Post
Is there a way to split the messages into more columns?
Unfortunately, the software module used for validation messages doesn't support multi-line text.

Quote:
Originally Posted by Tex2002ans View Post
Is there any possible way for it to run on the entire EPUB at once? Or am I just crazy? (Or didn't read your instructions properly).
Actually, my instructions were a bit unclear on that. By default the plugin will only check the currently selected file. If you want to check all files, either select all files or none (e.g., select the Text folder). You can also force the plugin to always check all files by changing the following value in LanguageTool.json.

Code:
"allFiles": true
(If it's not the last entry, you'll also need to add a comma at the end.)
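If hand-editing the commas feels error-prone, the same change can be made with any JSON library, which takes care of the punctuation. A minimal Python sketch follows; only the "allFiles" key comes from the post, while the function name and the sample input are hypothetical, and where LanguageTool.json lives on disk is not covered here.

```python
import json

def enable_all_files(config_text):
    """Return the JSON text with "allFiles" switched on."""
    config = json.loads(config_text)
    config["allFiles"] = True          # key name taken from the post above
    return json.dumps(config, indent=2)

# Hypothetical existing settings; a real file will have other entries.
print(enable_all_files('{"enabledOnly": true}'))
```

Because the file is parsed and written back as a whole, it does not matter whether "allFiles" ends up as the last entry or not.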

Quote:
Originally Posted by Tex2002ans View Post
Any stats/thoughts on adding the n-gram data?
It really slows LanguageTool down, but it did find some problems. It all depends on the texts that you want to check.

Quote:
Originally Posted by Tex2002ans View Post
How many more false positives might I have to sift through, or does it do a pretty good job?
It reports fewer false positives than the regular grammar check. I usually use it after the regular grammar check with a special LanguageTool.json file:

Code:
{
  "enabledOnly": true,
  "enabledRules": "CONFUSION_RULE",  
  "ngramIndexDir": "C:/ngrams",
  "ltPath": "C:/Program Files/LanguageTool-3.3/languagetool-commandline.jar", 
  "allFiles": true
}
With these settings LanguageTool will only run the ngram spellcheck. It's still rather slow.

If you want to experiment with the ngram spellcheck feature, you'll need to create a folder with an en subfolder in it and extract the ngram data files to that en folder. For example, on my machine the ngram files are in C:\ngrams\en (e.g. C:\ngrams\en\1grams).
As far as LanguageTool is concerned, ngrams is the ngram folder that you'll need to specify via ngramIndexDir.
Note also that you'll need to replace backslashes in folder names with slashes or write the backslash twice.
For example:

Code:
  "ngramIndexDir": "C:/ngrams",
or

Code:
    "ngramIndexDir": "C:\\ngrams",
BTW, the ngram spellcheck didn't flag "it original usefulness", but this could be easily added as a custom rule.
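On the backslash point: the doubling is ordinary JSON string escaping, so a JSON library will produce it automatically. A small Python illustration, where the helper name is made up and C:\ngrams is the example folder from this post:

```python
import json

def ngram_entry(folder, use_slashes=False):
    """Render the ngramIndexDir setting as a one-entry JSON object."""
    if use_slashes:
        folder = folder.replace("\\", "/")
    return json.dumps({"ngramIndexDir": folder})

print(ngram_entry(r"C:\ngrams"))                     # backslash written twice
print(ngram_entry(r"C:\ngrams", use_slashes=True))   # forward slash instead
```

Either output can be pasted into the settings file; a single unescaped backslash is what breaks the JSON.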

Last edited by Doitsu; 07-12-2016 at 07:07 AM. Reason: New version attached
Old 07-10-2016, 10:27 PM   #21
GrannyGrump
Obsessively Dedicated...
Posts: 3,221
Karma: 35037583
Join Date: May 2011
Location: PA {back in the usa!}
Device: Sony PRS-T2, ADE on PC
Well, after all the advanced technical discussions, this post is a bit like a mouse screaming at a lion, but here is a short list of frequent OCR errors I have come across. There are many more I have never noted down, but just fixed on the fly.

Maybe more folks can share their "little lists" for the edification of us all.

Some of these will be caught with spell-check,
but not all, by any means ...

OCR VILLAINS:
Spoiler:
0 <--> O {zero <--> Uppercase o}


1 l I i ! <--> each other
{digit One, lowercase L, uppercase i, lowercase i, exclamation mark}


2 <--> Z
5 <--> S
6 <--> uppercase G
7 <--> ? {question mark}
7 and / = I {uppercase I in italic}


e <--> c
are <--> arc

f ligatures confusion
ff, fi, fl, ffi

h <--> b
back <--> hack
harrow <--> barrow


H = ll
weH = well

H or h = li
Hbrary = library
hke = like

hn = lm
ahnost = almost


j <--> J {lowercase <--> uppercase J }
jane = Jane
Jury = jury


] = J
square bracket = uppercase J
]ane = Jane


rn <--> m
Mom <--> Morn
stem <--> stern
earnest = camest {this also had the e=c combo}


ri <--> n
arid <--> and

r = f
ringers = fingers


m <--> in
stein <--> stem
rmg = ring
inoth = moth


im <--> un
unport = import
imdone = undone


n <--> u
bnt = but
teut = tent
uest = nest


ii = u
iinder = under


B <--> R {uppercase}
DEABEST = DEAREST
Robby <--> Bobby

F <--> P {uppercase}
Full <--> Pull


ih = th
feaiher = feather

di = th {weird, but it happens a lot}
die = the

tii = th
tiie = the

tli = th
tlie = the


Tm == "I'm (also with no leading quote)
T = I {uppercase i}


U = double ell, li, il
WeU = Well
Ufe = life
untU = until


vv = w
vvhen = when

\V = W


y <--> v
yery = very
verv = very


/' = ," or .” {or single quote}

* = quote mark
** *' '*

'' = " {two single quotes, should be a double quote}

Space following opening quote mark
Space preceding closing quote or punctuation mark.
He did this ; then he did that ; then he said : “ You aren’t ready ! ”


Apostrophe goes missing, stranding the last letter
I m = I’m, don t = don’t, Bob s = Bob’s



@@@@@@@@@@@@@@@@@@@@@@@@@@
These following often occur with a "Smarten Punctuation" action:

Backward quote marks:
” close quote at start of paragraph
“ open quote at end of paragraph


Reversed single and double quotes in nested quotations:
“And I said to him, ‘Quit that!”’
‘“O what a tangled web we weave,’” she said.


’ Right single quote should replace "straight" apostrophe, not ‘ Left single quote. Happens often at start of a word:
‘em should be ’em, ‘tis should be ’tis
Old 07-11-2016, 05:42 AM   #22
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by AlexBell View Post
Thanks for all your responses - I've been so busy proofreading that I haven't been back on the forum. I obviously have a lot to read through and digest.
You still have to tell us all those errors in your books!

Quote:
Originally Posted by Doitsu View Post
If you want to experiment with the ngram spellcheck feature, you'll need to create a folder with an en subfolder in it and extract the ngram data files to that en folder.
I'll have to do that some time in the future. Will definitely keep your plugin on my radar and run it on old books + see if I can point out any errors that it misses.

Quote:
Maybe more folks can share their "little lists" for the edification of us all.
I have been meaning to put together one of my "lists" for so long. Maybe in the coming weeks I will have to gather the info and actually do something about it this time.

Most of the information I have directly on hand is all of the actual book typos I have come across over the years.

I stopped writing down OCR errors so many years ago, and now could probably only gather them with code comparisons between EPUB versions as I worked on them.

Quote:
1 l I i ! <--> each other
{digit One, lowercase L, uppercase i, lowercase i, exclamation mark}
Speaking of my "I963" -> "1963" example, yesterday I caught "J969". There was a speck of dust in the PDF scan at the bottom left of the "1", which caused it to OCR as "J". It reminded me that I have seen this just due to normal OCR, although it is quite rare.

Quote:
U = double ell, li, il
WeU = Well
Ufe = life
untU = until
Typically when you OCR a book this entire "class" pops up, so you can easily spot it. If this occurs, I typically just put in a capital "U" into Sigil/Calibre Spellcheck.

There probably aren't many actual words in the book with a capital U in them, so they stick out like a sore thumb... especially if you sort the Spellcheck List by Case Sensitive Sort. Anything that starts with a lowercase letter and has an uppercase "U" in it is a mistake 99% of the time.

Side Note: That type of search is better in Calibre's Spellcheck because you can do a Case Sensitive Search.
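Outside of Sigil/Calibre, the same "lowercase start, capital U somewhere later" check could be sketched as a case-sensitive regex in Python. The pattern and function name are my own guess at the rule described above, not something taken from either program:

```python
import re

# A word that starts with lowercase letters and then contains a capital U.
SUSPECT = re.compile(r"\b[a-z]+U[A-Za-z]*\b")

def suspect_words(text):
    """Return words such as "untU" or "weU" for manual review."""
    return SUSPECT.findall(text)

print(suspect_words("He waited untU the weU ran dry in the USA."))
```

Words that begin with the bad letter (like "Ufe") are not covered by this pattern and still need the sorted spellcheck list.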

Quote:
Space following opening quote mark
Space preceding closing quote or punctuation mark.
He did this ; then he did that ; then he said : “ You aren’t ready ! ”
Also want to pay attention to spaces before/after slashes. Quite often an error might creep in such as "and /or" + "and/ or".
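A rough Python sketch of that slash check, assuming plain text with the HTML already stripped (closing tags and URLs would otherwise produce noise); the pattern and names are illustrative only:

```python
import re

# A word, a slash, and a word, with whitespace on at least one side of the slash.
SLASH_SPACING = re.compile(r"\w+\s+/\s*\w+|\w+\s*/\s+\w+")

def find_slash_spacing(text):
    """Return each word/slash/word run that has a stray space in it."""
    return SLASH_SPACING.findall(text)

print(find_slash_spacing("Use and /or, and/ or, but not and/or."))
```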

Side Note: I even caught this in quite a few InDesign files as well. This is an easy error to slip by even in purely digital files.

Quote:
Apostrophe goes missing, stranding the last letter
I m = I’m, don t = don’t, Bob s = Bob’s
I typically run this Regex to catch all lowercase letters that are by themselves that are not "a":

Search: \s[b-z]\s

Similarly, I run this one too to catch all capital letters that are by themselves that are not "A" or "I":

Search: \s[B-HJ-Z]\s

Those basic Regexes do miss the odd case of that occurring anywhere near an HTML tag though. So it would miss:

<p>B ob said to go outside!</p>

or:

<p><i>Then </i>S uzy told Bob to jump over the fence.</p>

But if the book is riddled with them, then I make sure to look much more closely (and those typically get caught at other passes, or just write up a custom Regex to catch that error).

Side Note: I don't use the capitals one too often because many of the books I work on have text along these lines: "Product C and Product D" + "Person X and Y".
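One possible way to cover the "next to an HTML tag" cases is to let ">" stand in for the leading whitespace and "<" for the trailing whitespace. The Python sketch below is an assumption about how such a custom regex might look, not the one actually used; it will also flag legitimate single letters wrapped in tags, so the hits still need checking by eye.

```python
import re

# The two searches from above, widened to accept a tag edge instead of a space.
LONE_LOWER = re.compile(r"[\s>]([b-z])[\s<]")      # lone letter other than "a"
LONE_UPPER = re.compile(r"[\s>]([B-HJ-Z])[\s<]")   # lone capital other than "A"/"I"

def lone_letters(html):
    """Return stranded single letters found in a chunk of HTML or text."""
    return LONE_LOWER.findall(html) + LONE_UPPER.findall(html)

print(lone_letters("<p>B ob said to go outside!</p>"))
print(lone_letters("<p><i>Then </i>S uzy told Bob to jump over the fence.</p>"))
print(lone_letters("<p>I m sure Bob s dog won t bite.</p>"))
```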

Quote:
Reversed single and double quotes in nested quotations:
“And I said to him, ‘Quit that!”’
‘“O what a tangled web we weave,’” she said.
This is also a Search/Replace that I use:

Search: ‘“
Replace: “‘

Search: ”’
Replace: ’”

Although use those on a case-by-case basis (don't just do a huge Replace All).
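In keeping with the case-by-case advice, a script can list the candidates with some context instead of replacing them. A minimal Python sketch, where the function name and the context width are arbitrary choices:

```python
import re

# Wrong pair -> suggested pair, as in the two searches above.
SWAPS = {"‘“": "“‘", "”’": "’”"}

def find_swapped_quotes(text, context=15):
    """Return (position, surrounding text, suggested pair) for each hit."""
    hits = []
    for wrong, right in SWAPS.items():
        for m in re.finditer(re.escape(wrong), text):
            start = max(0, m.start() - context)
            hits.append((m.start(), text[start:m.end() + context], right))
    return sorted(hits)

for hit in find_swapped_quotes("“And I said to him, ‘Quit that!”’"):
    print(hit)
```

Each hit can then be accepted or skipped by hand, since a single-quoted outer quotation with a double-quoted inner one is perfectly valid in some house styles.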

Side Note: Quotation marks typically require some scrutiny, because there are a huge number of actual book typos that have crept in due to wrong nesting. As a related note, I found that parentheses + brackets follow the same rules, and also have a relatively large number of nesting errors. This was an entire class of errors that I missed until I used Toxaris's "Dialogue Check" (Pure Regex is not as good).

Quote:
’ Right single quote should replace "straight" apostrophe, not ‘ Left single quote. Happens often at start of a word:
‘em should be ’em, ‘tis should be ’tis
This is the Regex I use:

Search: ‘(Em|em|Til|til|Tis|tis|Twas|twas)
Replace: ’\1

Related is the RIGHT single quote before shortened years:

Search: ‘([0-9])
Replace: ’\1

or the RIGHT single quote before + after the "n":

Rock ’n’ Roll
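The two searches above translate directly to Python's re.sub. In this sketch the trailing \b on the word pattern is an addition of mine (so that an opening quote before a word like "employees" is left alone), and the function name is made up; a left quote in front of a digit can still be a genuine opening quote, so review the results.

```python
import re

# Left single quote before a clipped word; \b is an extra guard not in the post.
WORDS = re.compile(r"‘(Em|em|Til|til|Tis|tis|Twas|twas)\b")
# Left single quote before a shortened year.
YEARS = re.compile(r"‘([0-9])")

def fix_leading_apostrophes(text):
    """Swap the left single quote for a right one in both situations."""
    text = WORDS.sub(r"’\1", text)
    return YEARS.sub(r"’\1", text)

print(fix_leading_apostrophes("‘Tis the summer of ‘69, so give ‘em a hand"))
```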

Last edited by Tex2002ans; 07-11-2016 at 05:52 AM.
Old 07-11-2016, 06:43 AM   #23
Doitsu
Grand Sorcerer
Posts: 5,730
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by GrannyGrump View Post
OCR VILLAINS:
I had a look at the documentation for the Hunspell library, which appears to have been written by a programmer who does his taxes in binary, and found out that it's possible to add custom letter replacements to get better spelling suggestions.

Replacements need to be defined in the affix file (e.g. en_US.aff for US English), which is a plain text file that can be edited with a programmer's editor, e.g. Notepad++.

The format is as follows:

Code:
REP {number of following entries}
REP {OLD} {NEW}
For example the original replacement section in en_US.aff looks like this:

Code:
REP 94
REP nt n't
...
...
REP shun tion
REP shun sion
REP shun cion
Based on your OCR villains list, I've created a custom list, added it after the last entry and updated the replacement count to REP 127 (94 existing entries + 33 new ones):

Spoiler:
Code:
REP e c
REP c e
REP h b
REP b h
REP H ll
REP H li
REP h li
REP hn lm
REP rn m
REP m rn
REP ri n
REP n ri
REP r f
REP m in
REP in m
REP im un
REP un im
REP n u
REP ii u
REP B R
REP R B
REP F P
REP P F
REP ih th
REP di th
REP tii th
REP tli th
REP Tm "I'm
REP U ll
REP T li
REP T il
REP vv w
REP y v
REP v y


With this change in place, the first suggestion for "ahnost" is no longer stenost, but almost and the suggestion for "hke" is like instead of hike.
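For anyone who would rather script the edit than count lines by hand, here is a rough Python sketch that appends new pairs after the last REP line and bumps the count in the header. It assumes a single REP section whose first line is the "REP {count}" header; the function name and the tiny sample are hypothetical, and a real .aff file should be read and written in whatever encoding its SET line declares.

```python
def add_rep_entries(aff_text, new_pairs):
    """Append REP pairs after the last REP line and update the count header."""
    lines = aff_text.splitlines()
    rep_idx = [i for i, ln in enumerate(lines) if ln.startswith("REP ")]
    if not rep_idx:
        raise ValueError("no REP section found")
    header, last = rep_idx[0], rep_idx[-1]
    old_count = int(lines[header].split()[1])
    new_lines = [f"REP {old} {new}" for old, new in new_pairs]
    lines[header] = f"REP {old_count + len(new_lines)}"
    lines[last + 1:last + 1] = new_lines     # insert right after the old list
    return "\n".join(lines) + "\n"

# Tiny made-up affix fragment, not the real en_US.aff.
sample = "SET UTF-8\nREP 2\nREP nt n't\nREP shun tion\n"
print(add_rep_entries(sample, [("hn", "lm"), ("vv", "w")]))
```

Deriving the new count from the number of pairs actually appended keeps the header from drifting out of step with the list.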

If you want to test my modified file:

1. Go to C:\Program Files\Sigil\hunspell_dictionaries
2. Create a backup copy of en_US.aff.
3. Overwrite en_US.aff with the attached version. (You'll need to confirm a system warning.)
Attached Files
File Type: zip en_US.aff.zip (14.6 KB, 362 views)

Last edited by Doitsu; 07-22-2016 at 03:17 AM.
Old 07-12-2016, 05:26 AM   #24
Jellby
frumious Bandersnatch
Posts: 7,549
Karma: 19500001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
DP has a list of some words that will not be detected by a spell checker, but are most probably OCR errors (scannos), among them the infamous "arid" (for and) and "modem" (for modern):

http://www.pgdp.net/c/faq/wordcheck-...ite_word_lists
Old 07-14-2016, 11:58 PM   #25
AlexBell
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
Thanks, Tex2002ans, #22. I'm afraid I haven't kept a record. As I remember many of them were , instead of . and vice versa, and I instead of ! and vice versa. But many of them just shouldn't have been there at all.

The pdf originals from which the ePub files I used were made were of quite poor quality - though that's no excuse.
Old 07-15-2016, 04:36 AM   #26
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Jellby View Post
DP has a list of some words that will not be detected by a spell checker, but are most probably OCR errors (scannos), among them the infamous "arid" (for and) and "modem" (for modern):

http://www.pgdp.net/c/faq/wordcheck-...ite_word_lists
Thanks for this link, you always seem to post it, and I always seem to forget about it. I should try to embed this into my brain.

Quote:
Originally Posted by AlexBell View Post
Thanks, Tex2002ans, #22. I'm afraid I haven't kept a record. As I remember many of them were , instead of . and vice versa, and I instead of ! and vice versa. But many of them just shouldn't have been there at all.
Ahh, that is too bad. Does nobody else save all the versions of the file as they work on them?

I tend to mark all of my files with [YYYY.MM.DD] and just save them as I go along. Therefore in the future, I could easily use code comparison tools on the EPUBs to see exactly what has changed between versions.
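As an illustration of what such a comparison can look like, Python's standard difflib will produce a unified diff between two saved versions of a chapter. This is just a sketch: the dated file names are invented, and the text would first have to be pulled out of the EPUBs (which are zip archives).

```python
import difflib

def show_changes(old_text, new_text, old_name, new_name):
    """Return the unified diff between two versions as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm=""))

# Hypothetical before/after of one paragraph.
old = "He was tbe one.\nAll was well."
new = "He was the one.\nAll was well."
for line in show_changes(old, new, "[2016.07.14] ch1.xhtml", "[2016.07.15] ch1.xhtml"):
    print(line)
```

Lines starting with "-" and "+" are the before and after of each correction, which is exactly the record of typos that is otherwise lost.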

Quote:
Originally Posted by AlexBell View Post
The pdf originals from which the ePub files I used were made were of quite poor quality - though that's no excuse.
Can you link to the Archive.org versions you used + your completed EPUB?

Side Note: Here are a few common OCR errors I ran into tonight:

o£ -> of
tbe -> the
lias -> has

Roman Numeral Problems with the "V" OCRing as "Y":

Chapter XY -> Chapter XV
Chapter Y -> Chapter V
Chapter XYI -> Chapter XVI
CHAPTER XXIY -> CHAPTER XXIV
CHAPTER XXYI -> CHAPTER XXVI
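Since "Y" is never a Roman numeral, this particular problem lends itself to a scripted fix. A small Python sketch, assuming the headings always read "Chapter" or "CHAPTER" followed by one space and the numeral (the pattern and function name are mine):

```python
import re

# "Chapter"/"CHAPTER", a space, then a run of Roman-numeral letters plus the stray Y.
HEADING = re.compile(r"\b(Chapter|CHAPTER) ([IVXLCY]+)\b")

def fix_chapter_numerals(text):
    """Replace Y with V inside chapter numerals only."""
    return HEADING.sub(
        lambda m: f"{m.group(1)} {m.group(2).replace('Y', 'V')}", text)

for heading in ["Chapter XY", "Chapter Y", "Chapter XYI", "CHAPTER XXIY", "CHAPTER XXYI"]:
    print(heading, "->", fix_chapter_numerals(heading))
```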

Punctuation Errors (em dash + hyphen):

—- -> —
-— -> —

You may also want to look out for hyphens followed by a space. This needs to be decided on a case-by-case basis, because many of these are valid. Example, "This is a one- or two-hyphen error." In many cases it is either a badly recognized soft hyphen (end of line or end of page), a speck of dust, or an actual OCR error.

You may also want to make a pass looking for <sup> or <sub> tags. Sometimes OCR just goes crazy and inserts this into the text.
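The three punctuation/markup passes above could be bundled into one quick report. The Python below is only a sketch with patterns of my own choosing; it lists candidates and changes nothing, because (as noted) many hyphen-plus-space hits are valid.

```python
import re

DASH_PAIR = re.compile(r"—-|-—")             # em dash glued to a hyphen
HYPHEN_SPACE = re.compile(r"\w+-\s")          # word ending in a hyphen, then a space
SUP_SUB = re.compile(r"</?su[pb]\b[^>]*>")    # any <sup>/<sub> opening or closing tag

def report(text):
    """Collect every candidate for the three passes."""
    return {
        "dash_pair": DASH_PAIR.findall(text),
        "hyphen_space": HYPHEN_SPACE.findall(text),
        "sup_sub": SUP_SUB.findall(text),
    }

print(report("a one- or two-hyphen error—- and the 1<sup>st</sup> page"))
```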

Last edited by Tex2002ans; 07-15-2016 at 04:48 AM.
Old 09-16-2016, 03:01 AM   #27
Golden_Images
Scanning Services
Posts: 2
Karma: 10
Join Date: May 2014
Location: Missouri
Device: multiple
When proofing against a scanned and converted Word doc, try bringing up an image-only PDF file on one half of the screen and the Word doc on the other. Then slowly go through and check it against the PDF and apply corrections. When you're finished, have another pair of eyes do the same thing. That's how we do it. We call it corrective editing.
Stan
www.pdfdocument.com has more information for those who are interested.
Old 09-18-2016, 12:56 AM   #28
AlexBell
Wizard
Posts: 3,413
Karma: 13369310
Join Date: May 2008
Location: Launceston, Tasmania
Device: Sony PRS T3, Kobo Glo, Kindle Touch, iPad, Samsung SB 2 tablet
Quote:
Originally Posted by Golden_Images View Post
When proofing against a scanned and converted Word doc try bringing up an image only PDF file on half the screen and the word doc of the other. Then slowly go through and check it against the PDF and apply corrections. When you're finished have another pair of eyes do the same thing. That's how we do it. We call it corrective editing.
Stan
www.pdfdocument.com has more information for those who are interested.
Welcome to the forum, and thanks
Old 09-25-2016, 03:51 PM   #29
Gregg Bell
Gregg Bell
Posts: 2,266
Karma: 3917598
Join Date: Jan 2013
Location: Itasca, Illinois
Device: Kindle Touch 7, Sony PRS300, Fire HD8 Tablet
I'll second the vote for Balabolka. I don't know if it was mentioned or not, but as each word is spoken aloud, the text for that word is also highlighted.

I now use Linux, where there is a similar program to Balabolka named eSpeak.

I use the LibreOffice spell checker, but I also find it helpful to borrow a Windows computer and use the Word spell checker. I find that the Word spell (and grammar) checker catches things LibreOffice doesn't, like:

John went to the store fro a gallon of milk.

Last edited by Gregg Bell; 09-25-2016 at 03:56 PM.
Old 09-25-2016, 05:37 PM   #30
Doitsu
Grand Sorcerer
Posts: 5,730
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by Gregg Bell View Post
I use the LibreOffice spell checker, but I also find it helpful to borrow a Windows computer and use the Word spell checker. I find that the Word spell (and grammar) checker catches things LibreOffice doesn't, like:
By default, LibreOffice comes only with a basic spell checker. You might want to install the LanguageTool extension.

If you check your sample sentence with it, you'll get the following error message:

Quote:
Originally Posted by Gregg Bell View Post
John went to the store fro a gallon of milk.
Did you mean "for" or "from"?
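
The reason a plain spell checker misses this is that "fro" is a valid dictionary word. To make the idea concrete, here is a naive Python sketch of a "suspect word" check. It is not how LanguageTool works internally; the SUSPECTS list, the "to and fro" exception and the function name are all assumptions made up for this example.

```python
import re

# Hypothetical watch list: valid words that are more often typos than not.
SUSPECTS = {"fro": ("for", "from")}

def flag_real_word_errors(text: str) -> list[tuple[str, tuple[str, ...]]]:
    """Return (word, suggestions) for each suspect word found in the text."""
    hits = []
    for m in re.finditer(r"\b\w+\b", text):
        word = m.group(0).lower()
        if word not in SUSPECTS:
            continue
        # "to and fro" is the one common legitimate use, so skip it.
        before = text[: m.start()].lower().rstrip()
        if word == "fro" and before.endswith("to and"):
            continue
        hits.append((m.group(0), SUSPECTS[word]))
    return hits

print(flag_real_word_errors("John went to the store fro a gallon of milk."))
```

A real grammar checker uses far richer context than this, but the sketch shows why the sample sentence needs more than a dictionary lookup.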