Old 09-01-2013, 06:03 AM   #1
drake7707
Junior Member
Posts: 6
Karma: 50654
Join Date: Sep 2013
Device: Android tablet (Coolreader app)
Epub spell checker

Hi,

I noticed a lot of ebooks I downloaded still contain OCR errors and other spelling mistakes (hyphens still present, missing spaces, etc.) and I couldn't really find a good way to remove those errors in bulk. I tried in Sigil but that took a long time, so I decided to write my own tool to speed up correcting things.

I've released it as open source on https://epubspellchecker.codeplex.com/

Don't expect a one-click fix-everything button, though. Each entry still has to be reviewed manually, but it's a lot faster to hit Space to accept the suggested correction than to manually review each sentence in Sigil.

[Image violates Posting Guidelines for size - MODERATOR]

Any suggestions or remarks are welcome.

Last edited by Dr. Drib; 10-04-2017 at 09:44 AM.
Old 09-02-2013, 03:10 AM   #2
Doitsu
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Thanks for developing this software. I gave it a spin yesterday and found the concept quite interesting, because none of the spell checkers that I know of use word frequencies to suggest spelling alternatives.

It'd be nice if you could make the types of warnings/errors being displayed user-selectable. For example, in my pdf, the Unneeded hyphen warning was wrong 99% of the time and I'd like to be able to hide those suggestions. Also, the software often thought that uncommon plural forms ending in s were errors. Maybe you could add an additional check to your algorithm that tries to find the word without the final s in the dictionary before flagging it as an error.
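
A minimal Python sketch of that extra plural check, assuming the dictionary is available as a set of lowercase words. The name `is_known` and the suffix list are hypothetical and only for illustration; this is not the tool's actual code.

```python
# Hypothetical sketch: accept a word if it, or its stem without a plural
# "s"/"es", is in the dictionary. Not the tool's real implementation.
PLURAL_SUFFIXES = ("es", "s")


def is_known(word, dictionary):
    w = word.lower()
    if w in dictionary:
        return True
    for suffix in PLURAL_SUFFIXES:
        stem = w[: -len(suffix)]
        # Require a stem of at least 2 letters so "as" does not match "a".
        if w.endswith(suffix) and len(stem) >= 2 and stem in dictionary:
            return True
    return False
```

A real check would probably want language-specific rules (for example "-ies" plurals), which this sketch leaves out.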

It'd also be nice to have an option to add document/font specific suggestions. For example, in the document that I'm working on "ll" was often incorrectly recognized as "U". If the spell checker "knew" that, it could make more meaningful suggestions.

And finally, it'd be nice if the text displayed in the KWIC window could be copied to the clipboard. (Currently it's read-only.)
Old 09-02-2013, 03:59 AM   #3
drake7707
Thanks for the feedback, I'll add a preferences dialog where settings can be changed.

If you look in the folder you'll find an ocrpatterns.txt file, which holds the OCR patterns. I had an e-book where the pattern U -> li occurred a lot, which is probably the same pattern you had issues with. You can add new patterns or remove ones that get in the way. I'll try to streamline this as well.
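
To illustrate how such patterns can produce suggestions, here is a rough Python sketch. The `wrong -> right` line format is an assumption (the real ocrpatterns.txt syntax may differ), and `parse_patterns` and `suggest` are hypothetical names.

```python
# Hypothetical sketch of pattern-based suggestions. The "wrong -> right"
# line format is an assumption, not the documented ocrpatterns.txt syntax.

def parse_patterns(lines):
    patterns = []
    for line in lines:
        if "->" not in line:
            continue  # skip blank or malformed lines
        wrong, right = (part.strip() for part in line.split("->", 1))
        if wrong:
            patterns.append((wrong, right))
    return patterns


def suggest(word, patterns, dictionary):
    """Return dictionary words reachable by one pattern substitution."""
    found = []
    for wrong, right in patterns:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate.lower() in dictionary and candidate not in found:
                found.append(candidate)
            start = word.find(wrong, start + 1)
    return found
```

Each occurrence of a pattern is tried on its own, so a word with several matches yields several candidates and only those found in the dictionary are kept.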


Edit:

I've made a new release:

- Changed high probability test to be more useful
- Added preferences dialog where you can enable/disable tests and some other settings
- Added additional suffix test (words ending with -s, -ing)
- Backbuffered occurrence list to reduce flicker
- Fixed: entries with an html-encoded href (like %20) would give an error
- Space hotkey added on the list of occurrences to quickly toggle ignore of that occurrence
- Fixed: statistics of word occurrences in status bar didn't take partial ignores into account
- Ctrl+C shortcut for copying the selected line

Last edited by drake7707; 09-02-2013 at 09:12 AM.
Old 09-02-2013, 12:39 PM   #4
Doitsu
Thanks for implementing some of my suggestions.



I just checked an epub with it that I already spell-checked with Sigil and found a handful of errors that Sigil missed.
All this tool needs now is a simple one-page .html online help file for those who don't always have access to the Internet.
Also don't forget to mention in the help that ocrpatterns.txt can be edited, because it's not exposed via the UI.

Quote:
Originally Posted by drake7707 View Post
...
- Ctrl+C shortcut for copying the selected line
I almost missed this one, because I'm used to right-click menu items, but CTRL+C works, too.
Old 09-14-2013, 05:31 AM   #5
BobC
Guru
Posts: 691
Karma: 3026110
Join Date: Dec 2008
Location: Lancashire, U.K.
Device: BeBook 1, BeBook Pure, Kobo Glo, (and HD),Energy Sistem EReader Pro +
This looks very useful - I tried to launch it from within Calibre (using the "Open With" add-on) but it seems to complain about the lack of a dictionary (no complaints when opening it normally). It also ignores the current epub file and opens with no file loaded.

This is not a complaint as I'm trying to make it work in a way not intended but I would probably use it more if it could be launched in this way.

I do like the American/British English dictionary supplied - I prefer British English but a lot of the books I work on have been Americanised.

BobC
Old 10-03-2013, 11:14 AM   #6
At_Libitum
Addict
Posts: 265
Karma: 724240
Join Date: Aug 2013
Device: KyBook
I like the initiative but I think there is still room for improvement. It flagged a lot of perfectly validly spelled words as 'Possible OCR error'. I have no idea whether this is possible, but I think you could reduce this a lot by looking at the context the word is used in. Not every 'h' is a misread 'b' or vice versa, and the same goes for 'l' and 'f' versus 't'.

I also found, by accident, that it seems to suffer from OCR blindness itself too. I was trying it out on "Three Men in a Boat" from Jerome K. Jerome, and there is a sentence in the ePub going as follows:

"Harris, in moving about, trod on George’s corn."

Epub spellchecker actually read the corn as "com" and also flagged it as such. AND gave the same word as suggested replacement.
(see attachment)

Also, about the unneeded hyphens, not being a native English speaker I would need to study up on when and where they are normally used but I am almost sure there are words requiring them. I have not yet looked too deep but I assume there is some kind of exception list someplace so that not everything containing hyphens is flagged as such.

EDIT: PS. It may not have had this when you started Epub spellchecker, but the current Sigil build offers a similar approach. The Spellcheck button gives you, as in Epub spellchecker, a list of presumed misspellings plus frequency counts, and likewise lets you replace all occurrences of a word at once. What it doesn't do is replace every 'misspelling' in one go, which may be a bit too aggressive anyway, because you'll always need to review the list to make sure only true misspellings are replaced. So in the end you still spend the same amount of time. That aside, yours does offer more information: it tries to categorize the type of spelling error and, more importantly, it shows the context.

Suggestion: Extend the Options filter to include all types of possible errors so that you can filter on each category separately instead of on "Show only errors & warnings" (and also of course have the 'Copy all suggestions ...' then only affect the filtered list)

Suggestion2: About the context preview. Would be cool if it could show the rendered version instead of the html code itself. Also there seems to be some extra useless space inserted before the bolded misspelling. Don't think you need that to accentuate the misspelling if you already bold the word.
Attached thumbnail: funny-non-error.png

Last edited by At_Libitum; 10-03-2013 at 12:11 PM.
Old 10-05-2013, 03:33 AM   #7
drake7707
Quote:
Originally Posted by BobC View Post
This looks very useful - I tried to launch it from within Calibre (using the "Open With" add-on) but it seems to complain about the lack of a dictionary (no complaints when opening it normally). It also ignores the current epub file and opens with no file loaded.

This is not a complaint as I'm trying to make it work in a way not intended but I would probably use it more if it could be launched in this way.

I do like the American/British English dictionary supplied - I prefer British English but a lot of the books I work on have been Americanised.

BobC
Calibre probably passes the epub path as the first command-line argument; I'll add support for that. I'll also look for dictionary.txt next to the exe rather than in the working directory. The working directory is most likely the Calibre exe folder when the program is started from inside Calibre.

You can swap out the dictionary.txt in the program folder with a British version if you want. I combined a few dictionaries I found online until I had a reasonable database. The file is plain text with one word per line.
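
For illustration, a small Python sketch of both ideas: resolving dictionary.txt relative to the program instead of the working directory, and reading the epub path from the first argument. The function names are hypothetical, and the actual tool is not written like this.

```python
# Hypothetical sketch: find dictionary.txt next to the program itself,
# not in the current working directory, and read an optional epub path
# from the first command-line argument.
import sys
from pathlib import Path


def program_dir(argv0):
    return Path(argv0).resolve().parent


def parse_dictionary(lines):
    # One word per line; blank lines are ignored.
    return {line.strip().lower() for line in lines if line.strip()}


def load_dictionary(path):
    with open(path, encoding="utf-8") as handle:
        return parse_dictionary(handle)


def epub_from_args(argv):
    return Path(argv[1]) if len(argv) > 1 else None


if __name__ == "__main__":
    dictionary_path = program_dir(sys.argv[0]) / "dictionary.txt"
    if dictionary_path.exists():
        words = load_dictionary(dictionary_path)
        print(len(words), "words loaded; epub:", epub_from_args(sys.argv))
```

Because the path is built from the program location, launching through an "Open With" style add-on no longer depends on which folder the launcher happens to run from.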
Old 10-05-2013, 03:47 AM   #8
drake7707
Quote:
Originally Posted by At_Libitum View Post
I like the initiative but I think there is still room for improvement. It flagged a lot of perfectly validly spelled words as 'Possible OCR error'. I have no idea whether this is possible, but I think you could reduce this a lot by looking at the context the word is used in. Not every 'h' is a misread 'b' or vice versa, and the same goes for 'l' and 'f' versus 't'.
The problem is that I don't have the additional info available in the current dictionary.txt file. I don't know if a word is a verb or adjective etc. If I had that info I could maybe reduce the number of false positives.

Quote:
Originally Posted by At_Libitum View Post
I also found, by accident, that it seems to suffer from OCR blindness itself too. I was trying it out on "Three Men in a Boat" from Jerome K. Jerome, and there is a sentence in the ePub going as follows:

"Harris, in moving about, trod on George’s corn."

Epub spellchecker actually read the corn as "com" and also flagged it as such. AND gave the same word as suggested replacement.
(see attachment)
Yes, this is intentional, albeit annoying when it occurs. I had a lot of OCR errors in the books I tested that were valid words but still wrong in context. That's why I also run the OCR patterns over all the valid words, to see whether a pattern changes them into other words that also occur in the book. The only difference between 'Probable OCR error' and 'Possible OCR error' is that the former means the OCR pattern has already been used to build suggestions for words that weren't recognized. You can turn this behaviour off in the options, though.
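
Roughly, in Python, that check on valid words could look like the sketch below. The names are hypothetical, and the rule that the candidate must be more frequent than the flagged word is an extra assumption added here to cut down on noise, not something the tool is confirmed to do.

```python
# Hypothetical sketch: flag a valid word as a possible OCR error when one
# pattern substitution turns it into another word that occurs in the book.
from collections import Counter


def possible_ocr_errors(book_words, patterns):
    counts = Counter(w.lower() for w in book_words)
    flagged = {}
    for word in counts:
        for wrong, right in patterns:
            if wrong not in word:
                continue
            candidate = word.replace(wrong, right, 1)
            # Assumption: only suggest words the book uses more often.
            if candidate != word and counts[candidate] > counts[word]:
                flagged[word] = candidate
    return flagged
```

With the pattern b -> h, a book containing "he" three times and "be" once would get "be" flagged with "he" as the suggestion, which shows why every hit still needs a manual look.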

Edit: oh wait, I read this wrong. I'll check it out. It might just be a font rendering issue, though, where 'rn' looks like 'm'.

Quote:
Originally Posted by At_Libitum View Post
Also, about the unneeded hyphens, not being a native English speaker I would need to study up on when and where they are normally used but I am almost sure there are words requiring them. I have not yet looked too deep but I assume there is some kind of exception list someplace so that not everything containing hyphens is flagged as such.
Not being a native English speaker myself, I was hoping that hyphenated words were included in the dictionary.txt file and thus not flagged as 'Unnecessary hyphens'. This is one I have difficulty with when correcting books, because I don't know the spelling of most of those hyphenated words (and it also seems to vary from book to book).

Quote:
Originally Posted by At_Libitum View Post
EDIT: PS. It may not have had this when you started Epub spellchecker, but the current Sigil build offers a similar approach. The Spellcheck button gives you, as in Epub spellchecker, a list of presumed misspellings plus frequency counts, and likewise lets you replace all occurrences of a word at once. What it doesn't do is replace every 'misspelling' in one go, which may be a bit too aggressive anyway, because you'll always need to review the list to make sure only true misspellings are replaced. So in the end you still spend the same amount of time. That aside, yours does offer more information: it tries to categorize the type of spelling error and, more importantly, it shows the context.
It did; my spell checker is relatively new. You can exclude individual lines from correction, though. If you select lines in the occurrence list and click the ignore button (to the right of it), you'll see that the selected occurrences are greyed out and won't be corrected. The classic example is "die" vs "the": a lot of the time "die" should be "the", and the occurrences where "die" is valid can simply be greyed out while keeping the die -> the correction.

Quote:
Originally Posted by At_Libitum View Post
Suggestion: Extend the Options filter to include all types of possible errors so that you can filter on each category separately instead of on "Show only errors & warnings" (and also of course have the 'Copy all suggestions ...' then only affect the filtered list)
I'll probably add more filters, it gets a bit cluttered now.

Quote:
Originally Posted by At_Libitum View Post
Suggestion2: About the context preview. Would be cool if it could show the rendered version instead of the html code itself. Also there seems to be some extra useless space inserted before the bolded misspelling. Don't think you need that to accentuate the misspelling if you already bold the word.
I'll try, but currently I don't parse any html at all. I ignore all tags, replace the escaped characters (like &quot;) with their unescaped form, and then start tokenizing words. I'll see if I can find a library that can show html while retaining the current functionality.
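
As a rough Python illustration of that pipeline (drop tags, unescape entities, tokenize): the regexes and the name `tokenize` are hypothetical, and regex tag stripping is a simplification, not a real HTML parser.

```python
# Hypothetical sketch of the pipeline described above: drop tags, unescape
# entities, then split into words. Not the tool's actual code.
import html
import re

TAG = re.compile(r"<[^>]+>")
# Letters only, optionally joined by an apostrophe or hyphen ("don't").
WORD = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")


def tokenize(markup):
    # Strip tags first so an escaped "&lt;" is never mistaken for a tag.
    text = html.unescape(TAG.sub("", markup))
    return WORD.findall(text)
```

Removing tags without leaving a separator is what glues inline elements to the neighbouring word, so this simple version shares that limitation.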

Last edited by drake7707; 10-05-2013 at 04:43 AM.
Old 10-24-2013, 10:15 PM   #9
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
I downloaded your program the day it was posted, and I just used it for the first time today on one EPUB.

Quote:
Originally Posted by drake7707 View Post
Not being a native English speaker myself, I was hoping that hyphenated words were included in the dictionary.txt file and thus not flagged as 'Unnecessary hyphens'. This is one I have difficulty with when correcting books, because I don't know the spelling of most of those hyphenated words (and it also seems to vary from book to book).
SUGGESTIONS:

ONE

I wanted to point to something which might be helpful with hyphenations:

In English, there are many Prefixes: https://en.wikipedia.org/wiki/English_prefixes

Currently, your program marks all of these as "Unneeded hyphen". Perhaps hyphened words that start with these can be marked with a "Prefix" class instead.

TWO

Since you use frequencies, you should DEFINITELY mark (in a different color if possible) cases where both hyphenated and non-hyphenated versions of a word exist in a book at the same time:

"step-father" + "stepfather"
"mis-information" + "misinformation"
"business-man" + "businessman"
"life-like" + "lifelike"
[...]

As you stated, each book might hyphenate or not hyphenate these words, but it is almost always an error when they are mixed and matched.
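
Something along these lines, as a Python sketch. `mixed_hyphenation` is a hypothetical name and the input is assumed to be the book's word list.

```python
# Hypothetical sketch: report words that occur both with and without
# hyphens in the same book, together with their frequencies.
from collections import Counter


def mixed_hyphenation(words):
    counts = Counter(w.lower() for w in words)
    mixed = {}
    for word, n in counts.items():
        if "-" in word:
            joined = word.replace("-", "")
            if joined in counts:
                # hyphenated form -> (its count, joined form, joined count)
                mixed[word] = (n, joined, counts[joined])
    return mixed
```

The two counts hint at which spelling the book prefers, so the rarer form can be offered as the likely mistake.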

THREE

Throughout my EPUBs, there are a massive number of page numbers (not to mention an Index). Your program flags all of these hyphenated numbers and clutters the list:

"97-98" -> "p. 97-98"
"127-28" -> "pp. 121, 127-28, 185"

Also, numbers might be separated by an en dash instead of hyphen.

Perhaps these can be marked under the "Number" category as well.
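
A possible test for this, sketched in Python. The pattern and the name `is_number_range` are only an illustration and cover plain digit ranges with a hyphen or an en dash.

```python
# Hypothetical sketch: recognize page ranges such as "97-98" or "127–28".
import re

# Digits, then a hyphen or en dash (U+2013), then digits.
NUMBER_RANGE = re.compile(r"^\d+[-\u2013]\d+$")


def is_number_range(token):
    return bool(NUMBER_RANGE.match(token))
```

Tokens that match could go under "Number" instead of "Unneeded hyphen".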

BUGS:

"self-" is definitely missing from your hyphenations (your current program marked these as "Missing spaces"). So adding in those Prefixes should help fix many of these "Missing spaces" errors.

Your program said "thought1" was misspelled:

Actual Code:

Code:
  <p>Marshall in this regard makes his own thought<sup>1</sup> entirely clear:</p>
Code as it appears in your program:

Code:
  <p>Marshall in this regard makes his own thought1</sup> entirely clear:</p>
Your program said "p8" was misspelled:

Actual Code:

Code:
  <p>And further, in a note on the same pages: “Then p<sub>1</sub> p<sub>2</sub> . . . p<sub>8</sub> are points on his demand curve for tea; . . .” [...]
Code as it appears in your program:

Code:
  <p>And further, in a note on the same pages: “Then p<sub>1</sub> p<sub>2</sub> . . . p8</sub> are points on his demand curve for tea; . . .” [...]
Perhaps superscript and subscript errors could be treated slightly differently.
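
One way to do that, as a Python sketch with hypothetical names: turn sup/sub tags into a space before stripping the remaining tags, so the footnote number or index becomes its own token.

```python
# Hypothetical sketch: keep superscript/subscript content separate from
# the preceding word when removing markup.
import re

SUP_SUB = re.compile(r"</?(?:sup|sub)\b[^>]*>", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def strip_markup(markup):
    # Replace sup/sub tags with a space so "thought<sup>1</sup>" does not
    # collapse into "thought1".
    spaced = SUP_SUB.sub(" ", markup)
    # Drop all other tags and normalize whitespace.
    return " ".join(TAG.sub("", spaced).split())
```

This is regex-based and only meant to show the idea; a proper HTML parser would be more robust.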

I will definitely be posting more errors as I find them.

Side Note: Should this be in the EPUB forum instead of "Reading and Management"?

Last edited by Tex2002ans; 10-24-2013 at 10:17 PM.
Old 10-26-2013, 06:09 PM   #10
Tex2002ans
Forgive the double-post, but another suggestion just popped into my head.

Along with the "nonhyphenated" + "non-hyphenated" words potentially being a mistake, accented words should also be compared with their non-accented versions:

"résumé" + "resume"
"coöperation" + "cooperation"
"rôle" + "role"

Usually the book sticks with one style, and when these are mixed and matched, the OCR has messed up (or even the original book mistakenly forgot to add accents in some places).
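
A Python sketch of that comparison, using Unicode decomposition to strip the accents. `strip_accents` and `mixed_accents` are hypothetical names.

```python
# Hypothetical sketch: find words that appear in the same book both with
# and without accents.
import unicodedata
from collections import Counter


def strip_accents(word):
    # Decompose letters (NFD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def mixed_accents(words):
    counts = Counter(w.lower() for w in words)
    return {
        word: strip_accents(word)
        for word in counts
        if strip_accents(word) != word and strip_accents(word) in counts
    }
```

An accented word is only reported when its plain form also occurs in the book, so a consistently accented book produces no hits.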
Old 05-03-2014, 03:01 AM   #11
drake7707
I've created a new release that addresses some of those issues described above. I'm not entirely sure how to tackle the hyphenation issues with all the prefixes yet, so that's not included.

Quote:
- Added unnecessary diacritics test (if OCR introduced accents and umlauts etc)
- Handled numbers in hyphenation (like 98-99, 5-6 etc), those are almost always page numbers and never an error
- Handled subscript and superscript so they aren't read as part of one word
- Open epub files from command line (so you can do open with ...)
Old 09-29-2017, 11:49 AM   #12
Namenlos
Enthusiast
Posts: 37
Karma: 10
Join Date: Jul 2014
Device: Kobo Mini
Quote:
Originally Posted by Tex2002ans View Post
[…]Along with the "nonhyphenated" + "non-hyphenated" words potentially being a mistake, accented words should also be compared with their non-accented versions:

"résumé" + "resume"
"coöperation" + "cooperation"
"rôle" + "role"

Usually the book sticks with one style, and when these are mixed and matched, the OCR has messed up (or even the original book mistakenly forgot to add accents in some places).
This leads to problems in German, as hatten/hätten (and möchte/mochte) are different, very common words that are all in my dictionary.txt, yet they get marked as "Unnecessary diacritics".
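
One possible guard, sketched in Python: skip the warning whenever the accented word is itself a dictionary word. The function names are hypothetical and this is not how the tool currently behaves.

```python
# Hypothetical sketch: only flag "unnecessary diacritics" when the accented
# word is NOT a dictionary word but its plain form is.
import unicodedata


def strip_accents(word):
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def flag_unnecessary_diacritics(word, dictionary):
    plain = strip_accents(word)
    if plain == word:
        return False  # no diacritics to strip
    if word.lower() in dictionary:
        return False  # real word, e.g. German "hätten" next to "hatten"
    return plain.lower() in dictionary
```

This relies on the dictionary actually containing the accented forms, which is the case in the situation described here.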

Anyway, I found the project on GitHub: https://github.com/drake7707/epubspellchecker Thanks for releasing the source! Could you please add a license, so that people can file pull requests etc.?
Old 10-06-2017, 09:38 AM   #13
drake7707
Quote:
Originally Posted by Namenlos View Post
Anyway, I found the project on GitHub: https://github.com/drake7707/epubspellchecker Thanks for releasing the source! Could you please add a license, so that people can file pull requests etc.?
I moved it because CodePlex is shutting down. It seems I forgot to add the MIT license; it's updated now.