09-01-2013, 06:03 AM | #1 |
Junior Member
Posts: 6
Karma: 50654
Join Date: Sep 2013
Device: Android tablet (Coolreader app)
|
Epub spell checker
Hi,
I noticed that a lot of the ebooks I downloaded still contain OCR errors and other spelling mistakes (leftover hyphens, missing spaces, etc.), and I couldn't find a good way to remove those errors in bulk. I tried Sigil, but that took a long time, so I decided to write my own tool to speed up the corrections. I've released it as open source at https://epubspellchecker.codeplex.com/
Don't expect a one-click fix-everything button, though: each entry still has to be reviewed manually, but it's a lot faster to hit 'space' to accept the suggested correction than to review each sentence in Sigil. [Image violates Posting Guidelines for size - MODERATOR] Any suggestions or remarks are welcome. Last edited by Dr. Drib; 10-04-2017 at 09:44 AM. |
09-02-2013, 03:10 AM | #2 |
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Thanks for developing this software. I gave it a spin yesterday and found the concept quite interesting, because none of the spell checkers that I know of uses word frequencies to suggest spelling alternatives.
It'd be nice if you could make the types of warnings/errors being displayed user-selectable. For example, in my pdf, the "Unneeded hyphen" warning was wrong 99% of the time, and I'd like to be able to hide those suggestions. The software also often thought that uncommon plural forms ending in s were errors. Maybe you could add an additional check to your algorithm that tries to find the word without the final s in the dictionary before flagging it as an error. It'd also be nice to have an option to add document/font-specific suggestions. For example, in the document that I'm working on, "ll" was often incorrectly recognized as "U". If the spell checker "knew" that, it could make more meaningful suggestions. And finally, it'd be nice if the text displayed in the KWIC window could be copied to the clipboard. (Currently it's read-only.) |
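The plural check suggested above could look roughly like the sketch below. This is only an illustration, not the tool's actual code: the function name `is_known` and the sample dictionary are made up for the example.

```python
def is_known(word, dictionary):
    """Return True if the word, or the word minus a final 's', is in the dictionary."""
    w = word.lower()
    if w in dictionary:
        return True
    # Hypothetical plural rule: retry without the trailing 's' before flagging.
    if w.endswith("s") and len(w) > 2:
        return w[:-1] in dictionary
    return False


dictionary = {"gazebo", "walk"}      # stand-in for a real word list
print(is_known("gazebos", dictionary))
print(is_known("gazebox", dictionary))
```

A rule this simple will also accept some non-words (any dictionary word with a stray "s" appended), so it trades a few missed errors for fewer false alarms.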
09-02-2013, 03:59 AM | #3 |
Junior Member
Posts: 6
Karma: 50654
Join Date: Sep 2013
Device: Android tablet (Coolreader app)
|
Thanks for the feedback, I'll add a preferences dialog where settings can be changed.
If you look in the program folder you'll find an ocrpatterns.txt file, where I put the OCR patterns. I had an e-book where the pattern U -> li occurred a lot, which is probably the same pattern you had issues with. You can add new patterns or remove ones that get in the way; I'll try to streamline this as well.
Edit: I've made a new release:
- Changed high probability test to be more useful
- Added preferences dialog where you can enable/disable tests and some other settings
- Added additional suffix test (words ending with -s, -ing)
- Backbuffered occurrence list to reduce flicker
- Entries with an HTML-encoded href like %20 would give an error
- Space hotkey added on the list of occurrences to quickly toggle ignore of that occurrence
- Statistics of word occurrences in the status bar didn't take partial ignores into account
- Ctrl+C shortcut for copying the selected line
Last edited by drake7707; 09-02-2013 at 09:12 AM. |
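To illustrate how a list of OCR patterns can drive suggestions, here is a minimal sketch. The format of ocrpatterns.txt is not described in this thread, so the (wrong, right) pairs, the function name `ocr_candidates`, and the tiny dictionary are all assumptions made for the example.

```python
def ocr_candidates(word, patterns, dictionary):
    """Apply each (wrong, right) pattern at every position; keep dictionary hits."""
    found = []
    for wrong, right in patterns:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate.lower() in dictionary and candidate not in found:
                found.append(candidate)
            start = word.find(wrong, start + 1)
    return found


# Hypothetical patterns: what the OCR produced -> what was probably printed.
patterns = [("U", "ll"), ("U", "li"), ("rn", "m")]
dictionary = {"all", "small", "corn"}
print(ocr_candidates("smaU", patterns, dictionary))
```

Only one substitution is tried at a time here; a word with two separate OCR errors would need the patterns applied in combination.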
09-02-2013, 12:39 PM | #4 |
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Thanks for implementing some of my suggestions.
I just checked an epub with it that I had already spell-checked with Sigil, and it found a handful of errors that Sigil missed. All this tool needs now is a simple single-page .html online help file for those who don't always have access to the Internet. Also, don't forget to mention in the help that ocrpatterns.txt can be edited, because it's not exposed via the UI. I almost missed the copy shortcut, because I'm used to right-click menu items, but CTRL+C works, too. |
09-14-2013, 05:31 AM | #5 |
Guru
Posts: 691
Karma: 3026110
Join Date: Dec 2008
Location: Lancashire, U.K.
Device: BeBook 1, BeBook Pure, Kobo Glo, (and HD),Energy Sistem EReader Pro +
|
This looks very useful - I tried to launch it from within Calibre (using the "Open With" add-on) but it seems to complain about the lack of a dictionary (no complaints when opening it normally). It also ignores the current epub file and opens with no file loaded.
This is not a complaint as I'm trying to make it work in a way not intended but I would probably use it more if it could be launched in this way. I do like the American/British English dictionary supplied - I prefer British English but a lot of the books I work on have been Americanised. BobC |
10-03-2013, 11:14 AM | #6 |
Addict
Posts: 265
Karma: 724240
Join Date: Aug 2013
Device: KyBook
|
I like the initiative, but I think there is still room for improvement. It flagged a lot of perfectly valid words as 'Possible OCR error'. I have no idea whether this is possible, but I think you could reduce this a lot by looking at the context the word is used in. Not every 'h' is the result of a misrecognized 'b', or vice versa; the same goes for 'l' and 'f' versus 't'.
I also found, by accident, that it seems to suffer from OCR blindness itself. I was trying it out on "Three Men in a Boat" by Jerome K. Jerome, and there is a sentence in the ePub that goes as follows: "Harris, in moving about, trod on George’s corn." Epub spellchecker actually read "corn" as "com" and flagged it as such, AND gave the same word as the suggested replacement (see attachment).
Also, about the unneeded hyphens: not being a native English speaker, I would need to study up on when and where they are normally used, but I am almost sure there are words that require them. I have not looked too deeply yet, but I assume there is some kind of exception list somewhere so that not everything containing a hyphen is flagged.
EDIT: PS. It may not have had this when you started ePub spellchecker, but the current Sigil build offers a similar approach. The Spellcheck button gives you, as in ePub spellchecker, a list of presumed misspellings plus frequency counts, and similarly you can have all occurrences of a word replaced at once, but not every 'misspelling' at once, which would be a bit too aggressive anyway because you always need to review the list to make sure only true misspellings are replaced. So in the end you still spend the same amount of time. That aside, yours does offer more information in that it tries to categorize the type of spelling error and, more importantly, shows the context.
Suggestion: Extend the Options filter to include all types of possible errors, so that you can filter on each category separately instead of only on "Show only errors & warnings" (and, of course, have 'Copy all suggestions ...' then affect only the filtered list).
Suggestion2: About the context preview. It would be cool if it could show the rendered version instead of the HTML code itself. Also, there seems to be some extra, unneeded space inserted before the bolded misspelling. I don't think you need that to accentuate the misspelling if you already bold the word. Last edited by At_Libitum; 10-03-2013 at 12:11 PM. |
10-05-2013, 03:33 AM | #7 | |
Junior Member
Posts: 6
Karma: 50654
Join Date: Sep 2013
Device: Android tablet (Coolreader app)
|
Quote:
You can swap out the dictionary.txt in the program folder for a British version if you want. I combined a few dictionaries I found online until I had a reasonable database. The .txt file is a plain text file with one word per line. |
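For anyone who wants to build a replacement word list, a one-word-per-line file is easy to read and merge. The sketch below is an assumption-laden illustration: the helper name `load_words`, the lowercase normalization, and the sample words are not taken from the tool.

```python
import io


def load_words(lines):
    """Build a lowercase word set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


# In-memory stand-ins for two word-list files.
british = load_words(io.StringIO("colour\nrealise\n\n"))
american = load_words(io.StringIO("color\nrealize\n"))
combined = british | american
print(sorted(combined))
```

With a real file, the same function can be fed an open file object, for example `load_words(open("dictionary.txt", encoding="utf-8"))`; the encoding here is a guess and may need adjusting.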
|
10-05-2013, 03:47 AM | #8 | ||||||
Junior Member
Posts: 6
Karma: 50654
Join Date: Sep 2013
Device: Android tablet (Coolreader app)
|
Quote:
Quote:
Edit: oh wait, I read this wrong. I'll check it out; it might just be a font rendering issue, though, where "rn" looks like "m". Quote:
Quote:
Quote:
Quote:
Last edited by drake7707; 10-05-2013 at 04:43 AM. |
||||||
10-24-2013, 10:15 PM | #9 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
I downloaded your program the day it was posted, and I just used it for the first time today on one EPUB.
Quote:
ONE
I wanted to point to something that might be helpful with hyphenation: in English, there are many prefixes: https://en.wikipedia.org/wiki/English_prefixes
Currently, your program marks all of these as "Unneeded hyphen". Perhaps hyphenated words that start with these could be marked with a "Prefix" class instead.
TWO
Since you use frequencies, you should DEFINITELY mark (in a different color if possible) cases where both the hyphenated and non-hyphenated versions of a word exist in a book at the same time:
"step-father" + "stepfather"
"mis-information" + "misinformation"
"business-man" + "businessman"
"life-like" + "lifelike"
[...]
As you stated, each book might hyphenate or not hyphenate these words, but it is almost always an error when they are mixed and matched.
THREE
Throughout my EPUBs, there is a massive number of page numbers (not to mention an Index). Your program flags all of these hyphenated numbers, which clutters the list:
"97-98" -> "p. 97-98"
"127-28" -> "pp. 121, 127-28, 185"
Also, numbers might be separated by an en dash instead of a hyphen. Perhaps these can be marked under the "Number" category as well.
BUGS:
"self-" is definitely missing from your hyphenations (your current program marked these as "Missing spaces"). Adding those prefixes should help fix many of these "Missing spaces" errors.
Your program said "thought1" was misspelled:
Actual Code: Code:
<p>Marshall in this regard makes his own thought<sup>1</sup> entirely clear:</p> Code:
<p>Marshall in this regard makes his own thought1</sup> entirely clear:</p>
Actual Code: Code:
<p>And further, in a note on the same pages: “Then p<sub>1</sub> p<sub>2</sub> . . . p<sub>8</sub> are points on his demand curve for tea; . . .” [...] Code:
<p>And further, in a note on the same pages: “Then p<sub>1</sub> p<sub>2</sub> . . . p8</sub> are points on his demand curve for tea; . . .” [...]
I will definitely be posting more errors as I find them. Side Note: Should this be in the EPUB forum instead of "Reading and Management"? Last edited by Tex2002ans; 10-24-2013 at 10:17 PM. |
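The "Prefix" and "Number" categories proposed in this post could be separated out with a small classifier along these lines. This is a rough sketch only: the category names, the short prefix list, and the function `classify_hyphenated` are invented for illustration and would need a much fuller prefix list in practice.

```python
import re

# Hyphen or en dash between two runs of digits, e.g. "97-98" or "127-28".
NUMBER_RANGE = re.compile(r"^\d+[-\u2013]\d+$")

# A few common English prefixes; a real list would be much longer.
PREFIXES = ("self", "non", "anti", "pre", "post", "co", "ex", "mis")


def classify_hyphenated(token):
    """Return 'number', 'prefix', or 'other' for a hyphenated token."""
    if NUMBER_RANGE.match(token):
        return "number"
    head = token.split("-", 1)[0].lower()
    if head in PREFIXES:
        return "prefix"
    return "other"


for token in ["97-98", "127\u201328", "self-evident", "step-father"]:
    print(token, classify_hyphenated(token))
```

Tokens classified as "number" or "prefix" could then be hidden or listed separately, leaving only the "other" group for the unneeded-hyphen check.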
|
10-26-2013, 06:09 PM | #10 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Forgive the double-post, but another suggestion just popped into my head.
Along with the "nonhyphenated" + "non-hyphenated" words potentially being a mistake, accented words should also be compared with their non-accented versions:
"résumé" + "resume"
"coöperation" + "cooperation"
"rôle" + "role"
Usually a book sticks with one style, and when these are mixed and matched, the OCR has messed up (or the original book itself forgot to add accents in some places). |
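Both checks (hyphenated vs. non-hyphenated, accented vs. non-accented) can be treated as one problem: reduce each word to a comparison key and report keys that occur with more than one spelling. The sketch below assumes that approach; the names `fold` and `mixed_variants` are hypothetical, and the tool may do this differently or not at all.

```python
import unicodedata
from collections import defaultdict


def fold(word):
    """Lowercase, drop hyphens, and strip combining accents to get a comparison key."""
    decomposed = unicodedata.normalize("NFD", word.lower().replace("-", ""))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def mixed_variants(words):
    """Group words by folded key; return only keys seen with more than one spelling."""
    groups = defaultdict(set)
    for word in words:
        groups[fold(word)].add(word.lower())
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


words = ["stepfather", "step-father", "rôle", "role", "résumé", "boat"]
for key, forms in mixed_variants(words).items():
    print(key, forms)
```

Words are lowercased before grouping so that a capitalized sentence-initial form does not count as a separate variant. A word that appears in only one style, like "résumé" in the sample list, is not reported.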
05-03-2014, 03:01 AM | #11 | |
Junior Member
Posts: 6
Karma: 50654
Join Date: Sep 2013
Device: Android tablet (Coolreader app)
|
I've created a new release that addresses some of those issues described above. I'm not entirely sure how to tackle the hyphenation issues with all the prefixes yet, so that's not included.
Quote:
|
|
09-29-2017, 11:49 AM | #12 | |
Enthusiast
Posts: 37
Karma: 10
Join Date: Jul 2014
Device: Kobo Mini
|
Quote:
Anyway, I found the project on GitHub: https://github.com/drake7707/epubspellchecker Thanks for releasing the source! Could you add a license, please, so that people can file pull requests etc.? |
|
10-06-2017, 09:38 AM | #13 | |
Junior Member
Posts: 6
Karma: 50654
Join Date: Sep 2013
Device: Android tablet (Coolreader app)
|
Quote:
|
|