MobileRead Forums - View Single Post

drake7707 · 10-05-2013, 03:47 AM

Quote:

Originally Posted by At_Libitum

I like the initiative but I think there is still room for improvement. It deemed a lot of perfectly valid spelled words a 'Possible OCR error'. I have totally no idea if this is possible but I think you could reduce this a lot by looking at the context the word was used in. Not every h is always the result of an OCR errored 'b' nor vice-versa, same for the 'l' and 'f' versus 't'.

The problem is that the current dictionary.txt file doesn't carry that kind of additional info: I don't know whether a word is a verb, an adjective, etc. If I had that info, I could maybe reduce the number of false positives.

Quote:

Originally Posted by At_Libitum

I also found, by accident, that it seems to suffer from OCR blindness itself too. I was trying it out on "Three Men in a Boat" from Jerome K. Jerome, and there is a sentence in the ePub going as follows:

"Harris, in moving about, trod on George’s corn."

Epub spellchecker actually read the corn as "com" and also flagged it as such. AND gave the same word as suggested replacement.
(see attachment)

Yes, this is intentional, albeit annoying when it occurs. In the books I tested I had a lot of OCR errors that were valid words but still wrong in context. That's why I check the valid words as well: if applying any OCR pattern to a valid word turns it into a word that also occurs in the book, it gets flagged. The only difference between 'Probable OCR error' and 'Possible OCR error' is that the former means the OCR pattern has already been applied when building suggestions for words that weren't recognized. You can turn this behaviour off in the options though.
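To make that check concrete, here is a rough sketch in Python. It is not the tool's actual code: the pattern list, the function names and the data structures are all made up for illustration, and the real pattern list isn't given in this thread.

```python
from collections import Counter

# Hypothetical OCR confusion patterns as (found, replacement) pairs.
# The tool's real list is not shown in the post.
OCR_PATTERNS = [("rn", "m"), ("m", "rn"), ("h", "b"), ("b", "h"), ("di", "th")]


def variants(word, patterns=OCR_PATTERNS):
    """Yield each word obtained by applying one pattern at one position."""
    for found, replacement in patterns:
        start = word.find(found)
        while start != -1:
            yield word[:start] + replacement + word[start + len(found):]
            start = word.find(found, start + 1)


def possible_ocr_errors(book_words, dictionary):
    """Map each valid word to the pattern variants that also occur in the book."""
    counts = Counter(w.lower() for w in book_words)
    flagged = {}
    for word in counts:
        if word not in dictionary:
            continue  # unrecognized words go through the normal suggestion path
        hits = sorted({v for v in variants(word) if v != word and v in counts})
        if hits:
            flagged[word] = hits
    return flagged


book = ["the", "corn", "was", "com", "die"]
valid = {"the", "corn", "was", "die"}
flagged = possible_ocr_errors(book, valid)
```

With this toy input, "corn" gets flagged because "com" also occurs in the book, and "die" gets flagged because of "the". That is the same kind of noise described above: the more patterns there are, the more valid words get flagged.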

Edit: oh wait, I misread this. I'll check it out; it might just be a font rendering issue though, where rn looks like m.

Quote:

Originally Posted by At_Libitum

Also, about the unneeded hyphens, not being a native English speaker I would need to study up on when and where they are normally used but I am almost sure there are words requiring them. I have not yet looked too deep but I assume there is some kind of exception list someplace so that not everything containing hyphens is flagged as such.

Not being a native English speaker myself, I was hoping that hyphenated words were included in the dictionary.txt file and thus would not be flagged as 'Unnecessary hyphens'. This is one I have difficulty with when correcting books, because I don't know the spelling of most of those hyphenated words (and it also seems to vary on a book-by-book basis).

Quote:

Originally Posted by At_Libitum

EDIT: PS. It may have not had this when you started ePub checker, but the current Sigil build has a similar approach option as yours. If using the Spellcheck button you get like in ePub spellchecker, a list with deemed misspellings plus frequency counts and similarly like in ePub, you can have all occurrences replaced at once but not for every 'misspelling' at once, which may be a bit too aggressive because you always will need to revise the list to make sure it only replaces true misspellings. So in the end you are still spending the same amount of time. But that aside. Yours does offer more information in that it tries to categorize the type of spelling errors AND more importantly it shows the context.

It had; my spell checking thing is relatively new. You can exclude lines from correction though. If you select lines in the occurrence list and click the ignore button (to the right of it), you'll see that the selected occurrences are greyed out and won't be corrected. The classic example is "die" vs "the": a lot of the time die should be the, and the other occurrences where die is valid you can just grey out while keeping the die -> the fix.
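As an illustration of the idea (again just a sketch, not how the tool does it internally), replacing every occurrence except the greyed-out ones could look like this in Python. The function name and the way ignored occurrences are identified (by their 0-based position in reading order) are assumptions made for the example.

```python
import re


def replace_except_ignored(text, wrong, right, ignored):
    """Replace whole-word occurrences of `wrong` with `right`, skipping the
    occurrences whose 0-based index (in reading order) is in `ignored`."""
    pattern = re.compile(r"\b" + re.escape(wrong) + r"\b")
    index = -1

    def substitute(match):
        nonlocal index
        index += 1
        # Ignored ("greyed out") occurrences keep their original text.
        return match.group(0) if index in ignored else right

    return pattern.sub(substitute, text)


sample = "He saw die cat. They will die. It was die end."
fixed = replace_except_ignored(sample, "die", "the", ignored={1})
# fixed == "He saw the cat. They will die. It was the end."
```

Here the second "die" is the valid one, so it is excluded, and the other two become "the".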

Quote:

Originally Posted by At_Libitum

Suggestion: Extend the Options filter to include all types of possible errors so that you can filter on each category separately instead of on "Show only errors & warnings" (and also of course have the 'Copy all suggestions ...' then only affect the filtered list)

I'll probably add more filters; it gets a bit cluttered now.

Quote:

Originally Posted by At_Libitum

Suggestion2: About the context preview. Would be cool if it could show the rendered version instead of the html code itself. Also there seems to be some extra useless space inserted before the bolded misspelling. Don't think you need that to accentuate the misspelling if you already bold the word.

I'll try, but currently I don't parse any HTML at all. I ignore all tags, replace the escaped characters (like &quot;) with their unescaped form and then start tokenizing words. I'll see if I can find a library that can show HTML while retaining the current functionality.
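For reference, the three steps (ignore tags, unescape, tokenize) can be sketched roughly as follows in Python, using only the standard library. The regular expressions and the function name are assumptions for the example and are not taken from the tool; in particular, the real tokenizer's rules for apostrophes and hyphens may well differ.

```python
import html
import re

# Anything between < and > is treated as a tag and dropped; no real HTML parsing.
TAG = re.compile(r"<[^>]+>")
# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def tokenize(markup):
    """Strip tags, unescape character references, then split into words."""
    text = TAG.sub(" ", markup)   # 1. ignore all tags
    text = html.unescape(text)    # 2. &quot; -> ", &amp; -> &, ...
    return WORD.findall(text)     # 3. tokenize


words = tokenize("<p>&quot;Harris trod on George&#39;s <i>corn</i>.&quot;</p>")
# words == ["Harris", "trod", "on", "George's", "corn"]
```

Because the tags are thrown away rather than parsed, a context preview built this way has nothing to render from, which is why showing the rendered version would need an HTML library on top.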