De-hyphenator
I have been reading some older books lately that were obviously scanned using OCR. There aren't a lot of problems, but there are quite a few hard hyphens that are clearly left over from end-of-line hyphenation.
I know the logic I want to use to find these extra hyphens, but have no idea how to actually implement it.
My thought is to search for something like [ “‘’]([a-z]+)-([a-z]+)[ .,!?”’:;/-] and then feed the two captured groups to spellcheck. If both are spelled correctly and the two joined together are not a correctly spelled word, skip the match. Otherwise (either half is misspelled, or the concatenation of the two is spelled correctly), add the match to a list that would be displayed the way spellcheck displays its results.
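As a rough illustration of the search step only, here is a minimal Python sketch. The pattern is the one above, with one change that is my own assumption: the trailing delimiter is a lookahead, so a candidate that immediately follows another one is not skipped. The function name find_hyphenated is made up for the example.

```python
import re

# The pattern from the post: an opening delimiter, two lowercase runs joined
# by a hyphen, then a closing delimiter. The closing delimiter is written as
# a lookahead (an assumption, not part of the original pattern) so it is not
# consumed and can still serve as the opening delimiter of the next match.
HYPHEN_RE = re.compile(r"[ “‘’]([a-z]+)-([a-z]+)(?=[ .,!?”’:;/-])")

def find_hyphenated(text):
    """Return a (left, right) tuple for every hyphenated candidate in text."""
    return [(m.group(1), m.group(2)) for m in HYPHEN_RE.finditer(text)]

if __name__ == "__main__":
    sample = "It was a won-derful day on the house-boat, a well-known spot."
    for left, right in find_hyphenated(sample):
        print(left, right)
```

Because the pattern requires a delimiter before the word, a candidate at the very start of the text is not matched, and words containing capitals are ignored, exactly as with the original expression.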
The logic is that a "proper" hyphen in an EPUB source will have a real word on each side, and it would be very unlikely that removing the hyphen would result in a correctly spelled word. There would still be a few false positives in the list (mostly where the hyphenation is just an alternate spelling), and it would miss some compound words that just happened to be at the end of a line and got split to the next line (like "house-boat"), but it should still help.
The search part is easy...it's the "feed it to the spell checker" that I have no clue how to accomplish.
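To make the decision rule concrete, here is a minimal Python sketch that does not depend on any particular editor's spell checker. The names should_flag, review_list and is_word are hypothetical; is_word stands in for whatever spell-check lookup is actually available, and in the example it is just membership in a small hand-made word set. The sketch assumes that a correctly spelled concatenation always flags the match, even when both halves are words.

```python
def should_flag(left, right, is_word):
    """Decide whether a hyphenated pair belongs on the review list.

    is_word is any callable that returns True for a correctly spelled word
    (a stand-in for a real spell checker).
    """
    both_ok = is_word(left) and is_word(right)
    joined_ok = is_word(left + right)
    # Skip only when both halves are words and the joined form is not.
    return (not both_ok) or joined_ok

def review_list(pairs, is_word):
    """Return the hyphenated forms a user should look at, in input order."""
    return [f"{left}-{right}" for left, right in pairs
            if should_flag(left, right, is_word)]

if __name__ == "__main__":
    # Tiny stand-in dictionary, for illustration only.
    words = {"won", "wonderful", "house", "boat", "well", "known",
             "to", "day", "today"}
    pairs = [("won", "derful"), ("house", "boat"),
             ("well", "known"), ("to", "day")]
    print(review_list(pairs, words.__contains__))
```

With this stand-in dictionary, "to-day" lands on the list because "today" is a word, which is the alternate-spelling kind of false positive, while "house-boat" is skipped because "houseboat" is not in the set.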
I think something like this would also be of use to users who are doing their own scanning and OCR for books they own.