11-05-2018, 06:33 AM   #179
carmenchu
Trouble with hyphens

Quote:
Originally Posted by CalibUser
Hi,

I have developed this plugin as a tool to help tidy up ePub files that have been converted from pdf documents but contain ocr errors.

EDIT The plugin has been updated to version 2.0.0.5B. This provides the facility to scroll the span tag window.

The instructions for using the plugin are in the attached file named ePub tidy tool v0.2.0.0.5A.epub.

Important: Please ensure that you keep a backup of your original ePub file before running this plugin.

When some old publications are OCR'd, some words are frequently misspelt in the same way in every scan. I am attaching a file that can be used with the plugin to correct the spelling of these words. It is based on a file provided by martyger at https://www.mobileread.com/forums/sh...d.php?t=265830 and includes updates from Steadyhands at https://www.mobileread.com/forums/sh...&postcount=154

Gipsy has put files containing Greek words for this plugin in this thread at:
https://www.mobileread.com/forums/sh...65#post3208365



Enjoy!
Hello!
First, enormous thanks for implementing functionality I had been wishing for. (I overlooked the plugin for a while because I associated ePubTidy with HTMLTidy.)
Now, I have hit a snag running the plugin: in the list of
Quote:
proposed changes--remove hyphen
I found, for example:
cow-puncher --> cowpuncher
to-day --> today
But, looking in Wiktionary, both hyphenated forms are listed, the first as an 'alternate spelling', the second as an 'old form of' the unhyphenated word (the book is from the 1910s).
That being the case, I discarded the changes, which also discards everything else.
Would it be possible to either
-- make the hyphen changes optional
-- or allow for a 'custom exception list' of hyphens to keep?
I may mention that when editing OCR files without images of the original text, I usually deal with cases like the above by mimicking Kovid's heuristics: count the occurrences with and without the hyphen...
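To make the idea concrete, here is a rough Python sketch of that counting approach, combined with an exception list of hyphenated words to keep. It is only my own illustration: the function name `hyphens_to_remove`, the `keep` parameter, and the word regex are assumptions, not code from the plugin or from calibre.

```python
import re
from collections import Counter

# Hypothetical helper for illustration; not part of the plugin or calibre.
# Matches plain ASCII words, optionally joined by single hyphens.
WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def hyphens_to_remove(text, keep=frozenset()):
    """Return {hyphenated: joined} for words whose joined form is more common.

    `keep` is an assumed 'custom exception list' of hyphenated words
    that must never be proposed for change.
    """
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    keep = {w.lower() for w in keep}
    proposals = {}
    for word, n_hyphenated in counts.items():
        if "-" not in word or word in keep:
            continue
        joined = word.replace("-", "")
        # Propose removal only when the joined spelling clearly dominates.
        if counts[joined] > n_hyphenated:
            proposals[word] = joined
    return proposals
```

With a text that contains "to-day" three times and "today" once, nothing is proposed for "to-day", because the hyphenated spelling is the more frequent one in that book. Passing `keep={"cow-puncher"}` protects that word regardless of the counts.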
Thanks again for the plugin!