10-10-2015, 07:07 AM   #98
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
Quote:
Originally Posted by martyger
CalibUser,

...sometimes an OCR will miss periods at the end of a paragraph or add spurious lowercase letters to the end of sentences -- the correct fix is to add a period or delete the character...*not* to join paragraphs. Also, many words (like arid/and, modem/modern, etc) may or may not be errors -- the user needs to make that decision based on context.

Adding the ability to step through word lists and paragraph joins-- rather than implementing them *all* automatically -- will prevent the tool from generating a new set of errors while correcting the old ones.

Taking the word list issue first, two different situations arise when correcting misspelt words. Some words are misspelt in the same way every time by OCR readers/converters and have only one possible correct spelling (eg presendy|presently, vou|you). Other words may or may not be errors, or may have several applicable corrections, and these need to be looked at on a case-by-case basis.

Currently this plugin resolves the first situation, as this was relatively straightforward to implement: it uses a word list to automatically correct words that OCR readers/converters misspell in the same way every time and that have only one possible correct spelling.
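
To illustrate the idea (this is only a rough sketch, not the plugin's actual code), a word list in the wrong|right form shown above could be applied like this in Python. The names WORD_LIST, load_corrections and correct_words are made up for the example, and the two entries are just the samples mentioned earlier:

```python
import re

# Hypothetical sample entries in the wrong|right form used in this post;
# the plugin's real word list is not reproduced here.
WORD_LIST = """\
presendy|presently
vou|you
"""

def load_corrections(text):
    """Parse 'wrong|right' lines into a dict, skipping blank or malformed lines."""
    corrections = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue
        wrong, right = line.split("|", 1)
        corrections[wrong] = right
    return corrections

def correct_words(text, corrections):
    """Replace whole words only, so 'vou' is not matched inside 'vouch'.

    Matching is case-sensitive in this sketch.
    """
    if not corrections:
        return text
    alternatives = "|".join(re.escape(word) for word in corrections)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)

CORRECTIONS = load_corrections(WORD_LIST)
print(correct_words("Will vou vouch for him presendy?", CORRECTIONS))
# prints: Will you vouch for him presently?
```

Only words with a single possible correction belong in a list like this; ambiguous pairs such as arid/and would be changed blindly by this approach.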

I will consider adding a feature that offers alternative words for corrections to resolve the second situation; however, I don't have much time to develop the plugin (at the moment I am only carrying out 'tweaks'), so it may be a while before I can add this feature to the plugin.

Similarly, paragraph joins can be an issue and some manual searching is necessary. The plugin will automatically join paragraphs that end with a hyphen to the next paragraph, paragraphs that begin with a lowercase letter to the previous one, and paragraphs that end with Mrs.|Mr.|Dr.|St. to the next one. If you tick the option 'Fix all broken line endings', it will also join paragraphs that end with a lowercase letter to those that begin with an uppercase letter. If you do not tick this option, the plugin should not join paragraphs that have any other types of errors (eg it should not join a paragraph that ends with a lowercase letter to the next paragraph if that one begins with a capital letter or punctuation mark). If you find that it does join such paragraphs when the option is unticked, please let me know and give an example of two paragraphs that are incorrectly being joined together.
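
The rules above can be sketched roughly as follows. This is an illustration working on plain-text paragraphs, not the plugin's own code. The function names are invented for the example, and dropping the trailing hyphen when joining is my assumption here, since the rules only say that such paragraphs are joined:

```python
ABBREVIATIONS = ("Mrs.", "Mr.", "Dr.", "St.")

def should_join(prev, nxt, fix_all_broken_line_endings=False):
    """Decide whether paragraph `nxt` should be joined to paragraph `prev`."""
    prev, nxt = prev.rstrip(), nxt.lstrip()
    if not prev or not nxt:
        return False
    if prev.endswith("-"):                  # ends with a hyphen
        return True
    if nxt[0].islower():                    # next one begins with a lowercase letter
        return True
    if prev.split()[-1] in ABBREVIATIONS:   # ends with Mrs.|Mr.|Dr.|St.
        return True
    # Only with the option ticked: lowercase ending followed by an uppercase start.
    if fix_all_broken_line_endings and prev[-1].islower() and nxt[0].isupper():
        return True
    return False

def join_paragraphs(paragraphs, fix_all_broken_line_endings=False):
    """Return a new list with broken paragraphs merged."""
    result = []
    for para in paragraphs:
        if result and should_join(result[-1], para, fix_all_broken_line_endings):
            prev = result[-1].rstrip()
            if prev.endswith("-"):
                # Assumption: the hyphen marks a word split across the break,
                # so it is dropped and no space is inserted.
                result[-1] = prev[:-1] + para.lstrip()
            else:
                result[-1] = prev + " " + para.lstrip()
        else:
            result.append(para)
    return result

paras = [
    "He was walk-",
    "ing down the road with Dr.",
    "Smith when the rain",
    "started. It was cold",
    "The next day was fine.",
]
print(join_paragraphs(paras))                                    # two paragraphs remain
print(join_paragraphs(paras, fix_all_broken_line_endings=True))  # one paragraph remains
```

With the option off, the last two paragraphs stay separate, which is the case where a missing full stop should be fixed by hand rather than by a join.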

You can use the following regular expression to do a manual Find/Replace for paragraphs that have not been corrected automatically:

Find: ([a-z])</p>\s+<p>
Replace: \1 {There is a space after \1}
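
If you want to see what this Find/Replace does outside the editor, the same expression can be tried with Python's re module. This is just a small demonstration on a made-up snippet of HTML; join_broken_paragraphs is a name invented for the example:

```python
import re

def join_broken_paragraphs(html):
    """Join a paragraph ending in a lowercase letter to the one that follows it."""
    # Same Find and Replace as above; note the space after \1 in the replacement.
    return re.sub(r"([a-z])</p>\s+<p>", r"\1 ", html)

sample = "<p>The rain fell on the</p>\n<p>Roof all night.</p>\n<p>It stopped at dawn.</p>"
print(join_broken_paragraphs(sample))
# prints:
# <p>The rain fell on the Roof all night.</p>
# <p>It stopped at dawn.</p>
```

As written, the expression only matches when there is whitespace between </p> and <p>, and only a plain <p> tag with no attributes, so paragraphs such as <p class="..."> would need the Find pattern adjusted.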


I may, in a future version, show each incorrectly terminated paragraph and provide the option to correct it manually if there is enough demand for this feature.