Hi,
I have developed this plugin as a tool to help tidy up ePub files that have been converted from PDF documents but contain OCR errors. The plugin has the following features:
- processes span tags, allowing tags to be removed or changed
- corrects false line breaks
- corrects miscellaneous errors, for example removing unnecessary spaces, correcting the direction of apostrophes, and inserting the tags <colgroup> and </colgroup> in tables where they are missing
- reformats chapter titles
- reassigns header tags
- uses a customised list of words to correct common misspellings introduced by the OCR process
- imports a customised CSS file
- corrects incorrectly hyphenated words
- has an option to format the XHTML files
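To give a feel for what "correcting false line breaks" involves, here is a minimal sketch in Python. It is not the plugin's actual code: the function name `join_false_breaks` and the rule it applies (merge two paragraphs when the first ends in a lowercase letter, comma or semicolon and the second starts with a lowercase letter) are assumptions chosen for illustration only.

```python
import re

# Hypothetical rule, not taken from the plugin: a paragraph that ends in a
# lowercase letter, comma or semicolon, followed by a paragraph that starts
# with a lowercase letter, is treated as one paragraph split by the OCR.
FALSE_BREAK = re.compile(r'([a-z,;])\s*</p>\s*<p\b[^>]*>\s*(?=[a-z])')

def join_false_breaks(xhtml):
    """Merge paragraphs that appear to have been split mid-sentence."""
    # Keep the last character of the first paragraph and replace the
    # closing/opening tag pair with a single space.
    return FALSE_BREAK.sub(r'\1 ', xhtml)
```

For example, `join_false_breaks('<p>The quick brown</p>\n<p>fox jumped.</p>')` gives a single paragraph, while two paragraphs separated by a full stop are left alone. A real tool needs more cases than this (quotes, numbers, headings), so treat it as a starting point.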
The instructions for using the plugin are in the attached file named ePub tidy tool v3.0.1.0.epub.
Update 20th July 2020
The plugin has been updated to version 3.0.1.0. This version has an option to scan ePub files for hyphenated words and add them to a file of hyphenated words that must not be removed by this plugin.
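The idea of a list of protected hyphenated words can be sketched roughly as follows. This is only an illustration under my own assumptions, not the plugin's code: the names `collect_hyphenated` and `dehyphenate` are made up, and the protected list is held as a Python set rather than a file.

```python
import re

# Matches words such as "well-known" or "mother-in-law".
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")

def collect_hyphenated(text):
    """Return the set of hyphenated words found in text, lowercased."""
    return {m.group(0).lower() for m in HYPHENATED.finditer(text)}

def dehyphenate(text, keep):
    """Remove hyphens from hyphenated words unless the word is in keep."""
    def fix(match):
        word = match.group(0)
        # Words on the protected list are returned unchanged.
        if word.lower() in keep:
            return word
        return word.replace("-", "")
    return HYPHENATED.sub(fix, text)
```

With `keep = {"well-known"}`, the text `"a well-known exam-ple"` becomes `"a well-known example"`. Scanning a trusted ePub with `collect_hyphenated` is one way such a list could be built up.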
Update 11th October 2020
There was an error in version 3.0.1.2 that affected lines containing comments such as <!-- this is an html comment -->, corrupting the ePub. I have made a quick correction in the attached file, version 3.0.1.3, although the error reporting facility will report the following for each comment found:
"Replaced a series of short/long hyphens with one long hyphen 2
Replaced <space><long hypen><space> with one long hyphen 2"
Update 21 November 2020
Bug fixes
The number of changes reported under "Replaced a series of short/long hyphens with one long hyphen" and "Replaced <space><long hypen><space> with one long hyphen" was incorrect; this has been fixed.
A quote mark next to a speech mark (e.g. ’") caused one of these marks to be moved to a line by itself; this has been fixed.
Important: Please ensure that you keep a backup of your original ePub file before running this plugin.
When some old publications are OCR'd, certain words are frequently misspelt in the same way in every scan. I am attaching a file that can be used with the plugin to correct the spelling of these words. It is based on a file provided by martyger at
https://www.mobileread.com/forums/sh...d.php?t=265830 and includes updates from Steadyhands at
https://www.mobileread.com/forums/sh...&postcount=154
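For anyone curious how a word list like this can be applied, here is a rough Python sketch. The file layout is an assumption made for the example (one `misspelling|correction` pair per line, with `#` for comments) and may not match the attached file; the function names are also made up.

```python
import re

def load_corrections(lines):
    """Build a {misspelling: correction} dict from 'wrong|right' lines."""
    pairs = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and lines without a separator.
        if not line or line.startswith("#") or "|" not in line:
            continue
        wrong, right = line.split("|", 1)
        pairs[wrong.strip()] = right.strip()
    return pairs

def apply_corrections(text, pairs):
    """Replace whole-word occurrences of each misspelling in text."""
    if not pairs:
        return text
    # Longest entries first so that longer misspellings are tried first.
    words = sorted(pairs, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    return pattern.sub(lambda m: pairs[m.group(0)], text)
```

Given the lines `tlie|the` and `arid|and`, the text `"tlie cat arid tlie dog"` becomes `"the cat and the dog"`. Note that an entry like `arid|and` would also change the genuine word "arid", which is one reason to keep a backup and review the changes.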
Gipsy has put files containing Greek words for this plugin in this thread at:
https://www.mobileread.com/forums/sh...65#post3208365
Update 18 April 2022
Bug fix
A fix has been made to address the issue raised by Thasaidon.
Enjoy!