08-23-2015, 09:39 AM   #1
CalibUser
Addict
Posts: 203
Karma: 62362
Join Date: Jul 2015
Device: Sony
Plugin for tidying ePub files

Hi,

I have developed this plugin as a tool to help tidy up ePub files that have been converted from PDF documents but contain OCR errors. The plugin has the following features:
  • processes span tags, allowing tags to be removed or changed
  • corrects false line breaks
  • corrects miscellaneous errors, for example, removing unnecessary spaces, correcting the direction of apostrophes, and inserting the tags <colgroup> and </colgroup> in tables where they are missing
  • reformats chapter titles
  • reassigns heading tags
  • uses a customised list of words to correct common misspellings in the OCR process
  • imports a customised css file
  • corrects incorrectly hyphenated words
  • has an option to format the xhtml files
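
To give an idea of what "correcting false line breaks" means in practice, here is a minimal Python sketch. It is not the plugin's own code; the rule it uses (a paragraph that does not end in closing punctuation, followed by one that starts with a lowercase letter) and the name join_false_breaks are assumptions made for illustration only.

```python
import re

# Assumed heuristic (not taken from the plugin): a paragraph boundary is "false"
# when the text before </p> does not end in sentence-closing punctuation and
# the next paragraph starts with a lowercase letter.
FALSE_BREAK = re.compile(
    r'(?<![.!?:;"\'\u201d\u2019])</p>\s*<p(?:\s[^>]*)?>(?=[a-z])'
)


def join_false_breaks(xhtml: str) -> str:
    """Merge paragraphs that were split in mid-sentence by the OCR/conversion."""
    return FALSE_BREAK.sub(" ", xhtml)


if __name__ == "__main__":
    sample = "<p>The quick brown</p>\n<p>fox jumped.</p><p>Next one.</p>"
    print(join_false_breaks(sample))
```

A heuristic like this will occasionally join paragraphs that should stay apart, which is one more reason to keep a copy of the original file.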

The instructions for using the plugin are in the attached file named ePub tidy tool v3.0.1.0.epub.

Update 20th July 2020: The plugin has been updated to version 3.0.1.0. This version has an option to scan ePub files for hyphenated words and add them to a file of hyphenated words that must not be removed by this plugin.
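
For anyone curious how such a scan could work, the following is a rough Python sketch and not the plugin's actual implementation. The function names find_hyphenated_words and merge_protected are hypothetical, and the way the plugin stores its list of protected words is not shown here.

```python
import re

# Words made of letters joined by one or more hyphens, e.g. "well-known".
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")
# Crude tag stripper so that attribute values such as class="a-b" are ignored.
TAG = re.compile(r"<[^>]+>")


def find_hyphenated_words(xhtml: str) -> set:
    """Return the lower-cased hyphenated words found in the visible text."""
    text = TAG.sub(" ", xhtml)
    return {word.lower() for word in HYPHENATED.findall(text)}


def merge_protected(existing, found) -> list:
    """Combine an existing list of protected words with newly found ones."""
    return sorted(set(existing) | set(found))


if __name__ == "__main__":
    page = '<p class="a-b">A well-known mother-in-law.</p>'
    print(merge_protected(["e-mail"], find_hyphenated_words(page)))
```

The merged list could then be saved one word per line, so that words on it are left alone when hyphens are removed.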

Update 11th October 2020: There was an error in version 3.0.1.2 that affected lines containing an HTML comment such as <!-- this is an html comment -->, corrupting the ePub. I have made a quick correction in the attached file, version 3.0.1.3, although the error reporting facility will report the following for each comment found:

"Replaced a series of short/long hyphens with one long hyphen 2
Replaced <space><long hypen><space> with one long hyphen 2"
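
As an illustration of why comments are a problem here: the delimiters <!-- and --> contain runs of hyphens, so a rule that collapses hyphen runs will damage them unless comments are skipped. The Python sketch below shows one way to skip them. It is not the plugin's code, the name normalise_hyphens is hypothetical, and it assumes that "long hyphen" means the em dash (U+2014).

```python
import re

# The capturing group makes re.split() keep the comments in the result list.
COMMENT = re.compile(r"(<!--.*?-->)", re.DOTALL)
# Two or more hyphens, en dashes or em dashes in a row.
HYPHEN_RUN = re.compile(r"[-\u2013\u2014]{2,}")


def normalise_hyphens(xhtml: str) -> str:
    """Replace runs of hyphens/dashes with one em dash, leaving comments alone."""
    parts = COMMENT.split(xhtml)
    # Even indexes hold ordinary markup; odd indexes hold the comments.
    for i in range(0, len(parts), 2):
        parts[i] = HYPHEN_RUN.sub("\u2014", parts[i])
    return "".join(parts)


if __name__ == "__main__":
    print(normalise_hyphens("<p>Wait--what?</p><!-- a note -->"))
```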

Update 21 November 2020
Bug fixes
The number of changes reported under "Replaced a series of short/long hyphens with one long hyphen" and "Replaced <space><long hypen><space> with one long hyphen" was incorrect; this has been fixed.

A quote mark next to a speech mark (e.g. ’") caused one of these marks to be moved to a line by itself; this has been fixed.

Important: Please ensure that you keep a backup of your original ePub file before running this plugin.

When some old publications are OCR'd, certain words are frequently misspelt in the same way in every scan. I am attaching a file that can be used with the plugin to correct the spelling of these words. It is based on a file provided by martyger at https://www.mobileread.com/forums/sh...d.php?t=265830 and includes updates from Steadyhands at https://www.mobileread.com/forums/sh...&postcount=154
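
The idea behind a word-correction list can be sketched in a few lines of Python. This is only an illustration and not how the plugin is written: the "wrong|right" line layout, the separator, and the names parse_corrections and apply_corrections are all assumptions, so check the instructions in the attached ePub for the layout the plugin actually expects.

```python
import re


def parse_corrections(lines, sep="|"):
    """Build a {misspelling: correction} dict from 'wrong|right' lines (assumed layout)."""
    pairs = {}
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue  # skip blank or malformed lines
        wrong, right = (part.strip() for part in line.split(sep, 1))
        if wrong:
            pairs[wrong] = right
    return pairs


def apply_corrections(text, pairs):
    """Replace whole-word occurrences of each misspelling with its correction."""
    if not pairs:
        return text
    # Longest entries first so a longer misspelling wins over a shorter prefix.
    words = sorted(pairs, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda match: pairs[match.group(1)], text)


if __name__ == "__main__":
    corrections = parse_corrections(["tbe|the", "", "wbich|which"])
    print(apply_corrections("tbe cat wbich saw tbe dog", corrections))
```

Matching on word boundaries keeps a short entry from changing the inside of a longer word.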

Gipsy has put files containing Greek words for this plugin in this thread at:
https://www.mobileread.com/forums/sh...65#post3208365

Update 18 April 2022
Bug fix

A fix has been made to address the issue raised by Thasaidon.

Enjoy!
Attached Files
File Type: txt IncorrectWords.txt (1.3 KB, 3692 views)
File Type: epub ePub tidy tool v3.0.1.0.epub (17.5 KB, 2408 views)
File Type: zip ePubTidyTool_v3.0.1.6.zip (43.9 KB, 2777 views)

Last edited by CalibUser; 04-18-2022 at 05:51 AM. Reason: Bug fixes