Old 08-23-2015, 09:39 AM   #1
CalibUser
Addict
Posts: 201
Karma: 62362
Join Date: Jul 2015
Device: Sony
Plugin for tidying ePub files

Hi,

I have developed this plugin as a tool to help tidy up ePub files that have been converted from PDF documents but contain OCR errors. The plugin has the following features:
  • processes span tags, allowing tags to be removed or changed
  • corrects false line breaks
  • corrects miscellaneous errors, for example, removing unnecessary spaces, correcting the direction of apostrophes, and inserting the tags <colgroup> and </colgroup> in tables where they are missing
  • reformats chapter titles
  • reassigns header tags
  • uses a customised list of words to correct common misspellings in the OCR process
  • imports a customised css file
  • corrects incorrectly hyphenated words
  • has an option to format the xhtml files

The instructions for using the plugin are in the attached file named ePub tidy tool v3.0.1.0.epub.

Update 20th July 2020 The plugin has been updated to version 3.0.1.0. This version has an option to scan ePub files for hyphenated words and add them to a file of hyphenated words that must not be removed by this plugin.
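
For readers curious how such a scan and exception list could work, here is a minimal sketch. It is not the plugin's actual code: the function names, the regular expression and the idea of passing the exception list as a plain Python list are all assumptions made for illustration.

```python
import re

# Matches words joined by single hyphens, e.g. "well-known" or "to-day".
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")

def collect_hyphenated(text):
    """Return the hyphenated words found in text, lower-cased and sorted."""
    return sorted({m.group(0).lower() for m in HYPHENATED.finditer(text)})

def remove_hyphens(text, keep):
    """Join hyphenated words unless they appear in the keep list (hypothetical)."""
    keep = {w.lower() for w in keep}

    def fix(match):
        word = match.group(0)
        return word if word.lower() in keep else word.replace("-", "")

    return HYPHENATED.sub(fix, text)
```

In this sketch the words gathered by collect_hyphenated() would be reviewed by hand and saved, and remove_hyphens() would then leave anything on that list untouched.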

Update 11th October 2020 There was an error in version 3.0.1.2 that affected lines that were commented with <!-- this is an html comment -->, corrupting the ePub. I have made a quick correction in the attached file, version 3.0.1.3, although the error reporting facility will report the following for each comment found:

"Replaced a series of short/long hyphens with one long hyphen 2
Replaced <space><long hypen><space> with one long hyphen 2"

Update 21 November 2020
Bug fixes
The number of changes reported under Replaced a series of short/long hyphens with one long hyphen and Replaced <space><long hypen><space> with one long hyphen was incorrect; this has been fixed.

A quote mark next to a speech mark (e.g. ’") caused one of these marks to be moved to a line by itself; this has been fixed.

Important: Please ensure that you keep a backup of your original ePub file before running this plugin.

When some old publications are OCR'd some words are frequently misspelt in the same way in every scan. I am attaching a file that can be used with the plugin to correct the spelling of these words. It is based on a file provided by martyger at https://www.mobileread.com/forums/sh...d.php?t=265830 and includes updates from Steadyhands at https://www.mobileread.com/forums/sh...&postcount=154

Gipsy has put files containing Greek words for this plugin in this thread at:
https://www.mobileread.com/forums/sh...65#post3208365

Update 18 April 2022
Bug fix

A fix has been made to address the issue raised by Thasaidon

Enjoy!
Attached Files
File Type: txt IncorrectWords.txt (1.3 KB, 3184 views)
File Type: epub ePub tidy tool v3.0.1.0.epub (17.5 KB, 1529 views)
File Type: zip ePubTidyTool_v3.0.1.6.zip (43.9 KB, 1457 views)

Last edited by CalibUser; 04-18-2022 at 05:51 AM. Reason: Bug fixes
Old 08-31-2015, 01:36 PM   #2
CalibUser
Addict
Posts: 201
Karma: 62362
Join Date: Jul 2015
Device: Sony
I have updated the plugin. It corrects a few more errors in ePub files and also has a new tool to help with formatting chapter titles. I have put the new plugin in the first post in this thread.

As always, ensure you have a backup of your ePub book before running this plugin.
Old 08-31-2015, 04:25 PM   #3
Doitsu
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by CalibUser View Post
I have updated the plugin at https://www.mobileread.com/forums/sho...d.php?t=264378. It should work on the other Operating Systems, although I have not tested it on these.
The plugin installed fine with the latest Linux version of Sigil and appears to be working as designed.

IMHO, it's a bit confusing, though, that the user has to press Cancel to close the UI. Ideally, the UI should self-destroy after the plugin is done.
Old 08-31-2015, 05:48 PM   #4
exaltedwombat
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
Should this plugin work under Windows 10? I'm getting no setup screen, then if I run anyway it fails with:

TclError: Can't find a usable init.tcl in the following directories:
C:/Python34/lib/tcl8.6 C:/lib/tcl8.6 C:/lib/tcl8.6 C:/library C:/library C:/tcl8.6.1/library C:/tcl8.6.1/library
This probably means that Tcl wasn't installed properly.
Old 08-31-2015, 06:06 PM   #5
Doitsu
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by exaltedwombat View Post
Should this plugin work under Windows 10? I'm getting no setup screen, then if I run anyway it fails with:

TclError: Can't find a usable init.tcl in the following directories:
C:/Python34/lib/tcl8.6 C:/lib/tcl8.6 C:/lib/tcl8.6 C:/library C:/library C:/tcl8.6.1/library C:/tcl8.6.1/library
This probably means that Tcl wasn't installed properly.
I got the same error on my Windows 10 machine. Did you by any chance also install ActivePython 2.7.x and 3.4.x on your machine?

@CalibUser: Did you install the official Python 3.4.x build from the official Python website (python.org)?
Old 08-31-2015, 06:49 PM   #6
exaltedwombat
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
Sorted. By installing the latest release of Python 3.4 from python.org.
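
For anyone else hitting the init.tcl error, a quick way to check whether a given Python installation can start Tcl at all is sketched below. It uses only the standard tkinter module; the function name check_tk is made up for this example, and it should be run with the same Python interpreter that Sigil is configured to use.

```python
def check_tk():
    """Return (ok, detail): whether this Python can start a Tcl interpreter."""
    try:
        import tkinter
    except ImportError as exc:
        return False, "tkinter is not available: {}".format(exc)
    try:
        # Tcl() starts an interpreter without opening a window; this is the
        # step that should fail when no usable init.tcl can be found.
        version = tkinter.Tcl().eval("info patchlevel")
    except tkinter.TclError as exc:
        return False, "Tcl failed to initialise: {}".format(exc)
    return True, "Tcl/Tk " + version

ok, detail = check_tk()
print(ok, detail)
```

If this reports a failure, the problem most likely lies with the Python/Tcl installation rather than with the plugin.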
Old 09-02-2015, 12:22 PM   #7
KevinH
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
This plugin thread has been added to the official Sigil Plugin Index thread here:

https://www.mobileread.com/forums/sho...d.php?t=247431

KevinH

Last edited by KevinH; 09-02-2015 at 12:59 PM.
Old 09-02-2015, 02:26 PM   #8
CalibUser
Addict
Posts: 201
Karma: 62362
Join Date: Jul 2015
Device: Sony
Hi,

@ Doitsu: I am using Python version 3.4.0 from the Python Software Foundation.

"it's a bit confusing, though, that the user has to press Cancel to close the UI. Ideally, the UI should self-destroy after the plugin is done"

In Windows 7 my plugin shuts itself down, although the Sigil Plugin Runner Window stays open. I use this to report the changes made. Is it the Sigil Plugin Runner Window that needs to be closed using the Cancel button, or is it my plugin? On my system I click the OK button to close the Sigil Window.
Old 09-02-2015, 04:47 PM   #9
Doitsu
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Maybe I don't understand how to use the plugin correctly or how the plugin works.

I created the following test file:

Code:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title></title>
</head>

<body>
  <p>I went to</p>

  <p>California for my holiday.</p>

  <p>I went to</p>

  <p>my favorite bar yesterday.</p>
</body>
</html>
I then started the plugin and selected only "Fix ALL broken line endings" and clicked OK.

The plugin displayed the following message in the Plugin Runner dialog box:

Code:
ID: Section0001.xhtml	href: Text/Section0001.xhtml
Open quote:  "
Close quote:  "
Apostrophe:  '
but nothing got changed and both the Plugin Runner dialog box and the TK dialog box remained visible.

I had to click the Cancel button in the TK window to terminate the plugin.
Attached thumbnail: dialog.png (13.4 KB)
Old 09-03-2015, 01:56 PM   #10
CalibUser
Addict
Posts: 201
Karma: 62362
Join Date: Jul 2015
Device: Sony
Thanks for the feedback.
I removed my debugging code from the plugin and this seems to have caused a problem - I probably removed something that I should have left in place.

I will try to work out what has happened.
Old 09-03-2015, 03:06 PM   #11
CalibUser
Addict
Posts: 201
Karma: 62362
Join Date: Jul 2015
Device: Sony
I have fixed a bug in this plugin and uploaded it to the first post in this thread.

The plugin should close automatically, update the ePub file and display the changes made in the Plugin Runner dialog box.
Old 09-03-2015, 07:00 PM   #12
Doitsu
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by CalibUser View Post
I have fixed a bug in this plugin and uploaded it to the first post in this thread.
The new version works on Windows, but not on my Linux machine (Debian Jessie). However, this is most likely caused by some incompatible library on my system, or maybe because Debian Jessie comes with Python 3.4.2 and Windows with Python 3.4.3.

Can someone who uses a Linux distro other than Debian Jessie or a Mac please test the plugin with the following test file?

Code:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title></title>
</head>

<body>
  <p>I went to</p>

  <p>California for my holiday.</p>

  <p>I went to</p>

  <p>my favorite bar yesterday.</p>
</body>
</html>
Select only "Fix ALL broken line endings" and click OK.
(This should merge the two broken sentences.)
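
To make the expected result concrete, here is a rough sketch of one way such a merge could be done. It is not the plugin's implementation: the regular expression, the function name and the rule "a paragraph is broken if it does not end in sentence-closing punctuation" are assumptions, and it only handles plain <p> elements with no attributes or nested tags.

```python
import re

# A plain paragraph whose text does not end in sentence-closing punctuation,
# followed directly by another plain <p> paragraph.
BROKEN = re.compile(r"<p>([^<]*[^<.!?:;\"'\s])\s*</p>\s*<p>([^<]*)</p>")

def merge_broken_paragraphs(xhtml):
    """Join a paragraph to the next one when it appears to end mid-sentence."""
    previous = None
    while previous != xhtml:  # repeat until no further merges happen
        previous = xhtml
        xhtml = BROKEN.sub(r"<p>\1 \2</p>", xhtml)
    return xhtml

sample = (
    "<p>I went to</p>\n\n  <p>California for my holiday.</p>\n\n"
    "  <p>I went to</p>\n\n  <p>my favorite bar yesterday.</p>"
)
print(merge_broken_paragraphs(sample))
```

Applied to the body of the test file, this leaves two paragraphs: "I went to California for my holiday." and "I went to my favorite bar yesterday."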
Old 09-03-2015, 10:19 PM   #13
eschwartz
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Doitsu -- I am running Arch Linux, so my python is the latest version (3.4.3).

Installed the plugin, entered your test file, ran the plugin.... clicked OK...

Code:
ID: Section0001.xhtml	href: Text/Section0001.xhtml
Open quote:  "
Close quote:  "
Apostrophe:  '
Still running and running and running.


...


Ah, but if I click Cancel it reports success. No changes, just success.

Last edited by eschwartz; 09-03-2015 at 10:22 PM.
Old 09-05-2015, 04:07 AM   #14
gipsy
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
First of all, thanks for your work.
It saves me some time on manual editing :P

I want to ask you something. In Greek, the ePub sometimes contains 'Ε or "Ε instead of Έ.
Is there any way to add this to the plugin's checks? It doesn't need to be added to the plugin for everyone; I just want to try it first to see whether it works well.

Thanks

EDIT: Found it :P
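
For anyone with the same question, a stand-alone sketch of this kind of replacement is shown below. It is only an illustration and not how the plugin or its word files handle it; the function name is made up. It also assumes the stray mark is always a misread accent, so a genuine opening quote directly before a capital vowel would be converted as well.

```python
import re
import unicodedata

# A straight apostrophe or double quote directly before a Greek capital vowel
# (Alpha, Epsilon, Eta, Iota, Omicron, Upsilon, Omega).
STRAY_TONOS = re.compile("['\"]([\u0391\u0395\u0397\u0399\u039f\u03a5\u03a9])")

def fix_greek_capitals(text):
    """Replace a quote mark plus capital vowel with the accented capital."""
    def compose(match):
        # Vowel + combining acute accent, composed into a single character.
        return unicodedata.normalize("NFC", match.group(1) + "\u0301")
    return STRAY_TONOS.sub(compose, text)
```

Because the accented letter is built by Unicode composition, no hand-typed table of accented capitals is needed.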

Last edited by gipsy; 09-05-2015 at 07:16 AM.
Old 09-05-2015, 10:44 AM   #15
CalibUser
Addict
Posts: 201
Karma: 62362
Join Date: Jul 2015
Device: Sony
I believe the problem on Linux is the path specified in the plugin for the dictionary (I don't have Linux so I can't confirm this). I have updated the plugin in the first post in this thread so that when the plugin is run for the first time, it asks for the location and filename of the dictionary (see the epub in the first post for details) that is used for correcting hyphenated words that should not be hyphenated. Hopefully this will resolve the problem in Linux so that it will not run and run, nor require the Cancel button to be pressed to exit.

I have improved the plugin for working with Chapter headings: Some words such as 'an' do not normally start with a capital letter when the heading is in titlecase. I have amended the plugin so that these words are now in lower case when titlecase is selected in the plugin. If you come across any words that should be lowercase but appear in titlecase then please let me know and I will update in the next version of this plugin.

With the previous version of the plugin when titlecase is applied to a chapter heading the first Roman numeral is capitalised and the remainder are in lower case; I have added an option to the 'Format chapter titles' dialog so that the user can select the required case for Roman numerals when title case is applied.
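
The title-case behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the plugin's code: the list of small words is incomplete, the function and parameter names are invented, and words with attached punctuation (such as "iv.") are not recognised as numerals.

```python
import re

# Words kept in lower case inside a title (an illustrative, incomplete list).
SMALL_WORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "or", "the", "to"}

# A well-formed Roman numeral; note that real words such as "mix" also match.
ROMAN = re.compile(
    r"^(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)

def title_case(heading, roman="upper"):
    """Title-case a heading; roman selects 'upper' or 'lower' for numerals."""
    result = []
    for index, word in enumerate(heading.split()):
        if ROMAN.match(word):
            result.append(word.upper() if roman == "upper" else word.lower())
        elif index > 0 and word.lower() in SMALL_WORDS:
            result.append(word.lower())
        else:
            result.append(word.capitalize())
    return " ".join(result)

print(title_case("chapter xii return of an old friend"))
```

Passing roman="lower" keeps the numeral as "xii" while the rest of the heading is still title-cased.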

The plugin does require version 3.4 of Python - I should have mentioned this sooner.

@DiapDealer: Please remove the posts concerning the debate on the version of Python that is used as this detracts from the purpose of this thread. Thanks.

@davidfor: This plugin is for Sigil - my user name is misleading. Originally I joined the forum when there were no plans to develop Sigil further, so I chose my user name as CalibUser; when I found out that Sigil would continue to be developed I carried on using Sigil as my preferred ePub editor - I don't think it's possible to change user names. However, I do use Calibre for other functions.