Plugin for tidying ePub files
3 Attachment(s)
Hi,
I have developed this plugin as a tool to help tidy up ePub files that have been converted from PDF documents but contain OCR errors. The plugin has the following features:
The instructions for using the plugin are in the attached file named ePub tidy tool v3.0.1.0.epub.
Update 20th July 2020: The plugin has been updated to version 3.0.1.0. This version has an option to scan ePub files for hyphenated words and add them to a file of hyphenated words that must not be removed by this plugin.
Update 11th October 2020: There was an error in version 3.0.1.2 that affected lines containing an HTML comment such as <!-- this is an html comment -->, corrupting the ePub. I have made a quick correction in the attached file, version 3.0.1.3, although the error reporting facility will report the following for each comment found: "Replaced a series of short/long hyphens with one long hyphen 2 Replaced <space><long hypen><space> with one long hyphen 2"
Update 21 November 2020: Bug fixes. The number of changes reported under "Replaced a series of short/long hyphens with one long hyphen" and "Replaced <space><long hypen><space> with one long hyphen" was incorrect; this has been fixed. A quote mark next to a speech mark (e.g. ’") caused one of these marks to be moved to a line by itself; this has been fixed.
Important: Please ensure that you keep a backup of your original ePub file before running this plugin.
When some old publications are OCR'd, some words are frequently misspelt in the same way in every scan. I am attaching a file that can be used with the plugin to correct the spelling of these words. It is based on a file provided by martyger at https://www.mobileread.com/forums/sh...d.php?t=265830 and includes updates from Steadyhands at https://www.mobileread.com/forums/sh...&postcount=154
Gipsy has put files containing Greek words for this plugin in this thread at: https://www.mobileread.com/forums/sh...65#post3208365
Enjoy! |
I have updated the plugin. It corrects a few more errors in ePub files and also has a new tool to help with formatting chapter titles. I have put the new plugin in the first post in this thread.
As always, ensure you have a backup of your ePub book before running this plugin. |
Quote:
IMHO, it's a bit confusing, though, that the user has to press Cancel to close the UI. Ideally, the UI should self-destroy after the plugin is done. |
Should this plugin work under Windows 10? I'm getting no setup screen, then if I run anyway it fails with:
TclError: Can't find a usable init.tcl in the following directories: C:/Python34/lib/tcl8.6 C:/lib/tcl8.6 C:/lib/tcl8.6 C:/library C:/library C:/tcl8.6.1/library C:/tcl8.6.1/library This probably means that Tcl wasn't installed properly. |
Quote:
@CalibUser: Did you install the official Python 3.4.x build from the official Python website (python.org)? |
Sorted. By installing the latest release of Python 3.4 from python.org.
|
This plugin thread has been added to the official Sigil Plugin Index thread here:
https://www.mobileread.com/forums/sho...d.php?t=247431 KevinH |
Hi,
@Doitsu: I am using Python version 3.4.0 from the Python Software Foundation.
"it's a bit confusing, though, that the user has to press Cancel to close the UI. Ideally, the UI should self-destroy after the plugin is done"
In Windows 7 my plugin shuts itself down, although the Sigil Plugin Runner window stays open. I use this to report the changes made. Is it the Sigil Plugin Runner window that needs to be closed using the Cancel button, or is it my plugin? On my system I click the OK button to close the Sigil window. |
1 Attachment(s)
Maybe I don't understand how to use the plugin correctly or how the plugin works.
I created the following test file: Code:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
The plugin displayed the following message in the Plugin Runner dialog box: Code:
ID: Section0001.xhtml href: Text/Section0001.xhtml
I had to click the Cancel button in the Tk window to terminate the plugin. |
Thanks for the feedback.
I removed my debugging code from the plugin and this seems to have caused a problem - I probably removed something that I should have left in place. I will try to work out what has happened. |
I have fixed a bug in this plugin and uploaded it to the first post in this thread.
The plugin should close automatically, update the ePub file and display the changes made in the Plugin Runner dialog box. |
Quote:
Can someone who uses a Linux distro other than Debian Jessie or a Mac please test the plugin with the following test file? Code:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
(This should merge the two broken sentences.) |
Doitsu -- I am running Arch Linux, so my python is the latest version (3.4.3). :D
Installed the plugin, entered your test file, ran the plugin.... clicked OK... Code:
ID: Section0001.xhtml href: Text/Section0001.xhtml
... Ah, but if I click Cancel it reports success. No changes, just success. :rolleyes: |
First of all... Thanks for your work :)
It saves me some time on manual editing :P I want to ask you something... In Greek, the ePub sometimes contains 'Ε or "Ε instead of Έ. Is there any way to add this to the plugin's checks? It doesn't need to be added to the plugin for everyone. I would like to try it first to see if it works fine :) Thanks EDIT: Found it :P |
I believe the problem on Linux is the path specified in the plugin for the dictionary (I don't have Linux so I can't confirm this). I have updated the plugin in the first post in this thread so that when the plugin is run for the first time, it asks for the location and filename of the dictionary (see the epub in the first post for details) that is used for correcting hyphenated words that should not be hyphenated. Hopefully this will resolve the problem in Linux so that it will not run and run, nor require the Cancel button to be pressed to exit.
I have improved the plugin for working with Chapter headings: Some words such as 'an' do not normally start with a capital letter when the heading is in titlecase. I have amended the plugin so that these words are now in lower case when titlecase is selected in the plugin. If you come across any words that should be lowercase but appear in titlecase then please let me know and I will update in the next version of this plugin. With the previous version of the plugin when titlecase is applied to a chapter heading the first Roman numeral is capitalised and the remainder are in lower case; I have added an option to the 'Format chapter titles' dialog so that the user can select the required case for Roman numerals when title case is applied. The plugin does require version 3.4 of Python - I should have mentioned this sooner. @DiapDealer: Please remove the posts concerning the debate on the version of Python that is used as this detracts from the purpose of this thread. Thanks. @davidfor: This plugin is for Sigil - my user name is misleading. Originally I joined the forum when there were no plans to develop Sigil further, so I chose my user name as CalibUser; when I found out that Sigil would continue to be developed I carried on using Sigil as my preferred ePub editor - I don't think it's possible to change user names. However, I do use Calibre for other functions. |
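The title-case behaviour described above could be sketched roughly as follows. This is not the plugin's actual code: the `title_case` function, the `SMALL_WORDS` list and the `roman_upper` flag are hypothetical names chosen for illustration, and the list of small words is an assumption.

```python
import re

# Assumed list of words that stay lower case inside a title-case heading.
SMALL_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to"}

# Matches a whole word that is a well-formed Roman numeral (case-insensitive).
# The lookahead stops the otherwise all-optional pattern matching an empty string.
ROMAN = re.compile(
    r"(?=[ivxlcdm])m{0,3}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})",
    re.IGNORECASE,
)

def title_case(heading, roman_upper=True):
    """Title-case a heading, keeping small words lower case (hypothetical helper)."""
    words = []
    for index, word in enumerate(heading.split()):
        lower = word.lower()
        if ROMAN.fullmatch(lower):
            # The user chooses the case of Roman numerals.
            words.append(lower.upper() if roman_upper else lower)
        elif index > 0 and lower in SMALL_WORDS:
            # Small words are lower case unless they start the heading.
            words.append(lower)
        else:
            words.append(lower.capitalize())
    return " ".join(words)

print(title_case("the tale of an old king"))
print(title_case("chapter xiv"))
print(title_case("chapter xiv", roman_upper=False))
```

One limitation of this sketch is that ordinary words that happen to be valid Roman numerals (for example "mix") are treated as numerals, so a real implementation would need a stricter rule.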
@CalibUser: I've just tested the updated plugin on my Linux machine and it appears to be working fine. (I only tested the line break fix.)
|
The fix for false line breaks doesn't work for Greek text.
I use the following regex to fix the line breaks. Code:
Find: ([\p{Greek},'–’“”][</ib>]*)</p>\s+<p>([<ib>]*[\p{Greek},'–’“”])
Code:
if allBreaks == 'Yes':
Code:
if allBreaks == 'Yes':
I don't know any Python. Is my code OK? Thanks :) |
@Doitsu: Thanks for testing the plugin
@gipsy: In your code: Code:
r'([\p{Greek}\,\'–’“”][</ib>]*)</p>\s+<p>([<ib>]*[\p{Greek},\'–’“”])' |
You are right. But they still don't combine :(
Yes, it is for Greek characters. Code:
<p>ο Πυθέας ήπιε το υπόλοιπο</p>
Code:
<p>ο Πυθέας ήπιε το υπόλοιπο γάλα από το κύπελλο, σκούπισε δυο σταγόνες στα χείλη του με την ανάστροφη του χεριού του και σηκώθηκε.</p> |
Quote:
AFAIK, Python doesn't support the \p{Greek} syntax. I.e., Greek letters need to be explicitly expressed as Unicode ranges (0370–03FF). |
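As a rough illustration of that suggestion, the regex from the earlier post can be rewritten for Python's standard `re` module by replacing `\p{Greek}` with the explicit range `\u0370-\u03FF`. This is only a sketch: the `merge_greek_breaks` function name is made up, and the range covers only the basic Greek and Coptic block (polytonic text in the Greek Extended block is not handled).

```python
import re

# \p{Greek} is not available in the standard re module, so the Greek and Coptic
# block (U+0370 to U+03FF) is spelled out as an explicit range instead.
GREEK_BREAK = re.compile(
    r"([\u0370-\u03FF,'–’“”][</ib>]*)</p>\s+<p>([<ib>]*[\u0370-\u03FF,'–’“”])"
)

def merge_greek_breaks(html):
    """Join a paragraph that ends mid-sentence with the one that follows it."""
    return GREEK_BREAK.sub(r"\1 \2", html)

broken = "<p>ο Πυθέας ήπιε το υπόλοιπο</p>\n<p>γάλα από το κύπελλο.</p>"
print(merge_greek_breaks(broken))
```

A paragraph that ends with a full stop is left alone, because the full stop is not in the character class before `</p>`.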
Quote:
Code:
userProfile = (os.environ['USERPROFILE']) #Get path to user profile
For that matter, it is highly environment-specific -- it would also break hard on a PortableApps.com install, for example. Is there any way in Sigil/the plugin container to access the value of the Sigil configuration folder? This would be a far, far better way of handling it. (If there isn't a way, then it would be a generally useful thing to have...) Asking the user to manually select the dictionary just to get around the issue of finding the configuration directory is overkill (and slightly onerous) -- although it could be useful if one has multiple dictionaries and wants to use a specific one, that is probably an edge case. EDIT: And of course the instructions already make it clear that that won't work. |
FWIW,
The next release of Sigil will include an interface to the hunspell spellchecker and will provide a list of paths to the hunspell dictionaries. If I can figure out how best to bundle sigil's version of gumbo for use by plugins, and if DiapDealer and I can fix some bugs, we should have a release out in 2 or 3 weeks. Kevin |
1 Attachment(s)
Quote:
@CalibUser: Python has a boatload of built-in functions for cross-platform file handling that make it really easy to implement cross-platform file support. Since the Sigil plugin root directory and the user_dictionary directory are sibling directories it's relatively easy to get the user_dictionary directory location. For example, you could use the following code to get the dictionary folder: Code:
import os, inspect
Windows: Code:
C:\Users\Doitsu\AppData\Local\sigil-ebook\sigil\plugins\test
Linux: Code:
/home/doitsu/.local/share/sigil-ebook/sigil/plugins/test |
Quote:
I changed it to Code:
if allBreaks == 'Yes':
Quote:
|
@CalibUser: Here are some fixes for Greek text, if you want to include them in your plugin. I am trying to find solutions for some other things too, and I will keep you posted :D
Code:
#Greek line break fix
Code:
#Fixes Έ when PDFd as 'Ε or "Ε |
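The code for this fix did not survive in the post above, so here is a minimal sketch of the idea rather than gipsy's actual code. Only the Ε to Έ case is mentioned in the thread; the other vowels in the table are assumed analogues, and `fix_accented_capitals` is a hypothetical name. Code points are written as escapes so the source can be edited in an ASCII-only editor.

```python
import re

# Plain capital vowel -> capital vowel with tonos.
# Only the first-listed case in the thread (E -> accented E) is confirmed;
# the remaining entries are assumed to behave the same way.
TONOS = {
    "\u0391": "\u0386",  # Alpha
    "\u0395": "\u0388",  # Epsilon (the case reported in the thread)
    "\u0397": "\u0389",  # Eta
    "\u0399": "\u038A",  # Iota
    "\u039F": "\u038C",  # Omicron
    "\u03A5": "\u038E",  # Upsilon
    "\u03A9": "\u038F",  # Omega
}

# A straight apostrophe or double quote directly followed by one of the vowels.
STRAY_ACCENT = re.compile("['\"]([" + "".join(TONOS) + "])")

def fix_accented_capitals(text):
    """Replace 'E or "E style OCR output with the single accented capital."""
    return STRAY_ACCENT.sub(lambda match: TONOS[match.group(1)], text)

print(fix_accented_capitals("'\u0395\u03bd\u03b1"))
```

Note that this would also alter a genuine opening quotation mark placed directly before a word that starts with one of these capitals, so the changes should be checked by eye.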
Quote:
I would, however suggest something other than the relatively fragile method of converting a path to a list of strings and then using the [:7] slice to strip off the last two directories. If the depth of that path ever increases, it won't point to the sigil preferences directory anymore. To be clear: it's the [:7] slice I find fragile, not the list of strings conversion and eventual re-joining. I would suggest using [:-2] if you're going to split the path into a list of strings that later get rejoined. Or just use os.path.dirname twice without converting to a list of strings and rejoining. It's all a bit fragile I guess (even mine), considering that the plugin directory could conceivably change in relation to the Sigil preferences directory. Code:
import os, inspect
You could also determine the path of the current plugin script in the run method of a plugin by using: Code:
def run(bk): |
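To make the "use os.path.dirname twice" suggestion concrete, here is a small sketch. It assumes, as described above, that the plugin lives in `<prefs>/plugins/<name>` and that the dictionaries sit in a sibling `user_dictionaries` folder; the function names are hypothetical and this is not the code from either post.

```python
import os

def sigil_prefs_dir(plugin_dir):
    """Strip the last two path components (e.g. 'plugins/test') from plugin_dir."""
    return os.path.dirname(os.path.dirname(os.path.normpath(plugin_dir)))

def user_dictionary_dir(plugin_dir):
    """Return the dictionary folder that is a sibling of the plugins folder."""
    return os.path.join(sigil_prefs_dir(plugin_dir), "user_dictionaries")

# Example plugin location taken from the Linux path quoted earlier in the thread.
example = "/home/doitsu/.local/share/sigil-ebook/sigil/plugins/test"
print(user_dictionary_dir(example))
```

Unlike a fixed `[:7]` slice, this does not depend on how deep the preferences folder sits, although it still breaks if the plugin folder ever moves relative to the preferences folder.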
Code:
#Greek line break fix |
Quote:
Quote:
(I guess it really depends on the app. I know they prefer if at all possible to not do that, it reduces the "portability" angle by potentially leaving unwanted cruft on the host computer.) :chinscratch: It doesn't look like there is any way to override the settings folder location in Sigil. (And it uses the deprecated-since-5.4 DataLocation, rather than AppDataLocation on Windows and AppConfigLocation on unix -- did Qt have to split it? :blink: -- which explains why the config folder is in ~/.local/share/sigil-ebook -- I have always wondered at that non-standard location.) |
Quote:
And there's just no "real" pressing need to convert and potentially lose user-settings/plugins in an upgrade (or create a one-time script to copy stuff to the new location). Maybe someday it will change, but it's just not high on the list of priorities at the moment. |
Quote:
Whether either is *necessary*, I won't venture to say. I agree once it's been used you shouldn't break everyone's settings just to conform to more "proper" standards. |
Thanks for all these suggestions and comments. When I get time, I will look at implementing some of the ideas presented above:
@Doitsu: Thanks for the directory code and experimental plugin - I will experiment with your plugin as soon as I have time. @gipsy: Thanks for the code for Greek ePubs. I will incorporate this code in the next version of the plugin. @DiapDealer: As you do not really recommend accessing script properties/methods directly, I will try the solution offered by Doitsu; I will update from Doitsu's solution when the hunspell/dictionaries interface is incorporated into the plugin launcher framework. |
The plugin has been updated so that it will automatically find the folder for the spelling dictionary using code suggested by Doitsu and DiapDealer.
I have also incorporated code from gipsy to manage Greek letters. @gipsy: I had to represent the Greek characters as Unicode numbers since my editor cannot handle Unicode characters! If you get time, please check that the code works for Greek texts in case I have mistyped the Unicode numbers. |
@CalibUser
Change them to this and they are fine :) EDIT: Sorry, they didn't work with the replacement written as a Unicode escape code. For example, "γΰρω" is changed to "γ\u03CDρω". EDIT 2: For some reason the hyphen fix doesn't work for me now. :blink: I think I found the reason... In Windows, ePubTidyTool.json has the DictFile path as Code:
"DictFile": "C:\\Users\\pm\\AppData\\Local\\sigil-ebook\\sigil\\user_dictionaries\\WordDictionary.txt",
Code:
"DictFile": "C:/Users/pm/AppData/Local/sigil-ebook/sigil/hunspell_dictionaries/WordDictionary.txt", |
Quote:
Change the following line from: Code:
CorrectText("Changed \u03CD to \u03B0", r'\u03B0', r'\u03CD')
to: Code:
CorrectText("Changed \u03CD to \u03B0", r'ΰ', r'ύ') |
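A likely explanation for the "γ\u03CDρω" result, offered as an assumption rather than a confirmed diagnosis: in a raw string such as r'\u03CD', Python itself leaves the backslash sequence untouched. The `re` module understands `\uXXXX` in a search pattern, but not in a replacement string, so on Python 3.4 the literal text was inserted (newer Python versions raise an error instead). If the editor cannot store Greek letters, writing the escapes in ordinary (non-raw) strings avoids the problem. The sketch below calls `re.sub` directly, since the signature of the plugin's `CorrectText` is not shown in full.

```python
import re

# Non-raw strings: Python converts \u03B0 and \u03CD into the actual characters
# before the re module ever sees them, so both pattern and replacement are safe.
WRONG = "\u03B0"   # upsilon with dialytika and tonos
RIGHT = "\u03CD"   # upsilon with tonos

word = "\u03B3\u03B0\u03C1\u03C9"        # the misspelt example from the thread
fixed = re.sub(re.escape(WRONG), RIGHT, word)
print(fixed)
```

This keeps the source file pure ASCII while producing the same result as pasting the Greek letters in directly.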
That's correct, Doitsu :P
I'm going to send the code to CalibUser because his editor cannot handle Greek characters. |
CalibUser, if you can copy and paste them into your editor, here are some fixes for now.
Or tell me how to send them to you :) Code:
#------------------------ Greek character corrections ------------- |
I have updated the plugin to process Greek errors as suggested by Gipsy - I haven't been able to test the update using a Greek text as I am not familiar with this language.
|
Quote:
Thanks, CalibUser. They work fine. The only problem is that it doesn't process the hyphens. Maybe Windows doesn't recognize the path in ePubTidyTool.json Code:
"DictFile": "C:\\Users\\owner\\AppData\\Local\\sigil-ebook\\sigil\\user_dictionaries\\WordDictionary.txt",
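One way to narrow this down is sketched below. The doubled backslashes are normal JSON escaping and decode to single backslashes, so they are probably not the cause in themselves; it may be more useful to check whether the decoded path actually points at an existing file. The `dict_file_from_prefs` helper is hypothetical, and the JSON text is just the line quoted above wrapped in braces.

```python
import json
import os

# The "DictFile" entry as it appears in the JSON file (raw string, so the
# doubled backslashes are kept exactly as they are stored on disk).
PREFS_TEXT = r'{"DictFile": "C:\\Users\\owner\\AppData\\Local\\sigil-ebook\\sigil\\user_dictionaries\\WordDictionary.txt"}'

def dict_file_from_prefs(json_text):
    """Decode the preferences text and return the dictionary path it contains."""
    return json.loads(json_text)["DictFile"]

path = dict_file_from_prefs(PREFS_TEXT)
print(path)                   # the path with single backslashes
print(os.path.isfile(path))   # whether that file exists on this machine
```

If the second line prints False on the affected machine, the file is missing or in a different folder, which would point to the location of the dictionary rather than to the way the path is written.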