10-07-2015, 06:13 PM | #91 |
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
|
I tested the following...
Code:
############ FIXES Π ###########
# spell(), CorrectText() and dictExists are defined elsewhere in the plugin.
def IsFixP(m):
    """
    Decide whether a misrecognised Π at the start of a word should be fixed.
    Called by a regular expression function (re.sub) in FixCommonErrors().
    Returns the match unchanged if the word is already in the dictionary or
    the corrected word is not; otherwise returns the word with Π restored.
    """
    FixP = "Π" + m.group(2)            # candidate with Π restored
    FixP2 = m.group(1) + m.group(2)    # word as the OCR produced it
    if spell(FixP2):
        return m.group(0)
    elif spell(FixP):
        print("FixP removed from: ", FixP)
        return "Π" + m.group(2)
    else:
        # Return the whole match so that an optional space inside it is kept
        return m.group(0)

############ FIXES έ ###########
def IsFixE(m):
    """
    Decide whether an έ that was misrecognised as ύ should be fixed.
    Called by a regular expression function (re.sub) in FixCommonErrors().
    Returns the match unchanged if the word is already in the dictionary or
    the corrected word is not; otherwise returns the word with έ restored.
    """
    FixE = m.group(1) + "έ" + m.group(2)     # candidate with έ restored
    FixE2 = m.group(1) + "ύ" + m.group(2)    # word as the OCR produced it
    if spell(FixE2):
        return FixE2
    elif spell(FixE):
        print("FixE removed from: ", FixE)
        return FixE
    else:
        return FixE2

####################
#Fixes Π in words that are misspelled
if dictExists:
    CorrectText("Π fixes", r"(1\ Ι|1\ Ι|1Ι|1I|ΓΙ|Γΐ|ΙΙ|II|Ι\ Ι|ΓΤ|ΙΊ|Ιί)[ ]?(\w+)(?![^<>]*>)(?!.*<body[^>]*>)", IsFixP)

#Fixes έ in words that are misspelled
if dictExists:
    CorrectText("έ fixes", r"(\w+)ύ(\w+)(?![^<>]*>)(?!.*<body[^>]*>)", IsFixE)

I was trying to write another one, but I can't figure out how to create the CorrectText call for («ρ|(ρ|4>|<ρ|ηι). The group can be at the start or in the middle of a word. If I can work out the regex, I have another group ready for a FixSomething function. I also attach an IncorrectWords list for Greek words.

CalibUser... FixP doesn't seem to fix the word when a lowercase letter follows (1\ Ι|ΓΙ|Γΐ|ΙΙ|II|Ι\ Ι|ΓΤ|ΙΊ|Ιί). When you have time, can you check the code for both? Thanks!
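One possible starting point for the («ρ|(ρ|4>|<ρ|ηι) pattern, as a minimal sketch only. The names GARBLED and build_pattern are hypothetical and not part of the plugin, and the tag-protecting lookaheads used in the plugin's patterns are left out here (sequences such as <ρ and 4> contain angle brackets, so they may need extra care next to real tags). The two points it illustrates are that a literal "(" has to be escaped inside the alternation, and that \w* (zero or more) rather than \w+ lets the sequence sit at the very start of a word.

```python
import re

# Garbled sequences copied from the post above; the list name is hypothetical.
GARBLED = ["«ρ", "(ρ", "4>", "<ρ", "ηι"]

def build_pattern(sequences):
    """Compile: optional word prefix, one garbled sequence, rest of the word."""
    # re.escape() neutralises regex metacharacters such as "(".
    alternation = "|".join(re.escape(s) for s in sequences)
    # \w* allows an empty prefix, so a match can begin at the start of a word.
    return re.compile(r"(\w*)(" + alternation + r")(\w+)")

pattern = build_pattern(GARBLED)

# Made-up test strings: sequence at the start, then in the middle of a word.
for sample in ["(ρος", "κα«ρος"]:
    m = pattern.search(sample)
    print(sample, "->", m.groups())
```

A callback in the style of IsFixP could then rebuild the word from group 1, the replacement letter and group 3, and ask the dictionary whether the result is a real word.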
EDIT: A suggestion for the future, if it's possible... I add about 400-500 words to a user dictionary per epub, and I edit the WordDictionary to add the new ones. Is it possible to change the plugin so that it doesn't use the WordDictionary but gets the words from Sigil's dictionary and selected user dictionaries? Something like the way Sigil finds the misspelled words in its spellcheck. Last edited by gipsy; 10-07-2015 at 06:21 PM. |
10-08-2015, 03:32 PM | #92 |
Addict
Posts: 202
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
Update to ePubTidy tool
I have uploaded a new version of the plugin in the first thread. The following changes have been made:
|
10-08-2015, 03:48 PM | #93 | |||
Addict
Posts: 202
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
Quote:
Quote:
Quote:
|
|||
10-08-2015, 04:04 PM | #94 | |
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
|
Quote:
But I wanted to check the words that are fixed by FixP & FixE, to see whether the code is working properly. But don't worry, I think I found out how to do it |
|
10-09-2015, 08:09 AM | #95 |
Member
Posts: 11
Karma: 10
Join Date: Dec 2013
Device: none
|
CalibUser,
I have been testing the plug-in and I think it is on its way to becoming a very useful tool. However, IMO it needs one important modification -- a means of "stepping through" certain types of changes. Some modifications can be automatic -- character replacement, tag changes, etc. However, some changes need to be monitored. For example, sometimes an OCR will miss periods at the end of a paragraph or add spurious lowercase letters to the end of sentences -- the correct fix is to add a period or delete the character...*not* to join paragraphs. Also, many words (like arid/and, modem/modern, etc) may or may not be errors -- the user needs to make that decision based on context. Adding the ability to step through word lists and paragraph joins-- rather than implementing them *all* automatically -- will prevent the tool from generating a new set of errors while correcting the old ones. As far as I can see, this change will make the plug-in the most useful item in my pulp-conversion toolbox. Thanks. |
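A rough sketch of what "stepping through" could look like in code, under stated assumptions: review_replacements and decide are hypothetical names, not functions of the plugin, and the scripted answers below stand in for the accept/reject buttons a real dialog would provide. Each match is offered to a decision function together with some surrounding context, and only approved matches are changed.

```python
import re

def review_replacements(text, pattern, replacement, decide):
    """Apply `replacement` only to the matches that `decide` approves."""
    def callback(m):
        # Pass up to 20 characters of context on each side of the match.
        start = max(m.start() - 20, 0)
        end = min(m.end() + 20, len(text))
        approved = decide(m.group(0), text[start:end])
        return replacement if approved else m.group(0)
    return re.sub(pattern, callback, text)

# Scripted answers simulate a user accepting the first match and
# rejecting the second; a real tool would ask interactively.
answers = iter([True, False])

def decide(word, context):
    return next(answers)

sample = "In modem times the modem was slow."
result = review_replacements(sample, r"\bmodem\b", "modern", decide)
print(result)
```

Because the decision is taken per match, the same word can be corrected in one sentence and left alone in another.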
10-09-2015, 01:06 PM | #96 |
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
|
From my tests and edits I can say the following.
The IncorrectWords list works fine with the latest version; it finds the whole word. Joining paragraphs works fine; there are only some errors with subtitles within the text, and there aren't too many of them. I commented out some replacements in the plugin file (the sup 5, \ etc.) because in Greek the "λ" is sometimes recognized as \. The hyphen fix works fine with the dictionary support. The spans need some work: with the upper-only option we sometimes also have italics within the small-caps span, so maybe an italics-upper option could be added to the selection menu. The Greek FixP and FixE seem to work fine in my tests. The counter of corrections made is off, but that's OK; it counts all the matches found, not only the ones that were changed. I will attach the code here, CalibUser; could you add some text to the plugin output window telling the user to check that the changed words are OK? |
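The counter behaviour described above (all matches counted, not only the changed ones) can be illustrated with a small sketch. This is not the plugin's code: sub_and_count, fix_accent and KNOWN_WORDS are hypothetical, and the one-word set is a toy stand-in for the real dictionary lookup. The wrapper compares what the callback returns with the original match and counts only real changes.

```python
import re

# Toy stand-in for the plugin's dictionary lookup (an assumption).
KNOWN_WORDS = {"μέρα"}

def fix_accent(m):
    """Swap ύ for έ only when the result is a known word."""
    candidate = m.group(1) + "έ" + m.group(2)
    return candidate if candidate in KNOWN_WORDS else m.group(0)

def sub_and_count(pattern, callback, text):
    """Run re.sub and count only the matches whose text actually changed."""
    changed = 0

    def counting_callback(m):
        nonlocal changed
        result = callback(m)
        if result != m.group(0):
            changed += 1
        return result

    return re.sub(pattern, counting_callback, text), changed

PATTERN = r"(\w+)ύ(\w+)"
text = "μύρα κύμα"
found = len(re.findall(PATTERN, text))   # every match the regex finds
new_text, changed = sub_and_count(PATTERN, fix_accent, text)
print(found, changed, new_text)
```

Here the regex finds two candidate words but only one of them is rewritten, which is the gap between "finds" and "changes" mentioned in the post.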
10-09-2015, 01:20 PM | #97 |
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
|
@CalibUser
These are OK. Could you add a message for the user, like the one shown when "Fix line breaks" hasn't been ticked, or in the Plugin Runner message window? Something like... "Please check the FixP & FixE words!!!"
Code:
############ FIXES Π ###########
# spell(), CorrectText() and dictExists are defined elsewhere in the plugin.
def IsFixP(m):
    """
    Decide whether a misrecognised Π at the start of a word should be fixed.
    Called by a regular expression function (re.sub) in FixCommonErrors().
    Returns the match unchanged if the word is already in the dictionary or
    the corrected word is not; otherwise returns the word with Π restored.
    """
    FixP = "Π" + m.group(2)            # candidate with Π restored
    FixP2 = m.group(1) + m.group(2)    # word as the OCR produced it
    if spell(FixP2):
        return m.group(0)
    elif spell(FixP):
        print("FixP: ", FixP2, " changed to ", FixP)
        return "Π" + m.group(2)
    else:
        # Return the whole match so that an optional space inside it is kept
        return m.group(0)

############ FIXES έ ###########
def IsFixE(m):
    """
    Decide whether an έ that was misrecognised as ύ should be fixed.
    Called by a regular expression function (re.sub) in FixCommonErrors().
    Returns the match unchanged if the word is already in the dictionary or
    the corrected word is not; otherwise returns the word with έ restored.
    """
    FixE = m.group(1) + "έ" + m.group(2)     # candidate with έ restored
    FixE2 = m.group(1) + "ύ" + m.group(2)    # word as the OCR produced it
    if spell(FixE2):
        return FixE2
    elif spell(FixE):
        print("FixE: ", FixE2, " changed to ", FixE)
        return FixE
    else:
        return FixE2
Code:
#Fixes Π in words that are misspelled
if dictExists:
    CorrectText("Π fixes", r"(1\ Ι|1\ Ι|1Ι|1I|ΓΙ|Γΐ|ΙΙ|II|Ι\ Ι|ΓΤ|ΙΊ|Ιί)[ ]?(\w+)(?![^<>]*>)(?!.*<body[^>]*>)", IsFixP)

#Fixes έ in words that are misspelled
if dictExists:
    CorrectText("έ fixes", r"(\w+)ύ(\w+)(?![^<>]*>)(?!.*<body[^>]*>)", IsFixE)
|
10-10-2015, 07:07 AM | #98 | |
Addict
Posts: 202
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
Quote:
Currently this plugin resolves the first situation, as this was relatively straightforward to implement: it uses a word list to automatically correct words that OCR readers/converters misspell in the same way every time and that have only one possible correct spelling. I will consider adding a feature that offers alternative words for corrections to resolve the second situation; however, I don't have much time to develop the plugin (at the moment I am only carrying out 'tweaks'), so it may be a while before I can add this feature.

Similarly, paragraph joins can be an issue and some manual searching is necessary. The plugin will automatically join paragraphs that end with a hyphen to the next paragraph, paragraphs that begin with a lowercase letter to the previous one, paragraphs that end with Mrs.|Mr.|Dr.|St. to the next one and, if you tick the option 'Fix all broken line endings', paragraphs that end with a lowercase letter to those that begin with an uppercase letter. If you do not tick this option, the plugin should not join paragraphs that have any other types of errors (e.g. it should not join a paragraph that ends with a lowercase letter to the next paragraph if that paragraph begins with a capital letter or punctuation mark). If you find that it does join such paragraphs with the option unticked, please let me know and give an example of two paragraphs that are incorrectly being joined.

You can use the following regular expression to do a manual Find/Replace for paragraphs that have not been corrected automatically:
Find: ([a-z])</p>\s+<p>
Replace: \1 {There is a space after \1}

I may, in a future version, show each incorrectly terminated paragraph and provide the option to correct it manually if there is enough demand for this feature. |
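For anyone who wants to try the Find/Replace pair outside Sigil, here is a minimal sketch of the same expression with Python's re.sub. The sample HTML is made up for illustration. Note that [a-z] only covers unaccented Latin lowercase letters, so paragraphs ending in Greek or accented letters would not be matched by this exact expression.

```python
import re

# Made-up sample: the first paragraph was broken mid-sentence.
html = "<p>The rain fell all</p>\n<p>night long.</p>\n<p>Morning came.</p>"

# Same Find/Replace as above: keep the lowercase letter, replace the
# paragraph break with a single space.
joined = re.sub(r"([a-z])</p>\s+<p>", r"\1 ", html)
print(joined)
```

The second break is left alone because the paragraph before it ends with a full stop rather than a lowercase letter.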
|
10-10-2015, 07:08 AM | #99 | |||
Addict
Posts: 202
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
Quote:
Quote:
Quote:
This is a bug that I will need to fix! Thanks for pointing it out. |
|||
10-10-2015, 07:53 AM | #100 | |
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
|
Quote:
I commented out these lines in the HTMLProcessor:
Code:
CorrectText("Corrected <sup>5 and <sup>9", r"""<sup>[59]</sup>""", r'’')
CorrectText("Corrected <sup>6</sup>", r"""<sup>6</sup>""", r'‘')
CorrectText("Corrected / with quote mark", r"""(?s)([^<|>])(/)(?![^<>]*>)(?!.*<body[^>]*>)""", r'\1’')
CorrectText("Corrected / with quote 'I'", r""" / """, r' I ') #NB Could be 1 on more rare occasions

And the / ones, because in Greek text it is a period followed by an apostrophe on a vowel character. As for the counter, I mean the one in FixP and FixE. In those I noticed the difference between the changes made and the counter. But it's OK |
|
10-10-2015, 04:41 PM | #101 |
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
|
@CalibUser
Here is some text to explain which Greek corrections are made by the plugin
Code:
<h1>Greek Corrections</h1>
<ol>
<li>Corrects three full stops (<b>...</b>) to <b>…</b><br/></li>
<li>Corrects <b>ΐ]</b> to <b>η</b><br/></li>
<li>Corrects <b>σιη</b> to <b>στη</b> even when it is part of a word; <i>from the tests I have done this is safe.</i><br/></li>
<li>Corrects <b>οτη</b>, <b>οτο</b>, <b>οτον</b>, <b>οτα</b>, <b>οις</b>, <b>οην</b> to <b>στη, στο, στον, στα, στις, στην</b> as standalone words only, not as parts of words.<br/></li>
<li>Corrects <b>τοιν</b>, <b>τιον</b> to <b>των</b>.<br/></li>
<li>Corrects <b>οιί</b> to <b>ού</b><br/></li>
<li>Corrects <b>σιις</b> to <b>στις</b><br/></li>
<li>Corrects <b>σιο</b>, <b>σιου</b>, <b>σια</b> to <b>στο</b>,<b> σου</b>,<b> σα</b><br/></li>
<li>Corrects <b>ο'ι</b> to <b>ώ</b><br/></li>
<li>Corrects <b>γΓ</b>, <b>γΡ</b> to <b>γι’</b><br/></li>
<li>Corrects <b>νπ</b> to <b>ντι</b><br/></li>
<li>Corrects <b>ΓΓ</b> to <b>Γι’</b><br/></li>
<li>Converts capital vowels to their accented forms, e.g. <b>'Α</b>, <b>"Α</b> to <b>Ά</b><br/></li>
<li>Converts <b>ΰ</b> to <b>ύ</b>; <i>a few of these may be wrong, but very few.</i><br/></li>
<li>Corrects <b>ε'</b> to <b>έ</b><br/></li>
<li>Inserts a space after a final sigma (<b>ς</b>) that is followed by a letter.<br/></li>
<li>Corrects a <b>Π</b> that appears as <b>1 Ι,1Ι, ΓΙ, Γΐ, II, Ι Ι, ΓΤ, ΙΊ, Ιί</b> <u>provided the word that contains it is in the dictionary</u>. Otherwise it is left as is.<br/></li>
<li>Corrects an <b>έ</b> that appears as <b>ύ</b> <u>provided the word that contains it is in the dictionary</u>. Otherwise it is left as is.<br/></li>
</ol>
<p></p>
<p><b>WARNING: </b>For corrections 17 and 18 <u>it is a good idea to check the words that were changed. They are shown in the Plugin window as <b>FixP:</b> <b>FixE:</b></u><br/></p>
10-13-2015, 04:18 AM | #102 |
Connoisseur
Posts: 81
Karma: 10
Join Date: Nov 2013
Device: Kobo Aura HD
|
Two more, CalibUser
Code:
############ FIXES ώ ###########
# spell(), CorrectText() and dictExists are defined elsewhere in the plugin.
def IsFixO(m):
    """
    Decide whether a misrecognised ώ (read as ιό, οί, ιο, οι, etc.) should be fixed.
    Called by a regular expression function (re.sub) in FixCommonErrors().
    Returns the match unchanged if the word is already in the dictionary or
    the corrected word is not; otherwise returns the word with ώ restored.
    """
    FixO = m.group(1) + "ώ" + m.group(3)             # candidate with ώ restored
    FixO2 = m.group(1) + m.group(2) + m.group(3)     # word as the OCR produced it
    if spell(FixO2):
        return FixO2
    elif spell(FixO):
        print("FixΏ: ", FixO2, " changed to ", FixO)
        return FixO
    else:
        return FixO2

############ FIXES ω ###########
def IsFixW(m):
    """
    Decide whether a misrecognised ω (read as ιό, οί, ιο, οι, etc.) should be fixed.
    Called by a regular expression function (re.sub) in FixCommonErrors().
    Returns the match unchanged if the word is already in the dictionary or
    the corrected word is not; otherwise returns the word with ω restored.
    """
    FixW = m.group(1) + "ω" + m.group(3)             # candidate with ω restored
    FixW2 = m.group(1) + m.group(2) + m.group(3)     # word as the OCR produced it
    if spell(FixW2):
        return FixW2
    elif spell(FixW):
        print("FixΩ: ", FixW2, " changed to ", FixW)
        return FixW
    else:
        return FixW2
--------------------------------------------------------------------
# The inner group is written (?:ό|ο) so that the trailing (\w+) stays group 3;
# as a capturing (ό|ο) it would take number 3 itself and m.group(3) could be None.
# If literal parentheses were meant, escape them instead: \(ό|ο\)
#Fixes ώ in words that are misspelled
if dictExists:
    CorrectText("ώ fixes", r"(\w+)(ιίι|(?:ό|ο)|ίό|ο>|ο'ι|ιό|οί|ιο|οι|<ο|οϊ)(\w+)(?![^<>]*>)(?!.*<body[^>]*>)", IsFixO)

#Fixes ω in words that are misspelled
if dictExists:
    CorrectText("ω fixes", r"(\w+)(ιίι|(?:ό|ο)|ίό|ο>|ο'ι|ιό|οί|ιο|οι|<ο|οϊ)(\w+)(?![^<>]*>)(?!.*<body[^>]*>)", IsFixW)

EDIT: How can I modify the regex so that it also matches at the first and last characters of a word? I noticed that the patterns only work inside a word. Thanks! Last edited by gipsy; 10-14-2015 at 02:39 AM. |
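On the EDIT question, a likely explanation is that (\w+) on both sides demands at least one word character before and after the garbled sequence, so a sequence at the very start or end of a word can never match. A minimal sketch of the difference, using only the οι alternative and without the plugin's tag lookaheads (INSIDE_ONLY and ANYWHERE are names made up for this example):

```python
import re

# \w+ needs one or more word characters on each side of the sequence.
INSIDE_ONLY = re.compile(r"(\w+)(οι)(\w+)")
# \w* accepts zero or more, so the sequence may start or end the word.
ANYWHERE = re.compile(r"(\w*)(οι)(\w*)")

for word in ["οικ", "καλοι", "κοιτα"]:
    inside = INSIDE_ONLY.search(word)
    anywhere = ANYWHERE.search(word)
    print(word, inside.groups() if inside else None, anywhere.groups())
```

With \w* the outer groups can be empty strings, which still concatenate safely in callbacks such as IsFixO. A bare sequence on its own also becomes a match, so the dictionary check in the callback matters even more.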
11-01-2015, 02:34 PM | #103 |
Addict
Posts: 202
Karma: 62362
Join Date: Jul 2015
Device: Sony
|
Update for the ePub Tidy Tool - version v0.1.1.6 available
Update for the ePub Tidy Tool
A new version, v0.1.1.6, has been attached to the first article in this thread and the manual has been updated. This plugin has been tested on Windows 7 and requires that Python 3 is installed on your computer. The following features have been added:
To use the customised word list you need to install Beautiful Soup. Instructions for this are given in the manual for Windows 7; for other systems (Mac, Linux) please search the web.

Important: Beautiful Soup will change all HTML entities (e.g. &lsquo;) to a single character (in this case, a left single quote mark) when it processes text. To ensure that the text processed by Beautiful Soup matches the HTML file exactly, it is necessary to tick the box 'Replace HTML code eg &mdash;' to find all suspect words. This will change HTML entities in the ePub to the single characters that are used in the search.

The code that implements the manual word check is slow compared to the automatic word search. When you press a button to accept/reject changing a word, there may be a brief pause while the plugin finds the next paragraph that contains a suspect word. Despite this, it is faster to use the plugin than to use the normal Find/Search facility that is built into Sigil, where you would need to enter each suspect word manually and would risk leaving some out! |
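The entity issue described above can be reproduced with the standard library alone. This sketch uses html.unescape as a stand-in for the conversion Beautiful Soup performs (it is not the plugin's code, and the sample string is made up): once entities are decoded, a string found in the decoded text no longer occurs verbatim in the raw file.

```python
import html

# Raw markup as it might appear in an ePub file (made-up sample).
raw = "&lsquo;Hello&rsquo; &mdash; she said"

# Decoding turns each entity into a single character.
decoded = html.unescape(raw)

needle = html.unescape("&lsquo;Hello")   # what a search of decoded text sees
print(needle in decoded)                 # found in the decoded text
print(needle in raw)                     # not found in the raw markup
```

Replacing the entities in the file first, as the tick box does, keeps the two texts in step.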
11-01-2015, 04:36 PM | #104 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
Is this version supposed to work with the Python that comes along with Sigil 0.8.901?
Thanks. |
11-01-2015, 04:50 PM | #105 |
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Now that Sigil's plugin launcher includes an interface to libhunspell and a way to retrieve hunspell dictionaries, is this plugin going to learn how to read those directly?
|