MobileRead Forums: Plugin for tidying ePub files (https://www.mobileread.com/forums/showthread.php?t=264378)

CalibUser 09-06-2015 05:05 PM

Thanks for all these suggestions and comments. When I get time, I will look at implementing some of the ideas presented above:

@Doitsu: Thanks for the directory code and experimental plugin - I will experiment with your plugin as soon as I have time.

@gipsy: Thanks for the code for Greek ePubs. I will incorporate this code in the next version of the plugin.

@DiapDealer: As you do not really recommend accessing script properties/methods directly, I will try the solution offered by Doitsu; I will update from Doitsu's solution when the hunspell dictionaries are incorporated into the plugin launcher framework.

Doitsu 09-06-2015 05:10 PM

Quote:

Originally Posted by CalibUser (Post 3165756)
@Doitsu: Thanks for the directory code and experimental plugin - I will experiment with your plugin as soon as I have time.

Even though my code works, you may want to use the updated version by DiapDealer, because his version is more robust and also more elegant.

JSWolf 09-06-2015 06:43 PM

Quote:

Originally Posted by Doitsu (Post 3165758)
Even though my code works, you may want to use the updated version by DiapDealer, because his version is more robust and also more elegant.

But there's no attachment for the plugin with the changes.

CalibUser 09-09-2015 03:39 PM

The plugin has been updated so that it will automatically find the folder for the spelling dictionary using code suggested by Doitsu and DiapDealer.

I have also incorporated code from gipsy to manage Greek letters.

@gipsy: I had to represent the Greek characters as Unicode code numbers since my editor cannot handle Unicode characters! If you get time, please check that the code works for Greek texts in case I have mistyped any of the numbers.

gipsy 09-09-2015 07:25 PM

@CalibUser
Change them to this and they are fine :)

EDIT: Sorry, they didn't work: the replacement inserts the Unicode code as literal text.

For example, "γΰρω" is changed to "γ\u03CDρω"

EDIT 2: For some reason the hyphen fix doesn't work for me now. :blink:

I think I found the reason...
On Windows, ePubTidyTool.json has the DictFile path as
Code:

"DictFile": "C:\\Users\\pm\\AppData\\Local\\sigil-ebook\\sigil\\user_dictionaries\\WordDictionary.txt",
whereas in the previous version it was
Code:

  "DictFile": "C:/Users/pm/AppData/Local/sigil-ebook/sigil/hunspell_dictionaries/WordDictionary.txt",

Doitsu 09-10-2015 06:07 AM

Quote:

Originally Posted by gipsy (Post 3167868)
EDIT: Sorry, they didn't work: the replacement inserts the Unicode code as literal text.

For example the "γΰρω" is changed to "γ\u03CDρω"

Because of the rather counterintuitive way that Python handles Unicode strings, you'll have to use the actual characters instead of the Unicode escape codes if you want to avoid the whole Python Unicode encode/decode mess.

Change the following line from:

Code:

        CorrectText("Changed \u03CD to \u03B0", r'\u03B0', r'\u03CD')
to

Code:

        CorrectText("Changed \u03CD to \u03B0", r'ΰ', r'ύ')
This'll change γΰρω to γύρω.
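
To illustrate the difference, here is a minimal sketch, assuming Python 3. It is not the plugin's code, and the variable names are made up for the example. In a raw string the escape stays as six literal characters, whereas a normal string literal turns it into the single Greek letter.

Code:

```python
import re

raw = r'\u03CD'    # raw string: backslash, u, 0, 3, C, D (six characters)
cooked = '\u03CD'  # normal string: the single character ύ

print(len(raw), len(cooked))

# With normal (non-raw) literals the substitution works on real characters.
text = 'γ\u03B0ρω'                        # the misrecognized word γΰρω
fixed = re.sub('\u03B0', '\u03CD', text)  # replace ΰ with ύ
print(fixed)
```

Typing the characters directly should have the same effect, provided the source file is saved as UTF-8.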

gipsy 09-10-2015 06:14 AM

That's correct, Doitsu :P
I'm gonna send the code to CalibUser because his editor cannot handle Greek characters.

gipsy 09-14-2015 05:03 AM

CalibUser, if you can copy and paste them into your editor, here are some fixes for now.
Or tell me how to send them to you :)
Code:

#------------------------ Greek character corrections -------------

        #Fixes '…' when PDFd as ...
        CorrectText("Changed ... to …", r'\.\.\.', r'…')

        #Fixes 'στη' when PDFd as σιη
        CorrectText("Changed σιη to στη", r'σιη', r'στη')

        #Fixes 'στον'/'στο' when PDFd as σιον/σιο
        CorrectText("Changed σιον/σιο to στον/στο", r' σι(ον|ο) ', r' στ\1 ')

        #Fixes 'στις' when PDFd as σιις
        CorrectText("Changed σιις to στις", r'σιις', r'στις')
       
        #Fixes 'Άκουσ' when PDFd as Ακόυσ
        CorrectText("Changed Ακόυσ to Άκουσ", r'Ακόυσ', r'Άκουσ')
       
        #Fixes 'γι’' when PDFd as γΓ,γΡ
        CorrectText("Changed γΓ γΡ to γι’", r'(γΓ|γΡ)', r'γι’')

        #Fixes 'ντι' when PDFd as νπ
        CorrectText("Changed νπ to ντι", r'νπ', r'ντι')
       
        #Fixes 'Γι’' when PDFd as ΓΓ
        CorrectText("Changed ΓΓ to Γι’", r'ΓΓ ', r'Γι’ ')

        #Fixes 'σχεδίαζ' when PDFd as σχέδιαζ
        CorrectText("Changed σχέδιαζ to σχεδίαζ", r'σχέδιαζ', r'σχεδίαζ')
       
        #Fixes '\u0388' when PDFd as 'E "E
        CorrectText("Changed 'E,\"E to \u0388", r'(\'|\")(\u0395)', r'Έ')

        #Fixes \u038E when PDFd as 'Y or "Y
        CorrectText("Changed 'Y,\"Y to \u038E", r'(\'|\")(\u03A5)', r'Ύ')

        #Fixes \u038A when PDFd as 'I or "I
        CorrectText("Changed 'I,\"I to \u038A", r'(\'|\")(\u0399)', r'Ί')

        #Fixes \u038C when PDFd as 'O or "O
        CorrectText("Changed 'O,\"O to \u038C", r'(\'|\")(\u039F)', r'Ό')

        #Fixes \u0386 when PDFd as 'A or "A
        CorrectText("Changed 'A,\"A to \u0386", r'(\'|\")(\u0391)', r'Ά')

        #Fixes \u0389 when PDFd as 'H or "H
        CorrectText("Changed 'H,\"H to \u0389", r'(\'|")(\u0397)', r'Ή')

        #Fixes \u038F when PDFd as '\u03C9 or "\u03C9
        CorrectText("Changed '\u03C9,\"\u03C9 to \u038F", r'(\'|\")(\u03C9)', r'Ώ')

        #Fixes \u03CD when PDFd as \u03B0
        CorrectText("Changed \u03B0 to \u03CD", r'ΰ', r'ύ')

        #Fixes έ when PDFd as ε'
        CorrectText("Changed ε' to έ", r'ε\'', r'έ')
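
For readers who have not seen the plugin source, the sketch below shows one way a rule like those above could be applied with Python's re module. The function correct_text and its four-argument signature are assumptions made for illustration; the plugin's real CorrectText may work differently.

Code:

```python
import re

def correct_text(description, pattern, replacement, text):
    """Apply one regex rule; return the new text and the number of changes.

    Hypothetical helper, not the plugin's actual CorrectText.
    """
    new_text, count = re.subn(pattern, replacement, text)
    if count:
        print(f"{description}: {count} replacement(s)")
    return new_text, count

sample = 'Πήγε σιο σπίτι...'
sample, n1 = correct_text("Changed ... to …", r'\.\.\.', '…', sample)
sample, n2 = correct_text("Changed σιον/σιο to στον/στο",
                          r' σι(ον|ο) ', r' στ\1 ', sample)
print(sample)
```

The \1 in the second replacement reuses whatever the group (ον|ο) matched, so one rule covers both word forms.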


CalibUser 09-14-2015 03:13 PM

I have updated the plugin to process Greek errors as suggested by Gipsy - I haven't been able to test the update using a Greek text as I am not familiar with this language.

gipsy 09-14-2015 03:15 PM

Quote:

Originally Posted by CalibUser (Post 3170562)
I have updated the plugin to process Greek errors as suggested by Gipsy - I haven't been able to test the update using a Greek text as I am not familiar with this language.

I'm gonna test them :P
Thanks CalibUser

They work fine. The only problem is that it doesn't process the hyphens. Maybe Windows doesn't recognize the path in ePubTidyTool.json
Code:

  "DictFile": "C:\\Users\\owner\\AppData\\Local\\sigil-ebook\\sigil\\user_dictionaries\\WordDictionary.txt",

CalibUser 09-14-2015 03:39 PM

@gipsy: Strange that hyphens are not being processed. I have tested this facility on a Windows 7 PC and it is working (for English text). Are you having problems with Greek texts only, or are you also having problems with English text? Perhaps I need to go back to enabling the user to select the path for the dictionary. However, when the new version of Sigil is published I will be looking at using the built-in dictionaries, and hopefully that will resolve the problem.

gipsy 09-14-2015 04:09 PM

Quote:

Originally Posted by CalibUser (Post 3170580)
@gipsy: Strange that hyphens are not being processed. I have tested this facility on a Windows 7 PC and it is working (for English text). Are you having problems with Greek texts only, or are you also having problems with English text? Perhaps I need to go back to enabling the user to select the path for the dictionary. However, when the new version of Sigil is published I will be looking at using the built-in dictionaries, and hopefully that will resolve the problem.

It's as if it doesn't find (or correctly load) WordDictionary.txt.
I made a WordDictionary.txt with only 2 English words and an epub with those 2 words hyphenated.
It doesn't change them. Maybe it's a Windows 8 issue; I'm gonna test it in a Windows 7 virtual machine ;)

gipsy 09-15-2015 04:13 AM

I found a solution while searching...
In plugin.py I changed
Code:

        return(dictionary_path)
to
Code:

        return(dictionary_path).replace("\\","/")
and I get the path without the doubled backslashes in ePubTidyTool.json

It worked on Windows 8 and Windows 10. I don't know if it works on Linux or older Windows versions :o
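
As a rough sketch of the same idea outside the plugin, the standard library's pathlib can do the slash conversion as well. The variable names here are made up, and the path is the one quoted earlier in the thread.

Code:

```python
from pathlib import PureWindowsPath

# Example Windows path with backslash separators (raw string, so no escaping).
windows_path = r"C:\Users\pm\AppData\Local\sigil-ebook\sigil\user_dictionaries\WordDictionary.txt"

# The string replacement used in the fix above.
by_replace = windows_path.replace("\\", "/")

# Equivalent conversion via pathlib; PureWindowsPath can be used on any OS.
by_pathlib = PureWindowsPath(windows_path).as_posix()

print(by_replace)
print(by_pathlib)
```

Both give the same forward-slash form, which needs no escaping when written to a JSON file.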

Doitsu 09-15-2015 05:10 AM

Quote:

Originally Posted by gipsy (Post 3170868)
I found a solution while searching...
In plugin.py I changed
Code:

        return(dictionary_path)
to
Code:

        return(dictionary_path).replace("\\","/")
and I get the path without the doubled backslashes in ePubTidyTool.json

In JSON files, certain special characters (among them backslashes and double quotes) need to be "escaped" with a preceding backslash so as not to "confuse" the parser, which is why each backslash in a Windows path is written twice. Forward slashes can be written as they are.

I.e., what you've found is a feature, not a bug. If removing hyphens doesn't work, there's probably a problem with the encoding of the user dictionary, which needs to be saved as a UTF-8 file.

Re-save the user dictionary as a UTF-8 file with Windows Notepad and rerun the plugin.

If removing hyphens still doesn't work, attach a short Greek sample epub file with hyphenation issues and the user dictionary that you use.
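
A quick way to check the escaping point is a round trip through Python's json module. This is only a sketch, assuming Python 3; it is not taken from the plugin, and the variable names are made up.

Code:

```python
import json

# The JSON text as it appears in the file: every backslash is doubled.
raw_json = r'{"DictFile": "C:\\Users\\pm\\AppData\\Local\\sigil-ebook\\sigil\\user_dictionaries\\WordDictionary.txt"}'

# After parsing, the value holds single backslashes again.
path = json.loads(raw_json)["DictFile"]
print(path)

# Writing it back out doubles the backslashes once more.
print(json.dumps({"DictFile": path}))
```

So the doubled backslashes exist only in the file on disk; the path the program actually receives is an ordinary Windows path.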

gipsy 09-15-2015 05:36 AM

Doitsu, I tested it with a dictionary of 2 English words and an epub with those 2 words hyphenated, and it didn't work.
The dictionary is UTF-8 without BOM (as Notepad++ says).

OK. That's weird...
I tested the latest version of the plugin with a portable Sigil and the hyphen fix works fine :chinscratch:
It doesn't work with the installed version of Sigil :blink:

For it to work with the installed version of Sigil, the path in the JSON file must be
Code:

  "DictFile": "C:/Users/pm/AppData/Local/sigil-ebook/sigil/hunspell_dictionaries/WordDictionary.txt",
Damn Windows 8


All times are GMT -4.