01-13-2017, 09:37 AM | #31 | |||
Sigil Developer
Posts: 7,506
Karma: 5433350
Join Date: Nov 2009
Device: many
|
Take care, KevinH |
|||
01-13-2017, 10:55 AM | #32 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@KevinH...I've now made the encoding checks much stricter, more accurate and broader, as you advised. I now double-check the encoding: first I read the encoding from the html charset meta tag and compare it against a list of known encodings. Then the file is read in again as bytes and checked with chardet. If the chardet result and the meta tag result agree, I use that encoding; if they disagree, I always use the meta tag encoding instead. This seems to work pretty well.
For the meta tag encoding, I look for about 20 different encodings - Western European, Baltic, Slavic, Cyrillic, US etc. Generally I've used most of the Windows code pages and ISO-8859 code pages for the meta tag comparisons. I had to do this because the chardet results were so inaccurate. I might try UnicodeDammit tomorrow to see if it gives better results. Once the proper encoding is found, I can then safely convert the html file to utf-8 as required.
I have to say that I'm really pleased with the overall results. I have one html file that always gave masses of mixed encoding errors in the epub after plugin conversion. I ran that file through the plugin with the new encoding checker function and it flew through, loaded into Sigil with no encoding errors and passed EpubCheck first go, directly after conversion. Couldn't believe it. Really glad I took your advice because it's made such a heck of a difference to the conversions. I'm still testing the new encoding checker function, with no problems so far. I will try a UnicodeDammit/detwingle version tomorrow and let you know how it turns out. My thanks also for all your advice above. I will try solving the file name problem later. Last edited by slowsmile; 01-13-2017 at 11:29 AM. |
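The double check described in this post could be sketched roughly as follows. This is not the plugin's actual code: the function names, the regular expression and the 4096-byte search window are assumptions made for illustration, and the detector's guess is passed in as a plain string (for example the value of chardet.detect(raw)["encoding"]) so that the sketch needs only the standard library.

```python
import codecs
import re

# Hypothetical pattern: finds charset=... inside a <meta> tag, quoted or not.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def normalise(name):
    """Return Python's canonical codec name, or None if Python does not know it."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def meta_charset(raw):
    """Extract the encoding declared in the html meta tag, if any."""
    match = META_CHARSET_RE.search(raw[:4096])
    if match is None:
        return None
    return normalise(match.group(1).decode("ascii"))


def choose_encoding(raw, detector_guess=None):
    """Return (encoding, agreed). The meta tag wins whenever it is present."""
    declared = meta_charset(raw)
    guessed = normalise(detector_guess) if detector_guess else None
    agreed = declared is not None and declared == guessed
    if declared is not None:
        return declared, agreed
    return guessed, False
```

Normalising both names through codecs.lookup() means that spellings such as "windows-1250" and "cp1250" compare as equal, and a declared encoding that Python cannot decode is rejected early.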
01-13-2017, 11:04 AM | #33 |
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
@bravosx: Please install the attached junk plugin, run it (via Plugins > Edit > junk) and post the results.
@slowsmile: You can use locale.getpreferredencoding() to detect the code page of Windows machines and use it with decode(). If the junk plugin reports a code page other than cp1252 for bravosx, you might be able to use this information to make your plugin more "bulletproof." If it doesn't, the problem is most likely caused by something else. BTW, the junk plugin uses the following code: Spoiler:
For example, on my Windows machine I got the following results: Code:
OS: win32
Platform: Windows-10-10.0.14393-SP0
Python Version: 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]
Preferred Encoding: cp1252
And on a Linux machine: Code:
OS: linux
Platform: Linux-4.8.13-1-ARCH-x86_64-with-arch
Python Version: 3.6.0 (default, Dec 24 2016, 08:03:08) [GCC 6.2.1 20160830]
Preferred Encoding: UTF-8
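The plugin's code itself is not reproduced above. A minimal sketch that prints a report in the same shape, using only standard library calls, might look like this; the function name environment_report is an assumption, and the real plugin may gather these values differently.

```python
import locale
import platform
import sys


def environment_report():
    """Collect the four values shown in the reports above (hypothetical helper)."""
    return {
        "OS": sys.platform,
        "Platform": platform.platform(),
        # sys.version can contain a line break on some builds; flatten it.
        "Python Version": sys.version.replace("\n", " "),
        "Preferred Encoding": locale.getpreferredencoding(),
    }


for key, value in environment_report().items():
    print("{}: {}".format(key, value))
```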
|
01-13-2017, 11:46 AM | #34 |
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
|
@Doitsu
In the program I install the plugin? bravosx I already have. I got this result:
OS: win32
Platform: Windows-10-10.0.14393-SP0
Python Version: 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]
Preferred Encoding: cp1250
I just don't understand why it says OS: win32 when I have 64-bit Windows installed. Last edited by bravosx; 01-13-2017 at 11:51 AM. |
01-13-2017, 12:24 PM | #35 |
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Am I right in assuming that you're using a Polish (or other Eastern European) Windows version?
You can ignore the "OS: win32" line; it's a Python thing. The function I used can't tell 32-bit and 64-bit Windows versions apart. |
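As a small illustration of the point about "win32": sys.platform reports "win32" on every Windows build of Python, so other calls are needed to see the bitness. The sketch below uses only the standard library; the helper name interpreter_bits is an assumption, and note that it reports the bitness of the running Python interpreter, which is not necessarily that of the operating system.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Pointer size of the running Python interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


print("sys.platform:", sys.platform)      # "win32" on all Windows builds
print("interpreter:", interpreter_bits(), "bit")
print("machine:", platform.machine())     # hardware type as reported by the OS
```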
01-13-2017, 12:31 PM | #36 |
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
|
@Doitsu...
Yes, I use the Polish version of 64-bit Windows 10. I'm from Poland. Regards bravosx Last edited by bravosx; 01-13-2017 at 12:35 PM. |
01-13-2017, 06:23 PM | #37 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@Doitsu...Thanks for your advice -- I've already incorporated getpreferredencoding() into my encoding check function. It all really depends on which method of obtaining the encoding you trust the most. I trust the html meta tag method the most and the chardet method the least. I use the getpreferredencoding() method as a fallback. The chardet results are also terribly inaccurate, which is why I'll be testing UnicodeDammit/detwingle today as a possible substitute.
The check encoding function passes the discovered encoding to another function that converts the input file to utf-8. If there is an error in this converter function -- due to the discovered encoding being wrong -- then this function will throw an exception and the plugin app will not continue. My function (which I'm still testing) now looks like this: Spoiler:
Last edited by slowsmile; 01-13-2017 at 06:52 PM. |
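The function itself is not reproduced above. As a hedged illustration of the converter step only, a strict decode followed by a utf-8 write could be sketched like this; the name convert_to_utf8 and its parameters are assumptions, not the plugin's code.

```python
def convert_to_utf8(src_path, dst_path, encoding):
    """Re-encode a file to utf-8; raises UnicodeDecodeError if `encoding` does not fit."""
    with open(src_path, "rb") as src:
        raw = src.read()
    # Strict decoding: bytes that are invalid in `encoding` raise an exception
    # here, before the destination file is opened or touched.
    text = raw.decode(encoding)
    # newline="" keeps the original line endings unchanged.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

One caveat: a wrong guess does not always raise. Single-byte codecs such as iso-8859-1 accept every byte value, so a mislabelled file can decode without error and still produce wrong characters.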
01-13-2017, 07:47 PM | #38 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
Just discovered the one thing that chardet is very good at -- chardet is faultless in its detection of the whole family of UTF encodings including utf-7, utf-8, utf-8-sig, utf-16, utf-16BE, utf-32 etc. I've therefore changed my encoding detection function to only allow chardet results for the UTF family of encodings. This is as far as I can go, I think.
Last edited by slowsmile; 01-13-2017 at 07:49 PM. |
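Restricting a detector's results to the UTF family can be as simple as a name check. A minimal sketch, assuming the detector returns its guess as a string such as "UTF-16" or "windows-1252" (or None when it has no guess); the function name is hypothetical.

```python
def is_utf_family(name):
    """True if a detector's guess names a UTF encoding (utf-7, utf-8, utf-16, ...)."""
    if not name:
        return False
    # Accept spellings such as "UTF-8-SIG", "utf_16" or "utf-32".
    return name.lower().replace("_", "-").startswith("utf-")
```

Any guess that fails this check would then be ignored in favour of the meta tag or the fallback encoding.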
01-13-2017, 10:05 PM | #39 |
Sigil Developer
Posts: 7,506
Karma: 5433350
Join Date: Nov 2009
Device: many
|
slowsmile,
Sigil uses the following code to identify the encoding of an html file when File->Open is run on one:
https://github.com/Sigil-Ebook/Sigil...ngResolver.cpp
The algorithm looks about like this:
- read the file in as bytes
- check the first 4 bytes for byte order marks to identify utf-8, utf-16le, utf-16be, utf-32le, utf-32be
- convert up to the first 1024 bytes to a string using utf-8, ignoring errors, to create a text snippet
- use regular expressions on the snippet to look for encoding or charset attributes, with or without delimiters, to extract the encoding name, and use that codec to convert the file
- if all else fails, quick-parse the entire file as utf-8 and, if there are no errors, use utf-8
- finally, just use the local encoding
Hope this helps,
KevinH |
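The steps listed above can be sketched in Python roughly as follows. This is an illustration of the described order of checks, not a port of Sigil's C++ code: the function name, the regular expression and the returned codec names are assumptions.

```python
import codecs
import locale
import re

# UTF-32 LE must be tested before UTF-16 LE because its BOM starts with FF FE too.
# The generic "utf-16"/"utf-32" codecs read the BOM themselves to pick the byte order.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Hypothetical pattern: encoding="..." (xml declaration) or charset=... (meta tag).
DECL_RE = re.compile(
    r'''(?:encoding|charset)\s*=\s*["']?\s*([A-Za-z0-9._:-]+)''', re.IGNORECASE
)


def resolve_encoding(raw):
    """Guess the encoding of html bytes using the order of checks listed above."""
    # 1. Byte order marks.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    # 2. Declared encoding in the first 1024 bytes.
    snippet = raw[:1024].decode("utf-8", errors="ignore")
    match = DECL_RE.search(snippet)
    if match:
        try:
            return codecs.lookup(match.group(1)).name
        except LookupError:
            pass  # unknown name: fall through to the next check
    # 3. Does the whole file parse as utf-8?
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 4. Last resort: the local encoding.
        return locale.getpreferredencoding()
```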
01-13-2017, 11:45 PM | #40 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@KevinH...Altering the zip file name encoding to utf-8 using zipinfo flags might work on Linux and Mac but apparently, according to the Python issue tracker (Python Issue27344), it still doesn't work on Windows.
I also found out that the -U switch will work on the command-line version of PKZip or WinZip to add utf-8 file names to the zip file. But no apparent capability for this exists yet on Windows because this issue is still open for resolution. Very frustrating. Last edited by slowsmile; 01-13-2017 at 11:52 PM. |
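For reference, the flag in question is bit 11 (0x800) of the zip general purpose flags, which marks a file name as utf-8. The sketch below, with hypothetical helper names, writes an archive in memory with Python 3's zipfile module and reads the flag back. It shows only what Python itself writes and reads; it says nothing about how other zip tools on Windows treat the name.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: file name is stored as utf-8


def zip_with_name(name, data):
    """Build a one-entry zip archive in memory and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, data)
    return buf.getvalue()


def read_entries(zip_bytes):
    """Return (file name, utf-8 flag set?) for every entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            (info.filename, bool(info.flag_bits & UTF8_NAME_FLAG))
            for info in zf.infolist()
        ]


print(read_entries(zip_with_name("Rozdzia\u01421.xhtml", b"<html/>")))
```

Python 3's zipfile sets the flag by itself whenever a name cannot be encoded as ASCII, so no manual flag handling is needed in this sketch.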
01-14-2017, 02:40 AM | #41 |
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
@slowsmile: At first glance your code looks OK to me, but then again I'm not a programmer and might have missed something.
It might actually be a user error. For example, you can generate HTML files with LibreOffice via Export or via Save as, and the two produce slightly different output.
@bravosx: If you haven't already sent test files via e-mail to slowsmile, please select a small Polish Public Domain text with a couple of headings, process it with LibreOffice and the latest version of the plugin (0.2.7) as usual and attach the .odt file, the exported/saved .html file and the raw epub file generated by the plugin. Also indicate whether you used Save as or Export to generate the HTML file and give some specific examples of code page conversion issues (e.g., whether they affect only file names in the Book Browser, TOC entries, metadata information or HTML content files). |
01-14-2017, 02:46 AM | #42 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
bravosx...Your Polish book should convert correctly with the new v0.2.7 (just released). It worked fine when I tested it with Doitsu's Polish html ebook. The only problem, which I can do nothing about, is the file names in the Sigil Book Browser - they might look a little strange. But you can always rename them to your liking, and the book contents should now be OK.
|
01-14-2017, 06:56 AM | #43 |
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
|
@slowsmile...
I used LibreOffice to convert the file from the link that Doitsu posted in post #16, saving it as .odt. That was the starting file for the further conversions using Save As>Document HTML (Writer)(.html), with these additional character sets set under Tools>Options>Load/Save>HTML Compatibility:
- Western European (Windows-1252/WinLatin1)
- Central European (Windows-1250/WinLatin2)
- Unicode (UTF-8)
In all cases the Polish letters in the book now show up correctly. The only problem is the file contents.xhtml (TOC), where the Polish letters disappear completely.
@Doitsu I have prepared all the files you requested and can send them, but I do not know how to do it. Regards bravosx Last edited by bravosx; 01-14-2017 at 07:14 AM. |
01-14-2017, 07:02 AM | #44 | |
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
EDIT: I looked at the files and can confirm bravosx's findings. Apparently the routine to generate safe file names killed Polish letters such as the stroked L in Rozdział. Last edited by Doitsu; 01-14-2017 at 07:38 AM. |
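The plugin's safe-file-name routine is not shown in this thread, so the following is only an illustration of how such a routine can lose letters like ł, and of one Unicode-aware alternative. Both function names are hypothetical.

```python
import re


def ascii_only_name(name):
    """Lossy approach: drop every character outside a small ASCII set."""
    return re.sub(r"[^A-Za-z0-9._-]", "", name)


def unicode_safe_name(name):
    """Keep Unicode letters and digits; replace anything else with an underscore."""
    return "".join(ch if ch.isalnum() or ch in "._-" else "_" for ch in name)


print(ascii_only_name("Rozdzia\u0142 1.xhtml"))    # the stroked L is dropped
print(unicode_safe_name("Rozdzia\u0142 1.xhtml"))  # the stroked L is kept
```

Stripping accents via Unicode normalisation would not help here either, because ł has no decomposition into a plain l plus a combining mark.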
|
01-14-2017, 09:46 AM | #45 | |
Sigil Developer
Posts: 7,506
Karma: 5433350
Join Date: Nov 2009
Device: many
|
more on zip filename encoding
@slowsmile
Given that the plugin can and should be using the Python zip module to create the epub, and Sigil uses its own zip module to handle things, the fact that the Windows built-in zip utility is broken should not matter. FWIW, you should also make sure that any hrefs used in the content.opf or in links throughout the document are properly URL-encoded to preserve any non-ASCII chars used.
Hope this helps,
KevinH |
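A minimal sketch of URL-encoding an href with the standard library, assuming the path is relative and uses forward slashes; the helper name href_for is hypothetical. urllib.parse.quote() percent-encodes the utf-8 bytes of non-ASCII characters and leaves "/" alone by default.

```python
from urllib.parse import quote, unquote


def href_for(path):
    """Percent-encode a relative path for use in an href attribute."""
    return quote(path)  # "/" is kept as-is; non-ASCII chars become %XX utf-8 bytes


encoded = href_for("Text/Rozdzia\u01421.xhtml")
print(encoded)           # Text/Rozdzia%C5%821.xhtml
print(unquote(encoded))  # round-trips to the original name
```

A fragment such as "#chapter1" would need to be appended after encoding, since quote() would otherwise encode the "#" as well.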