@KevinH...I've now made the encoding checks much stricter, more accurate and broader, as you advised. I now do a double check of the encoding: first I read the charset from the html meta tag and compare it against a list of known encodings. Then the file is read in again as raw bytes and checked with chardet. I then compare chardet's result with the meta tag encoding - if they agree, I use it, but if chardet's result disagrees with the meta tag then I always prefer the meta tag encoding. This seems to work pretty well.
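For anyone following along, here's a rough sketch of the double-check logic described above (the function name and regex are mine, not the actual plugin code; the chardet import is made optional so it degrades to the meta tag alone):

```python
import re

try:
    import chardet  # third-party (pip install chardet); optional here
except ImportError:
    chardet = None

def detect_encoding(path):
    """Read the meta tag charset, cross-check with chardet,
    and prefer the meta tag when the two disagree."""
    with open(path, "rb") as f:
        raw = f.read()
    # 1. Look for the charset in the html meta tag (covers both
    #    <meta charset="..."> and the older http-equiv form).
    m = re.search(rb'<meta[^>]+charset=["\']?([-\w]+)', raw, re.I)
    meta_enc = m.group(1).decode("ascii").lower() if m else None
    # 2. Run chardet over the raw bytes, if it's available.
    chardet_enc = None
    if chardet is not None:
        guess = chardet.detect(raw).get("encoding")
        chardet_enc = guess.lower() if guess else None
    # 3. If both agree, use that; otherwise trust the meta tag.
    if meta_enc and chardet_enc and meta_enc == chardet_enc:
        return meta_enc
    return meta_enc or chardet_enc
```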
For the meta tag encoding, I check against about 20 different encodings - Western European, Baltic, Slavic, Cyrillic, US etc. Generally I've used most of the Windows code pages and iso-8859 code pages for the meta tag comparisons. I had to do this because chardet's results on their own were so awful and inaccurate. Might try UnicodeDammit tomorrow to see if it gives better results. And once the proper encoding is found, I can safely convert the html file to utf-8 as required.
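The final utf-8 conversion step could look something like this (again just a sketch, not the plugin's actual code - note it also rewrites the meta tag so the declared charset matches the new bytes):

```python
import re

def to_utf8(path, encoding):
    """Decode the file with the detected encoding, then
    rewrite it in place as utf-8."""
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    # Update the declared charset so it agrees with the new
    # encoding (simplified; a real plugin would use a parser).
    text = re.sub(r'(charset=["\']?)[-\w]+', r'\g<1>utf-8',
                  text, flags=re.I)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```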
But I have to say that I'm really pleased with the overall results. I have one html file that always produced masses of mixed-encoding errors in the epub after plugin conversion. I ran that file through the plugin with the new encoding checker function and it flew through, loaded into Sigil with no encoding errors, and passed EpubCheck first go directly after conversion. Couldn't believe it. Really glad I took your advice because it's made such a heck of a difference to the conversions.
Still testing the new encoding checker function, no problems so far. Will try testing a UnicodeDammit/detwingle version tomorrow and let you know how it turns out.
My thanks also for all your advice above. I will try solving the file name problem later.
Last edited by slowsmile; 01-13-2017 at 11:29 AM.