MobileRead Forums - View Single Post - [Plugin] OpenDocHTMLImport

KevinH · 01-12-2017, 11:44 PM

@slowsmile
zip can and does support utf-8 encoded text files, either stored as plain binary data or by setting a flag in the info field for each entry. Python's zipfile library handles that just fine. What are you using to zip up your epubs?
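For what it's worth, here is a minimal sketch of that with Python's standard `zipfile` module. The helper name `roundtrip_utf8_entry` and the sample entry name are made up for illustration. As far as I know, the flag (general purpose bit 11, `0x800`) marks the entry *name* as utf-8, while the file contents are simply stored as whatever bytes you hand over.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: entry name is utf-8


def roundtrip_utf8_entry(name, text):
    """Store text as utf-8 bytes in an in-memory zip, then read it back.

    Returns the decoded text and whether the utf-8 name flag was set.
    (Hypothetical helper, for illustration only.)
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Contents are written as raw bytes; zip does not care about encoding.
        zf.writestr(name, text.encode("utf-8"))
    with zipfile.ZipFile(buf) as zf:
        info = zf.getinfo(name)
        data = zf.read(name)
    return data.decode("utf-8"), bool(info.flag_bits & UTF8_NAME_FLAG)


if __name__ == "__main__":
    text, flagged = roundtrip_utf8_entry("OEBPS/Text/café.xhtml", "<p>naïve</p>")
    print(text, flagged)
```

If an entry name is pure ascii, `zipfile` leaves the flag unset, which is harmless since ascii is a subset of utf-8.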

As for encoding detection on import in Sigil, I will look at the code to see how that is handled. That said, afaik, the only correct way to handle an html file in python with unknown encoding is to read it in as bytes and not convert it to a string until you have either searched the bytes for encoding info in the metadata of the html file, or looked in the bytes for byte order marks and specific byte sequences that rule out one encoding or another. This is what Unicode Dammit does (although not well), as do libraries like chardet and cchardet. Reading everything in as extended ascii and escaping anything outside it is really not a sound strategy as far as I can tell.
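A rough sketch of that bytes-first approach, using only the standard library. This is not how Sigil or Unicode Dammit actually do it; the names `sniff_encoding` and `decode_html`, the 4096-byte search window, and the cp1252 fallback are all my own assumptions for illustration.

```python
import codecs
import re

# Check utf-32 before utf-16: the utf-32-le BOM starts with the utf-16-le BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches <meta charset=...>, content="...; charset=..." and the
# encoding="..." attribute of an xml declaration. Deliberately loose.
DECL_RE = re.compile(
    rb"""(?:charset|encoding)\s*=\s*["']?\s*([A-Za-z0-9][A-Za-z0-9._:-]*)""",
    re.IGNORECASE,
)


def sniff_encoding(raw, fallback="cp1252"):
    """Guess the encoding of raw html bytes without decoding them first."""
    # 1. A byte order mark is the strongest hint.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    # 2. Look for a declared encoding near the top of the file.
    match = DECL_RE.search(raw[:4096])
    if match:
        name = match.group(1).decode("ascii")
        try:
            codecs.lookup(name)  # only accept names python knows
            return name
        except LookupError:
            pass
    # 3. Invalid utf-8 byte sequences rule utf-8 out.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback  # assumed fallback; a real detector would do more


def decode_html(raw):
    """Decode raw bytes to str only after the encoding has been sniffed."""
    encoding = sniff_encoding(raw)
    return raw.decode(encoding, errors="replace"), encoding


if __name__ == "__main__":
    print(decode_html(b'<meta charset="iso-8859-1"><p>\xe9</p>'))
```

A declared encoding can of course be wrong, so a more careful version would verify that the bytes actually decode under it before trusting it.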

KevinH

01-12-2017, 11:44 PM	#23
KevinH Sigil Developer Posts: 9,409 Karma: 6733754 Join Date: Nov 2009 Device: many