[Plugin] OpenDocHTMLImport - Full ODF HTML(Writer) conversion to epub - Page 2

Doitsu · 01-12-2017, 12:46 PM

Quote:

Originally Posted by bravosx

Unfortunately, this I do not know. I am a retired mechanical engineer, not a computer programmer.

For testing purposes I converted this Public Domain Polish translation of a Grimm Brothers short story with LO to an HTML file and it imported fine.

Try to import the attached html file with the plugin and enter the following metadata:

Book title: Mądra Elżbieta
Author: Bracia Grimm; Elwira Korotyńska
Publisher: Wydawnictwo Księgarni Popularnej

Then check the metadata entries after the import. If the diacritical characters don't survive the import, open content.opf, copy the complete metadata section, mark all corrupted characters and post it here.

If you can import my test file without problems, maybe your source file is incorrectly encoded.

slowsmile · 01-12-2017, 06:59 PM

@KevinH...In my plugin, I've used Nick Coghlan's advised method (see section: Files in an ASCII compatible encoding, best effort is acceptable). I use a function which initially opens the user html file as a text file in latin-1 using the 'surrogateescape' error handler. This allows me to later fix simple mixed encoding errors caused by windows-1252 or latin-1 characters that are beyond the ASCII range.

In the function, the html file is then read as a text file using BeautifulSoup and copied straight back out again to replace the original html file in the working directory. Doing this inherently converts the html file to unicode utf-8. I basically rely on BeautifulSoup's in-built and automatic encoding detection using UnicodeDammit to detect and change the html file encoding to unicode utf-8. I've also checked the encoding directly after running this function and the file is always in utf-8. I've also checked the html file after complete conversion and it is always in utf-8.

I confess that I never really considered html text with different languages being used in my plugin. So I'm definitely willing to learn more about this for sure.

slowsmile · 01-12-2017, 07:10 PM

@KevinH...In my plugin, I've used Nick Coghlan's advised method (see section: Files in an ASCII compatible encoding, best effort is acceptable). I use a function which initially opens the user html file as a text file in latin-1 using the 'surrogateescape' error handler. This allows me to later fix simple mixed encoding errors in utf-8 caused by windows-1252 or latin-1 characters that are beyond the ASCII range.

In the function, the html file is then read as a text file using BeautifulSoup and copied straight back out again to replace the original html file in the working directory. Doing this inherently converts the html file to unicode utf-8. I basically rely on BeautifulSoup's in-built and automatic encoding detection using UnicodeDammit to detect and change the html file encoding to unicode utf-8. I've also checked the encoding directly after running this function and the file is always in utf-8. I've also checked the html file after complete conversion and it is always in utf-8.

In the past I've also tried using chardet with codecs to detect file encoding as you have advised. But I have found this method to be consistently poor and inaccurate.

I also must confess that I never really considered html text with different languages being used in my plugin. So I'm definitely willing to learn more about this for sure.
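As a side note on the latin-1 step described above, here is a minimal sketch (not the plugin's actual code; the title string is just the test metadata from this thread) of what happens when UTF-8 bytes are decoded as latin-1. Because latin-1 assigns a character to every possible byte value, the decode never fails, so the 'surrogateescape' handler never gets a chance to act and the Polish letters silently turn into two-character mojibake.

```python
# Sketch only: shows the general behaviour of latin-1 decoding,
# not the code used by the plugin.
title = "Mądra Elżbieta"
raw = title.encode("utf-8")  # what a UTF-8 HTML file holds on disk

# latin-1 maps every byte 0x00-0xFF to a code point, so this never raises
# and the surrogateescape handler is never invoked.
as_latin1 = raw.decode("latin-1", errors="surrogateescape")

# Each non-ASCII letter became two unrelated characters (one per byte).
print(len(title), len(raw), len(as_latin1))

# The damage is reversible only if the exact same wrong codec is undone.
recovered = as_latin1.encode("latin-1").decode("utf-8")
print(recovered == title)
```

The text is only recoverable as long as nothing in between re-encodes or "fixes" the mis-decoded characters.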

slowsmile · 01-12-2017, 07:23 PM

@KevinH - I've also just changed my plugin to put the locale language in the XMLNS in the html.

When you open a file in Sigil using the Mend HTML Files on Open preferences option, does Sigil interrogate the HTML XMLNS xml:lang attribute to set the displayed HTML text language?

slowsmile · 01-12-2017, 07:34 PM

@Doitsu...Thanks for your help. I'll try your suggestion and let you know the outcome.

slowsmile · 01-12-2017, 07:50 PM

@Doitsu & @KevinH...Just tried Doitsu's Polish HTML LibreOffice file in my plugin and it displays Polish characters without a problem in Sigil Text View. So my conversion is correctly converting to unicode utf-8. You can also verify this for yourself because I've used the current version of my plugin(v0.2.5) to test the Polish file.

I should also add that I've only just put in a fix(which is in v0.2.5) whereby I now obtain the locale language and insert that into the xml:lang attribute in the XMLNS for every html file in the epub. This might have something to do with why my plugin now works for different language charsets.

If my fix has cured the character set problem then does that also mean that the Sigil app interrogates the XMLNS xml:lang attribute to ascertain the correct language charset whenever you open an epub or html file in Sigil?

UPDATE: I've just checked the Polish epub more thoroughly again and found errors in the contents.xhtml(no TOC), content.opf and toc.ncx files -- that's probably because I did not insert the xml:lang in the XMLNS for those files. I also need to add the correct dc:language to the metadata as well. I think that these errors also more or less confirm that the Sigil app does indeed interrogate the xml:lang attribute. I'll try and put in a fix to confirm this later.

slowsmile · 01-12-2017, 09:06 PM

@KevinH & Doitsu...Just finished testing my plugin with the Polish file. It seems that I was wrong. The xml:lang attribute has nothing to do with setting the language. So I guess the fact that my file was in unicode utf-8 was good enough.

Notably, the Polish headings in Sigil's Book Browser display incorrectly, but there's nothing I can really do about that, perhaps due to severe ASCII constraints on zip file names (uses DOS Latin US on Windows). The Polish user will just have to rename his files in the Book Browser as he prefers.

KevinH · 01-12-2017, 10:44 PM

@slowsmile
zip can and does support utf-8 encoding of text files, either as binary or by setting a flag in the info field for each entry. Python's zip library handles that just fine. What are you using to zip up your epubs with?
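A small sketch of the per-entry flag in practice, assuming Python 3's standard zipfile module and a made-up entry name: when an entry name is not pure ASCII, zipfile stores the name as UTF-8 and sets general purpose bit 11 (0x800) for that entry, which can be checked after reading the archive back.

```python
import io
import zipfile

# Hypothetical entry name containing Polish characters.
name = "OEBPS/Text/Mądra_Elżbieta.xhtml"

# Write a one-entry archive into memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(name, "<html/>")

# Read it back and inspect the entry's general purpose flags.
with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]

# Bit 11 (0x800) marks the entry name as UTF-8 encoded.
utf8_flag_set = bool(info.flag_bits & 0x800)
print(info.filename, utf8_flag_set)
```

Whether other zip tools honour that flag when writing or reading names is a separate question and depends on the tool and its version.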

As for encoding detection on import in Sigil, I will look at the code to see how that is handled. That said, afaik, the only correct way to handle an html file in python with unknown encoding is to read it in as bytes and not convert it to a string until you have searched the bytes for encoding info in the metadata of the html file, or looked for patterns in the bytes such as byte order marks and specific byte sequences that rule out one encoding or another. This is what Unicode Dammit does (although not well), as do libraries like chardet and cchardet. Reading everything in as extended ascii and escaping any encoding errors is really not a sound strategy as far as I can tell.
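A rough sketch of the byte-level checks described here, using only the standard library. The helper names sniff_bom and could_be_utf8 are made up for illustration, and this is not how Sigil or Unicode Dammit do it: the first function looks for a byte order mark, the second uses a strict decode to rule UTF-8 in or out.

```python
import codecs

# Order matters: the UTF-32 LE BOM starts with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data: bytes):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None

def could_be_utf8(data: bytes) -> bool:
    """A strict decode fails on byte sequences that are impossible in UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

sample = "Mądra Elżbieta".encode("cp1250")  # Central European single-byte text
print(sniff_bom(sample), could_be_utf8(sample))
```

The codec names returned for a BOM ("utf-8-sig", "utf-16", "utf-32") are the Python codecs that consume the BOM while decoding, so the mark does not end up in the text.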

KevinH

KevinH · 01-12-2017, 10:56 PM

@bravosx

When LibreOffice exports an html file, does it properly set the charset encoding in the text file?

In other words, if you look at the html before importing it, do you see the following line anywhere in the head tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Obviously with potentially other names for the encoding used. If so, what does it say for your problem file before it gets imported into Sigil directly or via this plugin?

KevinH

slowsmile · 01-12-2017, 11:52 PM

On my Windows 8 system, both LibreOffice and OpenOffice export to HTML with the following meta header:

<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1252">

But quite how LO and OO export and encode to HTML on other OS platforms like Linux and OSX is unknown to me. I'm guessing that they probably export as utf-8 on Linux and utf-16 on OSX, but I'm not really sure. When I researched UnicodeDammit I found that it would identify windows-1252, latin-1, ISO/IEC 8859-2 and utf-8 without much trouble. And while researching UnicodeDammit from bs4 I found out that it also uses the chardet and cchardet modules as well as the codecs module in its routines.

I also take your point about checking for other weird encodings besides the ones that I've mentioned already. Will look into that and try to implement a fix soon.

I would also completely agree with you about zip supporting utf-8 for file contents. But I was really talking about zip file names. For zip file names I think you'll find that only the DOS Latin US charset is allowed.

I'm mainly using 7-Zip and WinRar for the zip files.

bravosx · 01-13-2017, 04:44 AM

@Doitsu... I have loaded the file you prepared, 'polish.zip', into Sigil and the Polish characters are displayed correctly.
I'll try to download the same text from the specified location, save it as .html and import it via the plugin to see if it displays Polish characters correctly. I'll report the results.

The result is the same as before:
- Text saved as HTML with the character set Unicode (UTF-8) and imported using the plugin displays Polish characters incorrectly.
- Text saved as HTML with the character set Western European (Windows-1252 / WinLatin 1) and imported using the plugin displays Polish characters correctly.

bravosx

bravosx · 01-13-2017, 06:04 AM

Quote:

Originally Posted by KevinH

@bravosx

When LibreOffice exports an html file, does it properly set the charset encoding in the text file?

In other words, if you look at the html before importing it, do you see the following line anywhere in the head tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Obviously with potentially other names for the encoding used. If so, what does it say for your problem file before it gets imported into Sigil directly or via this plugin?

KevinH

@KevinH
If, in LibreOffice, you go to Tools > Options > Load/Save > HTML Compatibility, select Unicode (UTF-8) in the Character set dropdown and save, this line looks like this:

<meta http-equiv="content-type" content="text/html; charset=utf-8"/>

However, if the option is changed to Western European (Windows-1252 / WinLatin 1), the same line looks like this:

<meta http-equiv="content-type" content="text/html; charset=windows-1252"/>

and then the Polish characters are displayed properly.

bravosx

KevinH · 01-13-2017, 07:58 AM

@bravosx
Both versions should load and view properly using File->Open in Sigil and look identical once loaded. Please confirm that.

bravosx · 01-13-2017, 08:44 AM

Quote:

Originally Posted by KevinH

@bravosx
Both versions should load and view properly using File->Open in Sigil and look identical once loaded. Please confirm that.

Yes, both versions of the same file opened directly in Sigil display Polish characters correctly.

bravosx

KevinH · 01-13-2017, 09:29 AM

@bravosx,
Thank you. The bug is therefore in how the plugin determines and handles the encoding. It seems to only work properly with Win1252.

@slowsmile - please do revamp your plugin to properly handle encodings if provided in the meta element of the html file, by reading it in binary (getting bytes) and using re (on bytes) to check for a charset specifier. If one is found, try using that encoding to decode the bytes to a python3 str, and when outputting encode it as utf-8 (after removing any now incorrect charset specifiers). Alternatively, try using cchardet or chardet to detect and/or confirm your encoding guess. It seems your approach of always reading a file in extended ascii with an error handler set to escape errors does not work, as I suspected.
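The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the plugin's code: the function name html_bytes_to_utf8, the 4096-byte search window and the utf-8 fallback are all made up for the example, and a chardet/cchardet guess could be slotted in where the fallback is used.

```python
import re

# Matches both <meta charset="..."> and the http-equiv form with
# content="text/html; charset=..." (searched on raw bytes).
_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)

def html_bytes_to_utf8(data: bytes, fallback: str = "utf-8") -> bytes:
    """Decode HTML bytes using the declared charset and return UTF-8 bytes."""
    # Assumption: the declaration sits near the top of the file.
    match = _CHARSET_RE.search(data[:4096])
    declared = match.group(1).decode("ascii") if match else fallback
    try:
        text = data.decode(declared)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong declaration: fall back, keep going.
        text = data.decode(fallback, errors="replace")
    # The old specifier is now wrong, so point it at utf-8 instead.
    text = re.sub(
        r"""(<meta[^>]+charset\s*=\s*["']?)[A-Za-z0-9_\-]+""",
        r"\g<1>utf-8",
        text,
        count=1,
        flags=re.IGNORECASE,
    )
    return text.encode("utf-8")

src = (
    '<html><head><meta http-equiv="content-type" '
    'content="text/html; charset=windows-1250"/></head>'
    "<body>Mądra Elżbieta</body></html>"
).encode("cp1250")
out = html_bytes_to_utf8(src)
print(out.decode("utf-8"))
```

Reading with open(path, "rb") and writing the result back with open(path, "wb") keeps Python's own text-mode decoding out of the picture entirely.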

Hope this helps,

KevinH
