[Plugin] OpenDocHTMLImport - Full ODF HTML(Writer) conversion to epub - Page 4

slowsmile · 01-14-2017, 10:19 AM

bravosx...Thank you for sending those files, which are a great help.

After looking at your files, here are the conclusions I arrived at -- and to keep it short, I'm only going to describe what I found in your Bracia Grim_Cp-1250Latin2_Save as.html file:

* Your ebook only consists of 5 pages. My plugin is really meant for 200-300 page ebooks with or without images.

* A lot of your files for conversion had '.xhtml' extensions. They should all have '.html' extensions when exported from LibreOffice as html. Please don't use files with '.xhtml' extensions with the plugin.

* When I looked at your Bracia Grim ODT file, under Options > Load/Save > HTML Compatibility the character set had been set to "Big5" for some unknown reason. In fact, some of the html files that you sent me also had "Big5" set as their html character set. Why are you using a traditional Chinese character set for your Polish html file? In your LibreOffice application, please be sure to set your charset back to Windows-1250/Latin2, which is the correct character set for the Polish language.

* On conversion to epub with my plugin both the title and the heading were correctly found and the epub contents.xhtml file was populated with just one toc item which is correct behaviour for the plugin with your 5 page ebook with one heading.

* The xhtml files that you sent me had no DOCTYPE and no XMLNS headers -- both were missing. That means that they will fail even when you try to view them in the Chrome browser. Never use xhtml files in the plugin -- they have a completely different layout and format compared to '.html' files.

* Despite the above charset and xhtml problems (which I didn't change or alter), when I converted your Bracia Grim_Cp_1250Latin2_Save as.html file to epub using my plugin, the correct charset -- Windows-1250/Latin2 -- was found by the plugin, and the epub file also used the correct Polish charset with all ligatures and glyphs present and showing in Text View in Sigil. This file also passed EpubCheck first time, and when I converted it to Kindle using Kindle Previewer 3.7 it converted without any problems and the Kindle version displayed properly in Polish.
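The charset point in the list above can be checked outside LibreOffice. The following is a minimal Python sketch, not the plugin's code: it assumes only Python's standard `cp1250` and `latin-1` codecs, and the sample string is the chapter heading that comes up later in this thread. It shows that the same saved bytes read correctly only when they are decoded with the charset they were written in.

```python
# Illustration only: the same bytes read under two different declared charsets.
text = "Rozdział 1"

raw = text.encode("cp1250")     # bytes as saved with Windows-1250
right = raw.decode("cp1250")    # decoded with the matching charset
wrong = raw.decode("latin-1")   # decoded with a mismatched charset

print(right)           # the original heading
print(wrong)           # the stroked l comes back as a different character
print(right == text, wrong == text)
```

A wrong charset declaration in the html header causes the second kind of result, which is why the declared charset should match the one the file was actually saved in.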

As proof of the above I've sent you both the epub and Kindle mobi version of your unchanged Bracia Grim_Cp-1250Latin2_Save as.epub file. Please note that the Kindle version of your ebook also seems to display the Polish text correctly and so must also be using the correct windows-1250/latin2 charset.

See attachments below.

Doitsu · 01-14-2017, 11:39 AM

@slowsmile: It might be a bit difficult to spot at first glance if you don't know what to look for, but even in the epub that you generated, the entries in the NCX TOC are missing Polish national characters.

For example, the chapter title of the first chapter is Rozdział 1. (Note the stroked L before the 1.) However this special character is missing in TOC.NCX:

Code:

    <navPoint id="navPoint-3" playOrder="3">
      <navLabel>
        <text>Rozdzia 1</text>
      </navLabel>
      <content src="Text/rozdzia_1.xhtml"/>
    </navPoint>

It should read:

Code:

    <navPoint id="navPoint-3" playOrder="3">
      <navLabel>
      <text>Rozdział 1</text>
      </navLabel>
      <content src="Text/rozdzia_1.xhtml"/>
    </navPoint>

I.e., there's a bug in the TOC generation code.

@bravosx: As a temporary fix, you could simply regenerate the TOC via CTRL+T. This should restore the missing characters since the actual headings contain them.
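The thread does not show the plugin's TOC code, so the actual cause is not known. One common way a letter such as ł goes missing, sketched here purely as an assumption, is a "strip to ASCII" step: ł has no Unicode decomposition, so normalizing and dropping non-ASCII characters removes it entirely. The helper names and the fallback table below are hypothetical.

```python
import unicodedata

def strip_to_ascii(text):
    """Lossy: decompose accents, then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Hypothetical fallback table for letters that have no Unicode decomposition.
FALLBACK = str.maketrans({"ł": "l", "Ł": "L"})

def file_stem(title):
    """ASCII-only stem for a file name; the visible TOC label keeps the title as-is."""
    ascii_title = strip_to_ascii(title.translate(FALLBACK))
    return ascii_title.lower().replace(" ", "_")

title = "Rozdział 1"
print(strip_to_ascii(title))   # the stroked l is dropped, as in the NCX entry above
print(file_stem(title))        # an ASCII file name that keeps a plain l
print(title)                   # the label text itself should stay untouched
```

The design point is to apply any ASCII conversion to file names only and to write the original heading text into the navLabel.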

bravosx · 01-14-2017, 12:31 PM

Quote:

@bravosx: As a temporary fix, you could simply regenerate the TOC via CTRL+T. This should restore the missing characters since the actual headings contain them.

OK, this way it is actually possible to fix the TOC, and it seems to be fairly easy. Thanks for the tip.

The remaining problem is the text contained between <title> and </title>; for example, the Polish character is missing from the word Rozdział:

Code:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xml:lang="pl-PL" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Rozdzia_1</title>
  <link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css"/>
</head>

Thanks, bravosx

st_albert · 01-14-2017, 01:45 PM

Just for a little change of pace, here's a problem I'm having.

I started with a LibreOffice .odt file containing a novel I'm working on. It has several custom styles, which all "inherit from" Text Body.

LO version is 5.1.4.2 on Kubuntu 16.04.1
plugin version is 0.2.7
Sigil version is 0.9.7

The writer HTML file was created via "save as" and selecting HTML document (writer) as the format.

The plugin runs without error, and correctly imports the cover image, builds a correct toc.ncx and a correct HTML contents file, but no text files are created after the frontmatter. That is to say, the frontmatter (title, copyright, and dedication) is included, but nothing is included from the first h1 tag on. The TOC refers to files like "../Text/chapter_one.xhtml" and so on, but those files are not present.

Must be something I'm doing wrong, or someone surely would have mentioned it before now.

Here's the import log, in case it is helpful:

Spoiler:

Thanks for any pointers.

Albert

Edited to add: Just tried it on another machine with LO 5.2.3.3, and other software versions same as stated above, and got the same result.

Doitsu · 01-14-2017, 03:41 PM

Quote:

Originally Posted by bravosx

The remaining problem is the text contained between <title> and </title>; for example, the Polish character is missing from the word Rozdział:

The <title>...</title> tag isn't used by epub apps. I.e., it can be empty or contain random characters. It only needs to be included for backwards compatibility with older apps; you also get the following epubcheck error, if it isn't included:

Code:

ERROR(RSC-005): Error while parsing file 'element "head" incomplete; missing required element "title"'.
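A quick way to check a file for this before running epubcheck is to look for a title element inside head. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `has_title` is made up for illustration, and the sample documents omit the XML declaration and DOCTYPE for brevity.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def has_title(xhtml_text):
    """Return True if <head> contains a <title> element (its text may be empty)."""
    root = ET.fromstring(xhtml_text)
    head = root.find(XHTML_NS + "head")
    return head is not None and head.find(XHTML_NS + "title") is not None

with_title = ('<html xmlns="http://www.w3.org/1999/xhtml">'
              '<head><title>Rozdzia_1</title></head><body/></html>')
without_title = ('<html xmlns="http://www.w3.org/1999/xhtml">'
                 '<head></head><body/></html>')

print(has_title(with_title))
print(has_title(without_title))
```

The check only looks at whether the element is present, since its content does not matter to reading apps.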

Quote:

Originally Posted by st_albert

The plugin runs without error, and correctly imports the cover image, builds a correct toc.ncx and a correct HTML contents file, but no text files are created after the frontmatter.

When I ran tests I also encountered similar problems with some files, but I chalked it up to LibreOffice/OS compatibility problems, and since I don't really need this plugin, I didn't investigate this further.

bravosx · 01-14-2017, 04:16 PM

@Doitsu...

Quote:

Originally Posted by Doitsu

The <title>...</title> tag isn't used by epub apps. I.e., it can be empty or contain random characters. It only needs to be included for backwards compatibility with older apps; you also get the following epubcheck error, if it isn't included:

Code:

ERROR(RSC-005): Error while parsing file 'element "head" incomplete; missing required element "title"'.

OK. Thanks for the explanation of the problem.

Regards bravosx

slowsmile · 01-14-2017, 06:41 PM

@Doitsu...You are correct in saying that both the TOC names and file names in the Book Browser are not displaying with Polish characters. I derive these file names directly from the ebook file as utf-8. But unfortunately -- due to Python issue 27344 -- zip file names on Windows can only use DOS Latin and not utf8.

And since I derive both the xhtml file names and content.xhtml toc items in the same way, I am unable to fix these problems at the moment. Regarding the content.xhtml toc names, I'm still looking into how I can generate the Polish names for the toc. One way to solve both the toc and file names problem would perhaps be -- as KevinH has suggested -- to generate the epub zip file by simply using the epub_zip_up_book_contents(ebook_path, epub_filepath) function in Sigil's plugin utilities. This will be a major change requiring much testing, so I think I'll do that after the plugin has settled a bit regarding other more minor errors.

slowsmile · 01-14-2017, 07:04 PM

This is a Duplicate.

DiapDealer · 01-14-2017, 07:34 PM

Quote:

Originally Posted by slowsmile

But unfortunately -- due to Python issue 27344 -- zip file names on Windows can only use DOS Latin and not utf8.

As KevinH pointed out above, this is simply not true. The bug you're pointing to is a documentation issue only. Python's zipfile module from 2.7 on is perfectly capable of handling utf-8 filenames on Windows -- as are Sigil's internal (un)zip routines. Winzip and PKZip are the programs that are limited to DOS Latin file names on Windows. So unless you're telling us that you're using Winzip or PKZip as part of your plugin instead of Python's zipfile module (and I really hope you're not), the Python documentation issue you're pointing to just isn't relevant here.
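The claim about Python's zipfile module is easy to verify. This is a minimal sketch, assuming Python 3 (the interpreter reported in the import log earlier in the thread), that writes one entry with a Polish name into an in-memory archive and reads it back. The constant name `UTF8_FLAG` is just a local label for general purpose bit 11 of the zip format, which marks a file name as UTF-8.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is stored as UTF-8

name = "Text/Rozdział_1.xhtml"
buffer = io.BytesIO()

# Write one entry whose name contains a non-ASCII character.
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(name, "<html/>")

# Read the archive back and inspect the stored entry.
with zipfile.ZipFile(buffer) as archive:
    info = archive.infolist()[0]

print(info.filename == name)              # the name round-trips unchanged
print(bool(info.flag_bits & UTF8_FLAG))   # the UTF-8 flag was set automatically
```

No manual handling of flag bits is needed: zipfile sets the flag itself whenever a name cannot be encoded as ASCII.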

slowsmile · 01-14-2017, 07:42 PM

@st_albert...On conversion to epub, the individual chapters are selected and a file split occurs in the html file if they are correctly formatted and styled with the "Heading 1" style (h1) in OO or LO. Your h1 headers can either be directly styled with h1, or your named style can be linked with h1 to be selected. Your "Heading 1" style in OO or LO should always be linked with the "Heading" style in the Styles Organizer. Do not link the h1 style with "Default", "Text Body" or any other style in OO or LO.

If that doesn't cure your problem, could you please attach your html file and the equivalent ODT file in your next post, showing just the problem area -- consisting of just one complete chapter with heading -- so I can have a look at the formatting?

I suspect that your h1 style has been set up in the wrong way in OO or LO regarding inheritance.

I should also mention that you should be linking all your own named text styles (text styles only, not heading styles) with the "Text Body" style. Doing this will ensure that all your own named text styles or classes will appear in the epub html in Sigil, and it prevents your inline text styles from automatically being converted to meaningless prefixed/indexed style names like ebk-3, ebk-12, ebk-23 etc.
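To make the chapter split described above concrete, here is a rough Python sketch of splitting a body at each h1 tag. It is not the plugin's implementation, which is not shown in this thread; the function name `split_on_h1` and the regex approach are assumptions chosen for brevity.

```python
import re

H1_START = re.compile(r"<h1\b", re.IGNORECASE)

def split_on_h1(body):
    """Split body markup into chunks, starting a new chunk at every <h1> tag."""
    starts = [match.start() for match in H1_START.finditer(body)]
    if not starts:
        return [body]            # no h1 at all: everything stays in one chunk
    bounds = [0] + starts + [len(body)]
    chunks = [body[a:b] for a, b in zip(bounds, bounds[1:])]
    return [chunk for chunk in chunks if chunk.strip()]

body = "<p>Title page</p><h1>Chapter One</h1><p>a</p><h1>Chapter Two</h1><p>b</p>"
for chunk in split_on_h1(body):
    print(chunk)
```

With a splitter like this, a document whose chapter headings are not exported as h1 would produce only the frontmatter chunk, so checking the exported html for h1 tags is a reasonable first step.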

slowsmile · 01-14-2017, 07:54 PM

@DiapDealer...Point taken regarding the python file names issue. Now looking into using epub_zip_up_book_contents from plugin utilities to create the zip file with utf-8 file names and file contents as suggested.

Regarding file names in the Sigil Book Browser after creating the zip file and epub using the above utility function -- are the file names in the Book Browser always indexed and in English?

Will the above utility also create a contents.xhtml file with the toc item names and toc heading in the correct locale language using the correct charset? Or is this governed by Sigil's Language settings in Preferences?

I'm also not using WinZip or PKZip to zip up the files. I'm mainly using WinRAR and 7-Zip.

KevinH · 01-14-2017, 08:38 PM

The order of filenames is determined by the order provided in the opf spine.

That utility method can be found in epub_utils.py here:

https://github.com/Sigil-Ebook/Sigil.../epub_utils.py

It assumes you have built a proper unpacked epub at ebook_path (the directory where the mimetype file exists) and simply creates the zip from it, properly special-casing the mimetype file.

There are also helper routines to create a container.xml, deal with obfuscating fonts if needed, etc.

Hope this helps,

Kevin
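For readers unfamiliar with the mimetype special case mentioned above: in an epub, the mimetype file has to be the first entry in the zip and must be stored uncompressed. The sketch below shows that idea with Python's standard zipfile module. It is an illustration only, not the code in Sigil's epub_utils.py, and the function name `zip_epub` is hypothetical.

```python
import os
import zipfile

def zip_epub(ebook_path, epub_filepath):
    """Sketch only, not Sigil's code: zip an unpacked epub with mimetype first."""
    with zipfile.ZipFile(epub_filepath, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry goes in first and is stored without compression.
        archive.write(os.path.join(ebook_path, "mimetype"), "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        # Everything else is added with forward-slash paths relative to the root.
        for folder, _dirs, files in os.walk(ebook_path):
            for filename in sorted(files):
                full = os.path.join(folder, filename)
                rel = os.path.relpath(full, ebook_path).replace(os.sep, "/")
                if rel != "mimetype":
                    archive.write(full, rel)
```

The sketch assumes that epub_filepath lies outside ebook_path; otherwise the walk would try to add the output file to itself.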

slowsmile · 01-14-2017, 11:09 PM

@Kevin...Thanks for that advice on the plugin utils. I've already been investigating these utils and the change to using the epub_zip_up_book_contents and other utils certainly seems workable. My own utils in the plugin can handle the rest I think.

I'm really just trying to minimize the hit on the plugin by making such radical changes. Currently the function that splits the html file into separate xhtml files is quite complex and does somewhat more than just split the files(it also helps to create the doc TOC and Nav TOC and creates the title page as well). If and when I do implement these major changes, I think it will be later rather than sooner because I would prefer to first iron out any other minor error problems that arise with the plugin already out there before implementing such a major change.

slowsmile · 01-15-2017, 03:30 AM

@Kevin & @DiapDealer...I've got epub_zip_up_book_contents() working in the plugin, and when it converts the Polish ebook -- Bracia Grim -- to epub, the epub is exactly the same as before -- the file names displayed in Sigil's Book Browser are DOS Latin and not UTF8-encoded names showing Polish characters. The contents.xhtml toc items are also exactly the same as before.

I must also add that at no point in my plugin app do I handle read/writes to and from files in anything else but UTF8.

And in my desperate trawlings for more information about zip files on the internet I stumbled across what might be a rather large gorilla in the room.

I found out that the Windows NTFS file system (used on Windows 7, 8 & 10) uses UTF16 for all file names.

So here's another question:

Can Python's ZipInfo object and flag bits be set to allow a UTF16 NTFS file name to be added to a zip file as UTF8? Or will the UTF16 filename automatically always revert to DOS Latin encoding instead in the zip archive? I'm asking this question because when I checked WinRAR's ability to change internal file name encoding by going to Options > Name Encoding in the app, there was no UTF16 option -- only UTF8.

Lastly, I'm quite open and willing to believe that Python's zipfile module can convert and store UTF8 file names, but as yet I have seen no evidence of this happening, either in my module or after using the epub_zip_up_book_contents() function from the Plugin Framework. It also hasn't helped that Python's documentation says almost nothing in detail about what the ZipInfo flag bits do and how to use them. I'm now off to try and perhaps find some decent and reliable flag bit code from Nullege or GitHub and the like.
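On the UTF16 question: in Python 3, file names arrive as `str` objects, which are abstract Unicode text, so the encoding the file system uses on disk does not carry over into the zip. The sketch below is only an illustration of that point, together with what a zip reader would show if it ignored the UTF-8 flag and decoded the name bytes as the old DOS code page (cp437). It is not a diagnosis of the plugin's output.

```python
name = "Rozdział_1.xhtml"

# A str is the same text whatever byte encoding it passes through.
assert name.encode("utf-16-le").decode("utf-16-le") == name

utf8_bytes = name.encode("utf-8")      # name bytes in a zip entry flagged as UTF-8
misread = utf8_bytes.decode("cp437")   # what a reader shows if it ignores the flag
repaired = misread.encode("cp437").decode("utf-8")

print(misread)            # garbled characters where the stroked l should be
print(repaired == name)   # the bytes were intact; only the decoding was wrong
```

In that failure mode the character turns into other symbols rather than disappearing. A name where the ł is simply gone, as in rozdzia_1.xhtml, points to the character being dropped before the zip is written.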

Doitsu · 01-15-2017, 08:45 AM

@slowsmile: The built-in epub_zip_up_book_contents() function has absolutely no problems with non-ASCII file names.

I've written a quick and dirty proof of concept input plugin that demonstrates this feature.

Here's the code:

Spoiler:

To test it, unpack the attached test.epub file, which contains two HTML files with accented characters and umlauts (äöüß.xhtml and âîïéêë.xhtml).

Then install the new junk plugin, run it, select the folder that you unpacked test.epub to, and click Yes to import the files.

Note that epubcheck will complain about file names that contain non-ASCII characters and spaces. I.e., even though you could theoretically use file names with non-ASCII characters I'd strongly advise against it.

st_albert · 01-14-2017, 01:45 PM · #49 (import log)

Spoiler:

Status: success
Python version: 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609]
Running OpenDocHTMLImport...
-- User input validation checks...
-- Main html file found...PASS
-- eBook cover file found...PASS
-- Input file validation checks...
-- Input html file is in OpenDoc HTML format...PASS
-- "Heading 1" style is used in the input html file....PASS
-- Start conversion to epub...
-- Gathering metadata...
-- Input file name = /home/u838190/tmp/scratch-epub/GalacticFrontiers_HTMLtest.html
-- Author name = Darrel Bain
-- Title = Galactic Frontiers
-- Cover file name = 9781606193709.jpg
-- Found 1 ebook images in your local dir
>>> html enc...utf-8
>>> chardet enc...ascii
-- Input file encoding is: UTF-8
-- Convert input file to utf-8 if required
-- Reformat and remove garbage from html styles...
-- Clean, fix and sanitize html garbage code...
-- Fix mixed encoding errors
-- Remove adhoc garbage code...
-- Remove all extraneous text spaces
-- Remove all hard line breaks(<br/>)
-- Remove all tab spaces
-- Remove all "dir", "lang", "name", "id", "align" and "link" attributes
-- Remove all anchors, bookmarks and page links
-- Remove all proprietary garbage code from the html file
-- Preserve and keep all external internet links
-- Remove all internal page links
-- Remove all line-height and font family declarations
-- Remove all isolated </p> tags and </span> tags
-- Remove div tags
-- Remove all page-break refs in styles
-- Cleanup punctuation...
-- Change dumb quotes to curly quotes
-- Convert triple periods to ellipsis
-- Remove the doc TOC if present
-- Create the stylesheet...
-- Creating the CSS file
-- Reformat the CSS File
-- Reformat and insert ebook images
-- Move HTML inline styles to CSS
-- Split all chapters/headers into separate xhtml files
-- Add meta headers to all the new html header files
-- Normalize the CSS file...
-- Remove unwanted attributes from the CSS
-- Remove adhoc garbage from the CSS
-- Add useful and helpful globals and presets to CSS
-- Adjust CSS body attributes
-- Convert absolute to relative values in the CSS
-- Build the zip archive...
-- Creating the cover image file
-- Create the TOC file
-- Create the container XML file
-- Create the toc.ncx XML file
-- Build the content.opf file
-- Add the Go To guides for toc, cover and begin read.
-- Tidy up and indent all XML layouts
-- Create the zip file archive...
-- Adding files to the new zip file...
-- Add mimetype file
-- Add container.xml file
-- Add toc.ncx file
-- Add content.opf file
-- Add ebook cover file
-- Add content.xhtml file
-- Add stylesheet file
-- Add cover image file
-- Add all ebook image files to the zip Images directory
-- Add all HTML text header files to the zip Text directory
-- Converting the zip archive to epub format...
-- An epub was SUCCESSFULLY created

Last edited by st_albert; 01-14-2017 at 05:36 PM. Reason: additional test
