01-14-2017, 10:19 AM | #46 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
bravosx...Thank you for sending those files, which are a great help.
After looking at your files, here are the conclusions I arrived at. To keep it short, I'm only going to describe what I found in your Bracia Grim_Cp-1250Latin2_Save as.html file:
* Your ebook only consists of 5 pages. My plugin is really meant for 200-300 page ebooks with or without images.
* A lot of your files for conversion had '.xhtml' extensions. They should all have '.html' extensions when exported from LibreOffice as html. Please don't use files with '.xhtml' extensions with the plugin.
* When I looked at your Bracia Grim ODT file, the character set in Options > Load/Save > HTML Compatibility had been set to "Big5" for some unknown reason. In fact, some of the html files that you sent me also had "Big5" set as their html character set. Why are you using a traditional Chinese character set for your Polish html file? In your LibreOffice application, please be sure to set your charset back to Windows-1250/Latin2, which is the correct character set for the Polish language.
* On conversion to epub with my plugin, both the title and the heading were correctly found and the epub contents.xhtml file was populated with just one toc item, which is correct behaviour for the plugin with your 5 page ebook with one heading.
* All the xhtml files that you sent me had no DOCTYPE and no XMLNS headers -- both were missing. That means that they will even fail when you try to view them in the Chrome browser. Never use xhtml files in the plugin -- they have a completely different layout and format compared to '.html' files.
* Despite the above charset and xhtml problems (which I didn't change or alter), when I converted your Bracia Grim_Cp-1250Latin2_Save as.html file to epub using my plugin, the correct charset -- Windows-1250/Latin2 -- was found by the plugin and the epub file also used the correct Polish charset, with all ligatures and glyphs present and showing in Text View in Sigil.
This file also passed EpubCheck first time, and when I converted this file to Kindle using Kindle Previewer 3.7 it converted without any problems and the Kindle version displayed properly in the Polish language. As proof of the above I've sent you both the epub and Kindle mobi version of your unchanged Bracia Grim_Cp-1250Latin2_Save as.epub file. Please note that the Kindle version of your ebook also seems to display the Polish text correctly and so must also be using the correct Windows-1250/Latin2 charset. See attachments below.
Last edited by slowsmile; 01-14-2017 at 10:35 AM. |
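For anyone following along, here is a minimal sketch of the charset point above: read the charset an HTML export declares in its `<meta>` tag and decode the bytes with it. The helper name `decode_declared` and the fallback value are assumptions made for illustration; this is not the plugin's actual detection code.

```python
import re

# Hypothetical helper, not taken from the plugin: find the charset that an
# HTML file declares in its <meta> tag and decode the raw bytes with it.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_declared(raw, fallback="windows-1250"):
    """Return (text, charset), using the declared charset or the fallback."""
    match = META_CHARSET.search(raw)
    charset = match.group(1).decode("ascii") if match else fallback
    return raw.decode(charset), charset


# A tiny stand-in for a LibreOffice HTML export saved as Windows-1250.
sample = (
    '<html><head><meta http-equiv="content-type" '
    'content="text/html; charset=windows-1250"></head>'
    '<body><h1>Rozdział 1</h1></body></html>'
).encode("windows-1250")

text, charset = decode_declared(sample)
print(charset, "ł" in text)

# Decoding the same bytes with the wrong charset does not give the same text.
wrong = sample.decode("big5", errors="replace")
print(wrong == text)
```

The point of the sketch is only that the declared charset and the bytes on disk have to agree; a file declared as Big5 but saved as Windows-1250 (or the reverse) will lose the Polish characters on decoding.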
01-14-2017, 11:39 AM | #47 |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
@slowsmile: It might be a bit difficult to spot at first glance if you don't know what to look for, but even in the epub that you generated TOC entries in the NCX TOC are missing Polish national characters.
For example, the chapter title of the first chapter is Rozdział 1. (Note the stroked L before the 1.) However this special character is missing in TOC.NCX: Code:
<navPoint id="navPoint-3" playOrder="3">
<navLabel>
<text>Rozdzia 1</text>
</navLabel>
<content src="Text/rozdzia_1.xhtml"/>
</navPoint>
With the character intact, the entry would read: Code:
<navPoint id="navPoint-3" playOrder="3">
<navLabel>
<text> Rozdział 1</text>
</navLabel>
<content src="Text/rozdzia_1.xhtml"/>
</navPoint>
@bravosx: As a temporary fix, you could simply regenerate the TOC via CTRL+T. This should restore the missing characters since the actual headings contain them. |
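A quick way to spot this kind of loss without reading the NCX by eye is to compare the headings against the NCX labels. The following is a rough sketch using only the standard library; the function names `ncx_labels` and `missing_from_toc` are made up for this example and the sample NCX is trimmed to the one entry quoted above.

```python
import xml.etree.ElementTree as ET

# Namespace used by NCX documents.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def ncx_labels(ncx_xml):
    """Return the text of every navLabel in an NCX document."""
    root = ET.fromstring(ncx_xml)
    return [
        (node.text or "").strip()
        for node in root.iterfind(".//ncx:navLabel/ncx:text", NCX_NS)
    ]


def missing_from_toc(headings, ncx_xml):
    """Return the headings that do not appear verbatim among the NCX labels."""
    labels = set(ncx_labels(ncx_xml))
    return [heading for heading in headings if heading not in labels]


sample_ncx = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="navPoint-3" playOrder="3">
      <navLabel><text>Rozdzia 1</text></navLabel>
      <content src="Text/rozdzia_1.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""

print(missing_from_toc(["Rozdział 1"], sample_ncx))
```

Any heading reported here has a TOC label that differs from the heading text, which is the situation regenerating the TOC is meant to repair.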
01-14-2017, 12:31 PM | #48 | |
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
|
Quote:
The problem remains in the text contained between the <title></title> tags. For example, the Polish character is missing from the word Rozdział: Code:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="pl-PL" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Rozdzia_1</title>
<link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css"/>
</head>
|
|
01-14-2017, 01:45 PM | #49 |
Guru
Posts: 696
Karma: 150000
Join Date: Feb 2010
Device: none
|
Just for a little change of pace, here's a problem I'm having.
I started with a LibreOffice .odt file containing a novel I'm working on. It has several custom styles, which all "inherit from" Text Body.
LO version is 5.1.4.2 on Kubuntu 16.04.1
plugin version is 0.2.7
Sigil version is 0.9.7
The Writer HTML file was created via "Save As" and selecting HTML Document (Writer) as the format. The plugin runs without error, and correctly imports the cover image, builds a correct toc.ncx and a correct HTML contents file, but no text files are created after the frontmatter. That is to say, the frontmatter (title, copyright, and dedication) is included, but nothing is included from the first h1 tag on. The TOC refers to files like "../Text/chapter_one.xhtml" and so on, but those files are not present. Must be something I'm doing wrong, or someone surely would have mentioned it before now. Here's the import log, in case it is helpful: Spoiler:
Thanks for any pointers.
Albert
Edited to add: Just tried it on another machine with LO 5.2.3.3, and other software versions same as stated above, and got the same result.
Last edited by st_albert; 01-14-2017 at 05:36 PM. Reason: additional test |
01-14-2017, 03:41 PM | #50 | |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
Code:
ERROR(RSC-005): Error while parsing file 'element "head" incomplete; missing required element "title"'.
Last edited by Doitsu; 01-17-2017 at 04:14 PM. |
|
01-14-2017, 04:16 PM | #51 | |
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
|
@Doitsu...
Quote:
Regards bravosx |
|
01-14-2017, 06:41 PM | #52 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@Doitsu...You are correct in saying that both the TOC names and file names in the Book Browser are not displaying with Polish characters. I derive these file names directly from the ebook file as UTF-8. But unfortunately -- due to Python issue 27344 -- zip file names on Windows can only use DOS Latin and not UTF-8.
And since I derive both the xhtml file names and the contents.xhtml toc items in the same way, I am unable to fix these problems at the moment. Regarding the contents.xhtml toc names, I'm still looking into how I can generate the Polish names for the toc. One way to solve both the toc and file names problem would perhaps be -- as KevinH has suggested -- to generate the epub zip file by simply using epub_zip_up_book_contents(ebook_path, epub_filepath) from Sigil's plugin utilities. This will be a major change requiring much testing, so I think I'll do that after the plugin has settled a bit regarding other more minor errors.
Last edited by slowsmile; 01-14-2017 at 11:31 PM. |
01-14-2017, 07:04 PM | #53 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
This is a Duplicate.
Last edited by slowsmile; 01-14-2017 at 07:49 PM. |
01-14-2017, 07:34 PM | #54 | |
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Last edited by DiapDealer; 01-14-2017 at 07:46 PM. |
|
01-14-2017, 07:42 PM | #55 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@st_albert...On conversion to epub, the individual chapters are selected and a file split occurs in the html file, provided the chapter headings are correctly formatted and styled with the "Heading 1" style (h1) in OO or LO. Your h1 headers can either be styled directly with h1, or your own named style can be linked with h1 so that it is selected. Your "Heading 1" style in OO or LO should always be linked with the "Heading" style in the Styles Organizer. Do not link the h1 style with "Default", "Text Body" or any other style in OO or LO.
If that doesn't cure your problem, could you please attach your html file and the equivalent ODT file in your next post, showing just the problem area -- just one complete chapter with its heading -- so I can have a look at the formatting? I suspect that your h1 style has been set up in the wrong way in OO or LO regarding inheritance. I should also mention that you should be linking all your own named text styles (text styles only, not heading styles) with the "Text Body" style. Doing this will ensure that all your own named text styles or classes will appear in the epub html in Sigil. It also prevents your inline text styles from automatically being converted to meaningless prefixed/indexed style names like ebk-3, ebk-12, ebk-23 etc.
Last edited by slowsmile; 01-14-2017 at 11:33 PM. |
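To make the "file split at h1" idea concrete, here is a simplified sketch of cutting a body at each h1 start tag. It is only an illustration under the assumption that chapters begin at `<h1>`; the function name `split_at_h1` is invented here, and the plugin's real splitter (which also builds the TOC and title page) is certainly more involved.

```python
import re


def split_at_h1(body_html):
    """Split body markup into (frontmatter, [chapter, ...]) at each <h1> tag.

    Hypothetical helper for illustration: each chapter starts at an <h1>
    start tag and runs up to the next <h1> or the end of the body.
    """
    starts = [m.start() for m in re.finditer(r"<h1\b", body_html, re.IGNORECASE)]
    if not starts:
        # No h1 found: nothing can be split off, everything is frontmatter.
        return body_html, []
    front = body_html[:starts[0]]
    bounds = starts + [len(body_html)]
    chapters = [body_html[a:b] for a, b in zip(bounds, bounds[1:])]
    return front, chapters


body = (
    '<p class="title">My Novel</p>'
    '<h1>Chapter One</h1><p>First text.</p>'
    '<h1 class="western">Chapter Two</h1><p>Second text.</p>'
)
front, chapters = split_at_h1(body)
print(len(chapters))
```

A splitter of this kind depends entirely on the headings reaching the HTML as real h1 elements, which is why the style linkage in OO or LO matters so much.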
01-14-2017, 07:54 PM | #56 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@DiapDealer...Point taken regarding the Python file names issue. Now looking into using epub_zip_up_book_contents from the plugin utilities to create the zip file with UTF-8 file names and file contents as suggested.
Regarding file names in the Sigil Book Browser after creating the zip file and epub using the above utility function -- are the file names in the Book Browser always indexed and in English? Will the above utility also create a contents.xhtml file with the toc item names and toc heading in the correct locale language using the correct charset? Or is this governed by Sigil's Language settings in Preferences? I'm also not using WinZip or PKZip to zip up the files. I'm mainly using WinRAR and 7-Zip.
Last edited by slowsmile; 01-14-2017 at 08:40 PM. |
01-14-2017, 08:38 PM | #57 |
Sigil Developer
Posts: 7,644
Karma: 5433388
Join Date: Nov 2009
Device: many
|
The order of filenames is determined by the order provided in the opf spine.
That utility method can be found in epub_utils.py here: https://github.com/Sigil-Ebook/Sigil.../epub_utils.py It assumes you have built a proper unpacked epub at ebook_path (the directory where the mimetype file exists) and simply creates the zip from it, properly special-casing the mimetype file. There are also helper routines to create a container.xml, deal with obfuscating fonts if needed, etc. Hope this helps, Kevin |
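For readers who have not looked at how an epub is zipped, the sketch below shows the general idea of special-casing the mimetype file: it goes in first and uncompressed, and everything else follows. This is an independent illustration written with the standard zipfile module, not the code from epub_utils.py; the function name `zip_unpacked_epub` and the demo file names are assumptions.

```python
import os
import tempfile
import zipfile


def zip_unpacked_epub(ebook_path, epub_filepath):
    """Zip an unpacked epub folder, writing 'mimetype' first and uncompressed.

    Hypothetical sketch, not Sigil's implementation. epub_filepath should be
    outside ebook_path so the archive does not try to include itself.
    """
    with zipfile.ZipFile(epub_filepath, "w") as zf:
        # The mimetype entry must come first and be stored, not deflated.
        zf.write(os.path.join(ebook_path, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for folder, _dirs, files in os.walk(ebook_path):
            for name in sorted(files):
                full = os.path.join(folder, name)
                # Zip entry names always use forward slashes.
                arcname = os.path.relpath(full, ebook_path).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue
                zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)


# Small demonstration with a throwaway unpacked "book".
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "book")
    os.makedirs(os.path.join(src, "OEBPS", "Text"))
    with open(os.path.join(src, "mimetype"), "w", encoding="ascii") as f:
        f.write("application/epub+zip")
    with open(os.path.join(src, "OEBPS", "Text", "chapter_one.xhtml"),
              "w", encoding="utf-8") as f:
        f.write("<html/>")
    out = os.path.join(tmp, "book.epub")
    zip_unpacked_epub(src, out)
    with zipfile.ZipFile(out) as zf:
        entries = [(info.filename, info.compress_type) for info in zf.infolist()]

print(entries)
```

The demo book is deliberately incomplete (no container.xml or OPF), so the result is not a valid epub; it only shows the ordering and compression of the entries.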
01-14-2017, 11:09 PM | #58 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@Kevin...Thanks for that advice on the plugin utils. I've already been investigating these utils and the change to using the epub_zip_up_book_contents and other utils certainly seems workable. My own utils in the plugin can handle the rest I think.
I'm really just trying to minimize the hit on the plugin from making such radical changes. Currently the function that splits the html file into separate xhtml files is quite complex and does somewhat more than just split the files (it also helps to create the doc TOC and Nav TOC and creates the title page as well). If and when I do implement these major changes, I think it will be later rather than sooner, because I would prefer to first iron out any other minor errors that arise with the plugin already out there before implementing such a major change.
Last edited by slowsmile; 01-14-2017 at 11:26 PM. |
01-15-2017, 03:30 AM | #59 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@Kevin & @DiapDealer...I've got epub_zip_up_book_contents() working in the plugin, and when it converts the Polish ebook -- Bracia Grim -- to epub, the epub is exactly the same as before -- the file names displayed in the epub in Sigil's Book Browser are DOS Latin and not in UTF-8 encoding showing Polish characters. The contents.xhtml toc items are also exactly the same as before.
I must also add that at no point in my plugin app do I handle reads/writes to and from files in anything else but UTF-8. And in my desperate trawlings for more information about zip files on the internet I stumbled across what might be a rather large gorilla in the room. I found out that the Windows NTFS file system (used on Windows 7, 8 & 10) uses UTF-16 for all file names. So here's another question: Can Python's ZipInfo object and flag bits be set to allow a UTF-16 NTFS file name to be added to a zip file as UTF-8? Or will the UTF-16 filename automatically always revert to DOS Latin encoding instead in the zip archive? I'm asking this question because when I checked WinRAR's ability to change internal file name encoding by going to Options > Name Encoding in the app, there was no UTF-16 option -- only UTF-8. Lastly, I'm quite open and willing to believe that Python's ZipFile module can convert and store UTF-8 file names, but as yet I have seen no evidence of this happening either in my module or after using the epub_zip_up_book_contents() function from the Plugin Framework. It also hasn't helped that Python's documentation appears to be absolutely nil concerning proper detailed descriptions of what the ZipInfo flag bits do and how to use them. I'm now off to try and perhaps find some decent and reliable flag bit code from Nullege or GitHub and the like. |
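On the flag bits question, a small experiment may help more than the documentation. As far as I understand Python 3's zipfile, entry names that are not pure ASCII are written as UTF-8 with general purpose bit 11 (0x800) set, and names are handled as ordinary str objects no matter how the filesystem stores them. The sketch below writes two entries to an in-memory zip and reads the flag back; the entry names are examples only.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the entry name is UTF-8 encoded

# Build a zip in memory with one non-ASCII and one plain ASCII entry name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Text/rozdział_1.xhtml", "<html/>")
    zf.writestr("Text/chapter_2.xhtml", "<html/>")

# Read the archive back and report which entries carry the UTF-8 flag.
with zipfile.ZipFile(buf) as zf:
    report = {info.filename: bool(info.flag_bits & UTF8_FLAG)
              for info in zf.infolist()}

print(report)
```

If the Polish name comes back intact with the flag set, the zip layer is probably not where the character is being lost, and it would be worth checking how the name is built before it ever reaches the zip.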
01-15-2017, 08:45 AM | #60 |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
@slowsmile: The built-in epub_zip_up_book_contents() function has absolutely no problems with non-ASCII file names.
I've written a quick and dirty proof of concept input plugin that demonstrates this feature. Here's the code: Spoiler:
To test it unpack the attached test.epub file, which contains two HTML files with accented characters and umlauts (äöüß.xhtml and âîïéêë.xhtml). Then install the new junk plugin, run it, select the folder that you unpacked test.epub to, and click Yes to import the files. Note that epubcheck will complain about file names that contain non-ASCII characters and spaces. I.e., even though you could theoretically use file names with non-ASCII characters I'd strongly advise against it. |
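Following the advice to avoid non-ASCII file names, one option is to keep the full Polish title for display and derive a plain ASCII name only for the file. The sketch below is one possible approach, with a made-up function name `ascii_file_name` and a small, incomplete substitution table. It also shows that Unicode normalization alone silently drops the ł, which happens to match the "Rozdzia" symptom in this thread, although I have not checked whether that is what the plugin does.

```python
import re
import unicodedata

# Letters that NFKD normalization cannot split into base letter + accent.
# This table is an assumption for illustration and is far from complete.
EXTRA = str.maketrans({"ł": "l", "Ł": "L", "ß": "ss",
                       "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"})


def ascii_file_name(title, ext=".xhtml"):
    """Derive an ASCII-only file name from a heading; the title is untouched."""
    folded = unicodedata.normalize("NFKD", title.translate(EXTRA))
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_").lower()
    return (slug or "section") + ext


# Normalization on its own loses the stroked L entirely.
naive = unicodedata.normalize("NFKD", "Rozdział").encode("ascii", "ignore")
print(naive)

print(ascii_file_name("Rozdział 1"))
print(ascii_file_name("äöüß"))
```

The heading text itself should still be written to the TOC and the title element unchanged; only the file name gets folded.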
Tags |
conversion, epub, html, odf, opendoc |