Old 01-14-2017, 10:19 AM   #46
slowsmile
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
bravosx...Thank you for sending those files, which are a great help.

After looking at your files, here are the conclusions I arrived at. To keep it short, I'm only going to describe what I found in your Bracia Grim_Cp-1250Latin2_Save as.html file:

* Your ebook only consists of 5 pages. My plugin is really meant for 200-300 page ebooks with or without images.

* A lot of your files for conversion had '.xhtml' extensions. They should all have '.html' extensions when exported from LibreOffice as html. Please don't use files with '.xhtml' extensions with the plugin.

* When I looked in your Bracia Grim ODT file in Options > Load/Save > HTML Compatibility, the character set had been set to "Big5" for some unknown reason. In fact, some of the html files that you sent me also had "Big5" set as their html character set. Why are you using a traditional Chinese character set for your Polish html file? In your LibreOffice application, please be sure to set your charset back to Windows-1250/Latin2, which is the correct character set for the Polish language.

* On conversion to epub with my plugin, both the title and the heading were correctly found and the epub contents.xhtml file was populated with just one toc item, which is correct behaviour for the plugin with your 5 page ebook with one heading.

* The xhtml files that you sent me had no DOCTYPE and no XMLNS headers -- both were missing. That means that they will even fail when you try to view them in Chrome browser. Never use xhtml files in the plugin -- they have a completely different layout and format compared to '.html' files.

* Despite the above charset and xhtml problems (which I didn't change or alter), when I converted your Bracia Grim_Cp_1250Latin2_Save as.html file to epub using my plugin, the correct charset -- Windows-1250/Latin2 -- was found by the plugin and the epub file also used the correct Polish charset with all ligatures and glyphs present and showing in Text View in Sigil. This file also passed EpubCheck first time, and when I converted this file to Kindle using Kindle Previewer 3.7 it converted without any problems and the Kindle displayed properly in the Polish language.
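The charset handling described above can be illustrated with a minimal sketch. This is not the plugin's actual code: the helper names `declared_charset` and `to_utf8` are hypothetical, and the sketch assumes the exported file declares its charset in a meta tag near the top, falling back to cp1250 when no declaration is found.

```python
import re

def declared_charset(raw, default="cp1250"):
    """Return the charset named in the first 2 KB of the file, else the default."""
    match = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default

def to_utf8(raw, default="cp1250"):
    """Decode bytes using the declared charset and re-encode them as UTF-8."""
    charset = declared_charset(raw, default)
    try:
        text = raw.decode(charset)
    except LookupError:
        # Unknown charset label: fall back to the assumed default.
        text = raw.decode(default)
    return text.encode("utf-8")
```

A wrong declaration (such as Big5 on a Windows-1250 file) would either raise a decode error or produce garbage in a sketch like this, which is why the LibreOffice export setting matters.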

As proof of the above I've sent you both the epub and Kindle mobi version of your unchanged Bracia Grim_Cp-1250Latin2_Save as.epub file. Please note that the Kindle version of your ebook also seems to display the Polish text correctly and so must also be using the correct windows-1250/latin2 charset.

See attachments below.
Attached Files
File Type: mobi Bracia Grim.mobi (1.66 MB, 306 views)
File Type: epub Bracia Grim.epub (768.5 KB, 349 views)

Last edited by slowsmile; 01-14-2017 at 10:35 AM.
Old 01-14-2017, 11:39 AM   #47
Doitsu
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
@slowsmile: It might be a bit difficult to spot at first glance if you don't know what to look for, but even in the epub that you generated TOC entries in the NCX TOC are missing Polish national characters.

For example, the chapter title of the first chapter is Rozdział 1. (Note the stroked L before the 1.) However this special character is missing in TOC.NCX:

Code:
    <navPoint id="navPoint-3" playOrder="3">
      <navLabel>
        <text>Rozdzia 1</text>
      </navLabel>
      <content src="Text/rozdzia_1.xhtml"/>
    </navPoint>
It should read:


Code:
    <navPoint id="navPoint-3" playOrder="3">
      <navLabel>
        <text>Rozdział 1</text>
      </navLabel>
      <content src="Text/rozdzia_1.xhtml"/>
    </navPoint>
I.e., there's a bug in the TOC generation code.

@bravosx: As a temporary fix, you could simply regenerate the TOC via CTRL+T. This should restore the missing characters since the actual headings contain them.
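One plausible way such a bug arises (an assumption, not something confirmed from the plugin's source) is that the same ASCII-only string is reused for both the file name and the TOC label. The sketch below keeps the two apart; `ascii_file_stem` and `toc_entry` are hypothetical names.

```python
import re
import unicodedata

def ascii_file_stem(heading):
    """Build an ASCII-only file stem; characters like the Polish stroked l are dropped."""
    decomposed = unicodedata.normalize("NFKD", heading)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_").lower()

def toc_entry(heading):
    """Keep the full Unicode heading as the label; only the file name is ASCII."""
    return {
        "label": heading.strip(),
        "src": "Text/" + ascii_file_stem(heading) + ".xhtml",
    }
```

With the heading from the example above, the label stays `Rozdział 1` while the file name becomes `Text/rozdzia_1.xhtml`, which matches the corrected NCX entry.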
Old 01-14-2017, 12:31 PM   #48
bravosx
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
Quote:
@bravosx: As a temporary fix, you could simply regenerate the TOC via CTRL+T. This should restore the missing characters since the actual headings contain them.
OK, actually this way it is possible to improve the TOC and seems to be fairly easy. Thanks for the tip.

The problem remains the text contained between the <title></title>, for example, the lack of Polish character in the word Rozdział:

Code:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xml:lang="pl-PL" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Rozdzia_1</title>
  <link href="../Styles/stylesheet.css" rel="stylesheet" type="text/css"/>
</head>
Thanks, bravosx
Old 01-14-2017, 01:45 PM   #49
st_albert
Guru
Posts: 695
Karma: 150000
Join Date: Feb 2010
Device: none
Just for a little change of pace, here's a problem I'm having.

I started with a LibreOffice .odt file containing a novel I'm working on. It has several custom styles, which all "inherit from" Text Body.

LO version is 5.1.4.2 on Kubuntu 16.04.1
plugin version is 0.2.7
Sigil version is 0.9.7

The writer HTML file was created via "save as" and selecting HTML document (writer) as the format.

The plugin runs without error, and correctly imports the cover image, builds a correct toc.ncx and a correct HTML contents file, but no text files are created after the frontmatter. That is to say, the frontmatter (title, copyright, and dedication) is included, but nothing is included from the first h1 tag on. The TOC refers to files like "../Text/chapter_one.xhtml" and so on, but those files are not present.

Must be something I'm doing wrong, or someone surely would have mentioned it before now.

Here's the import log, in case it is helpful:
Spoiler:

Status: success

Python version: 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609]

Running OpenDocHTMLImport...

-- User input validation checks...
-- Main html file found...PASS
-- eBook cover file found...PASS

-- Input file validation checks...
-- Input html file is in OpenDoc HTML format...PASS
-- "Heading 1" style is used in the input html file....PASS

-- Start conversion to epub...

-- Gathering metadata...
-- Input file name = /home/u838190/tmp/scratch-epub/GalacticFrontiers_HTMLtest.html
-- Author name = Darrel Bain
-- Title = Galactic Frontiers
-- Cover file name = 9781606193709.jpg
-- Found 1 ebook images in your local dir
>>> html enc...utf-8
>>> chardet enc...ascii
-- Input file encoding is: UTF-8
-- Convert input file to utf-8 if required
-- Reformat and remove garbage from html styles...

-- Clean, fix and sanitize html garbage code...
-- Fix mixed encoding errors
-- Remove adhoc garbage code...
-- Remove all extraneous text spaces
-- Remove all hard line breaks(<br/>)
-- Remove all tab spaces
-- Remove all "dir", "lang", "name", "id", "align" and "link" attributes
-- Remove all anchors, bookmarks and page links

-- Remove all proprietary garbage code from the html file
-- Preserve and keep all external internet links
-- Remove all internal page links
-- Remove all line-height and font family declarations
-- Remove all isolated </p> tags and </span> tags
-- Remove div tags
-- Remove all page-break refs in styles

-- Cleanup punctuation...
-- Change dumb quotes to curly quotes
-- Convert triple periods to ellipsis
-- Remove the doc TOC if present

-- Create the stylesheet...
-- Creating the CSS file
-- Reformat the CSS File
-- Reformat and insert ebook images
-- Move HTML inline styles to CSS
-- Split all chapters/headers into separate xhtml files
-- Add meta headers to all the new html header files

-- Normalize the CSS file...
-- Remove unwanted attributes from the CSS
-- Remove adhoc garbage from the CSS
-- Add useful and helpful globals and presets to CSS
-- Adjust CSS body attributes
-- Convert absolute to relative values in the CSS

-- Build the zip archive...
-- Creating the cover image file
-- Create the TOC file
-- Create the container XML file
-- Create the toc.ncx XML file
-- Build the content.opf file

-- Add the Go To guides for toc, cover and begin read.
-- Tidy up and indent all XML layouts

-- Create the zip file archive...
-- Adding files to the new zip file...
-- Add mimetype file
-- Add container.xml file
-- Add toc.ncx file
-- Add content.opf file
-- Add ebook cover file
-- Add content.xhtml file
-- Add stylesheet file
-- Add cover image file

-- Add all ebook image files to the zip Images directory
-- Add all HTML text header files to the zip Text directory
-- Converting the zip archive to epub format...

-- An epub was SUCCESSFULLY created



Thanks for any pointers.

Albert

Edited to add: Just tried it on another machine with LO 5.2.3.3, and other software versions same as stated above, and got the same result.

Last edited by st_albert; 01-14-2017 at 05:36 PM. Reason: additional test
Old 01-14-2017, 03:41 PM   #50
Doitsu
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by bravosx View Post
The problem remains the text contained between the <title></title>, for example, the lack of Polish character in the word Rozdział:
The <title>...</title> tag isn't used by epub apps. I.e., it can be empty or contain random characters. It only needs to be included for backwards compatibility with older apps; you also get the following epubcheck error, if it isn't included:

Code:
ERROR(RSC-005): Error while parsing file 'element "head" incomplete; missing required element "title"'.
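A quick way to check for this before running epubcheck is sketched below. It is only an illustration using Python's standard library; `has_title` is a hypothetical helper, and it assumes the file is well-formed XHTML in the usual XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def has_title(xhtml_bytes):
    """Return True if the document has a <head> that contains a <title> element."""
    root = ET.fromstring(xhtml_bytes)
    head = root.find(XHTML_NS + "head")
    return head is not None and head.find(XHTML_NS + "title") is not None
```

The check only looks for the element's presence, in line with the point above that its content does not matter to epub apps.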
Quote:
Originally Posted by st_albert View Post
The plugin runs without error, and correctly imports the cover image, builds a correct toc.ncx and a correct HTML contents file, but no text files are created after the frontmatter.
When I ran tests I also encountered similar problems with some files, but I chalked it up to LibreOffice/OS compatibility problems, and since I don't really need this plugin, I didn't investigate this further.
Attached Files
File Type: zip de_hunspell_utf8.zip (1,001.2 KB, 457 views)

Last edited by Doitsu; 01-17-2017 at 04:14 PM.
Old 01-14-2017, 04:16 PM   #51
bravosx
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
@Doitsu...

Quote:
Originally Posted by Doitsu View Post
The <title>...</title> tag isn't used by epub apps. I.e., it can be empty or contain random characters. It only needs to be included for backwards compatibility with older apps; you also get the following epubcheck error, if it isn't included:

Code:
ERROR(RSC-005): Error while parsing file 'element "head" incomplete; missing required element "title"'.
OK. Thanks for the explanation of the problem.

Regards bravosx
Old 01-14-2017, 06:41 PM   #52
slowsmile
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
@Doitsu...You are correct in saying that both the TOC names and file names in the Book Browser are not displaying with Polish characters. I derive these file names directly from the ebook file as utf-8. But unfortunately -- due to Python issue 27344 -- zip file names on Windows can only use DOS Latin and not utf8.

And since I derive both the xhtml file names and content.xhtml toc items in the same way I am unable to fix these problems at the moment. Regarding the content.xhtml toc names, I'm still looking into how I can generate the Polish names for the toc. One way to solve both the toc and file names problem would perhaps be -- as KevinH has suggested -- to generate the epub zip file by simply using the epub_zip_up_book_contents(ebook_path, epub_filepath) in Sigil's plugin utilities. This will be a major change requiring much testing, so I think I'll do that after the plugin has settled a bit regarding other more minor errors.

Last edited by slowsmile; 01-14-2017 at 11:31 PM.
Old 01-14-2017, 07:34 PM   #54
DiapDealer
Grand Sorcerer
Posts: 27,532
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by slowsmile View Post
But unfortunately -- due to Python issue 27344 -- zip file names on Windows can only use DOS Latin and not utf8.
As KevinH pointed out above, this is simply not true. The bug you're pointing to is a documentation issue only. Python's zipfile module from 2.7 on is perfectly capable of handling utf-8 filenames on Windows -- as are Sigil's internal (un)zip routines. Winzip and PKZip are the programs that are limited to DOS Latin file names on Windows. So unless you're telling us that you're using Winzip or PKZip as part of your plugin instead of Python's zipfile module (and I really hope you're not), the Python documentation issue you're pointing to just isn't relevant here.
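The claim about Python's zipfile module is easy to verify with a small in-memory round trip. The following is only a sketch; `roundtrip_names` is a hypothetical helper, and the archive lives in a `BytesIO` buffer so nothing touches the disk.

```python
import io
import zipfile

def roundtrip_names(names):
    """Write empty-ish members with the given names, then read the names back."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, "<html/>")
    with zipfile.ZipFile(buffer) as zf:
        return zf.namelist()
```

Calling it with names such as `Text/rozdział_1.xhtml` returns the same names unchanged, on Windows as elsewhere, because the names are Python strings all the way through.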

Last edited by DiapDealer; 01-14-2017 at 07:46 PM.
Old 01-14-2017, 07:42 PM   #55
slowsmile
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
@st_albert...On conversion to epub, the individual chapters are selected and a file split occurs in the html file if they are correctly formatted and styled with "Heading 1" style (h1) in OO or LO. Your h1 headers can either be directly styled with h1 or your named style can also be linked with h1 to be selected. Your "Heading 1" style in OO or LO should always be linked with "Heading" style in the Styles Organizer. Do not link h1 style with "Default", "Text Body" or any other style in OO or LO.

If that doesn't cure your problem, could you please attach your html file and the equivalent ODT file to your next post, showing just the problem area (one complete chapter with its heading), so I can have a look at the formatting?

I suspect that your h1 style has been set up in the wrong way in OO or LO regarding inheritance.

Just to also mention that you should be linking all your own named text styles(text styles only, not headings styles) with the "Text Body" style. Doing this will ensure that all your own named text styles or classes will appear in the epub html in Sigil. So doing this prevents all your inline text styles from automatically being converted to meaningless prefixed/indexed style names like ebk-3, ebk-12, ebk-23 etc.
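To make the Heading 1 split described at the top of this post concrete, here is a rough sketch of the idea. It is not the plugin's real splitting function (which does considerably more); `split_at_h1` is a hypothetical name and the sketch assumes every chapter starts with an `<h1>` tag.

```python
import re

def split_at_h1(body_html):
    """Return (frontmatter, chapters), where each chapter starts at an <h1> tag."""
    starts = [m.start() for m in re.finditer(r"<h1\b", body_html, flags=re.IGNORECASE)]
    if not starts:
        # No h1 found: everything is frontmatter and no chapter files result.
        return body_html, []
    bounds = starts + [len(body_html)]
    chapters = [body_html[a:b] for a, b in zip(bounds, bounds[1:])]
    return body_html[:starts[0]], chapters
```

If the headings in the exported html do not come out as `<h1>` elements, a split like this finds nothing to cut on, which is the kind of failure the style-linking advice above is meant to prevent.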

Last edited by slowsmile; 01-14-2017 at 11:33 PM.
Old 01-14-2017, 07:54 PM   #56
slowsmile
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
@DiapDealer...Point taken regarding the python file names issue. Now looking into using epub_zip_up_book_contents from plugin utilities to create the zip file with utf-8 file names and file contents as suggested.

Regarding file names in the Sigil Book Browser after creating the zip file and epub using the above utility function -- are the file names in the Book Browser always indexed and in English?

Will the above utility also create a contents.xhtml file with the toc item names and toc heading in the correct locale language using the correct charset? Or is this governed by Sigil's Language settings in Preferences?

I'm also not using WinZip or PKZip to zip up the files. I'm mainly using WinRAR and 7-Zip.

Last edited by slowsmile; 01-14-2017 at 08:40 PM.
Old 01-14-2017, 08:38 PM   #57
KevinH
Sigil Developer
Posts: 7,602
Karma: 5433388
Join Date: Nov 2009
Device: many
The order of filenames is determined by the order provided in the opf spine.

That utility method can be found in epub_utils.py here:

https://github.com/Sigil-Ebook/Sigil.../epub_utils.py

It assumes you have built a proper unpacked epub at ebook_path (the directory where the mimetype file exists) and simply creates the zip from it, properly special-casing the mimetype file.

There are also helper routines to create a container.xml, deal with obfuscating fonts if needed, etc.
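For readers who have not looked at that file, the mimetype special-casing mentioned above can be sketched with the standard library alone. This is not a copy of Sigil's `epub_zip_up_book_contents`; `zip_unpacked_epub` is a hypothetical stand-in that only shows the general idea (mimetype first and uncompressed, everything else deflated).

```python
import os
import zipfile

def zip_unpacked_epub(unpacked_dir, epub_path):
    """Zip an unpacked epub folder; epub_path must lie outside unpacked_dir."""
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype file goes in first and is stored without compression.
        zf.write(os.path.join(unpacked_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for root, dirs, files in os.walk(unpacked_dir):
            dirs.sort()
            for name in sorted(files):
                full = os.path.join(root, name)
                # Zip member names always use forward slashes.
                arc = os.path.relpath(full, unpacked_dir).replace(os.sep, "/")
                if arc == "mimetype":
                    continue
                zf.write(full, arc, compress_type=zipfile.ZIP_DEFLATED)
```

In a real plugin the built-in utility is the better choice, since it is what Sigil itself ships and tests.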

Hope this helps,

Kevin
Old 01-14-2017, 11:09 PM   #58
slowsmile
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
@Kevin...Thanks for that advice on the plugin utils. I've already been investigating these utils and the change to using the epub_zip_up_book_contents and other utils certainly seems workable. My own utils in the plugin can handle the rest I think.

I'm really just trying to minimize the hit on the plugin from making such radical changes. Currently the function that splits the html file into separate xhtml files is quite complex and does somewhat more than just split the files (it also helps to create the doc TOC and Nav TOC and creates the title page as well). If and when I do implement these major changes, I think it will be later rather than sooner, because I would prefer to first iron out any other minor errors that arise with the plugin already out there before implementing such a major change.

Last edited by slowsmile; 01-14-2017 at 11:26 PM.
Old 01-15-2017, 03:30 AM   #59
slowsmile
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
@Kevin & @DiapDealer...I've got the epub_zip_up_book_contents() working in the plugin and when it converts the Polish ebook -- Bracia Grim -- to epub, the epub is exactly the same as before -- the file names displayed in the epub in Sigil's Book Browser are DOS Latin and not in UTF8 encoding showing Polish characters. The contents.xhtml toc items are also exactly the same as before.

I must also add that at no point in my plugin app do I handle read/writes to and from files in anything else but UTF8.

And in my desperate trawlings for more information about zip files on the internet I stumbled across what might be a rather large gorilla in the room.

I found out that the Windows NTFS file system (used on Windows 7, 8 & 10) uses UTF16 for all file names.

So here's another question:

Can python's ZipInfo object and flag bits be set to allow a UTF16 NTFS file name to be added to a zip file as UTF8? Or will the UTF16 filename automatically always revert to DOS Latin encoding instead in the zip archive? I'm asking this question because when I checked WinRAR's ability to change internal file name encoding by going to Options > Name Encoding in the app, there was no UTF16 option -- only UTF8.

Lastly, I'm quite open and willing to believe that python's ZipFile module can convert and store UTF8 file names, but as yet I have seen no evidence of this happening either in my module or after using the epub_zip_up_book_contents() function from the Plugin Framework. It also hasn't helped that Python's documentation appears to be absolutely nil concerning proper detailed descriptions of what the zipinfo flag bits do and how to use them. I'm now off to try and perhaps find some decent and reliable flag bit code from Nullege or GitHub and the like.
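On the flag bits question, a small experiment may help. In Python 3, file names reach your code as Unicode strings regardless of how the file system stores them, and zipfile sets general purpose bit 11 (0x800) on its own whenever a member name is not plain ASCII. The sketch below just reads that bit back; `utf8_flags` is a hypothetical helper, not part of any plugin.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: member name is UTF-8 encoded

def utf8_flags(names):
    """Map each member name to whether zipfile marked it as UTF-8 encoded."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    with zipfile.ZipFile(buffer) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
                for info in zf.infolist()}
```

Nothing has to be set by hand here: ASCII names are left unflagged and non-ASCII names are flagged automatically, so missing characters are more likely lost before the names ever reach the zip step.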
Old 01-15-2017, 08:45 AM   #60
Doitsu
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
@slowsmile: The built-in epub_zip_up_book_contents() function has absolutely no problems with non-ASCII file names.

I've written a quick and dirty proof of concept input plugin that demonstrates this feature.

Here's the code:

Spoiler:
Code:
#!/usr/bin/env python
import os
import sys
import shutil
import tempfile
import tkinter.filedialog
from contextlib import contextmanager
from epub_utils import epub_zip_up_book_contents

# DiapDealer's temp folder code
@contextmanager
def make_temp_directory():
    temp_dir = tempfile.mkdtemp()
    try:
        yield temp_dir
    finally:
        # always clean up, even if the body of the with block raises
        shutil.rmtree(temp_dir)

def run(bk):
    # askdirectory() returns an empty value (not None) when the dialog is cancelled
    unpacked_epub = tkinter.filedialog.askdirectory(title='Select the epub folder.')
    if unpacked_epub and os.path.isfile(os.path.join(unpacked_epub, 'mimetype')):
        with make_temp_directory() as epub_td:
            epub_path = os.path.join(epub_td, 'temp.epub')
            # zip the unpacked folder into a temporary epub
            epub_zip_up_book_contents(unpacked_epub, epub_path)
            # read the finished epub back as raw bytes
            with open(epub_path, 'rb') as fp:
                data = fp.read()
            bk.addotherfile('dummy.epub', data)
        return 0
    else:
        print('Folder selection error.')
        return -1

def main():
    print('I reached main when I should not have\n')
    return -1

if __name__ == "__main__":
    sys.exit(main())


To test it unpack the attached test.epub file, which contains two HTML files with accented characters and umlauts (äöüß.xhtml and âîïéêë.xhtml).

Then install the new junk plugin, run it, select the folder that you unpacked test.epub to, and click Yes to import the files.

Note that epubcheck will complain about file names that contain non-ASCII characters and spaces. I.e., even though you could theoretically use file names with non-ASCII characters I'd strongly advise against it.
Attached Files
File Type: zip junk_v0.2.zip (1.0 KB, 293 views)
File Type: epub test.epub (3.2 KB, 336 views)
All times are GMT -4.