[Plugin] OpenDocHTMLImport - Full ODF HTML(Writer) conversion to epub - Page 5

KevinH · 01-15-2017, 09:42 AM

Quote:

Originally Posted by slowsmile

@Kevin & @DiapDealer.
I found out that the Windows NTFS file system (used on Windows 7, 8 & 10) uses UTF-16 for all file names.

See Doitsu's test case. It works properly.

No worries about UTF-8 vs UTF-16, as both encodings can encode every code point in all of Unicode. That is simply not true of any of the single-byte encodings.

So somehow you are reading or writing filenames/paths in a Latin-1 encoding.
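To make the point above concrete, here is a small sketch (not taken from the plugin; the sample heading is made up for illustration). It shows a Polish string surviving UTF-8 and UTF-16 round trips, failing to encode as Latin-1, and turning into garbage when its UTF-8 bytes are wrongly decoded as Latin-1.

```python
# Illustration only: the sample heading is a made-up example.
heading = "Rozdział 1. Żółć"

# UTF-8 and UTF-16 can both represent every Unicode code point,
# so a round trip through either one gives back the same string.
for codec in ("utf-8", "utf-16"):
    assert heading.encode(codec).decode(codec) == heading

# A single-byte encoding such as Latin-1 has no slot for 'ł' or 'Ż'.
try:
    heading.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Decoding UTF-8 bytes as Latin-1 never raises, but the result is mojibake.
mojibake = heading.encode("utf-8").decode("latin-1")

print("Latin-1 can encode the heading:", latin1_ok)
print("Wrongly decoded text differs:", mojibake != heading)
```

The last step is the typical symptom: nothing fails, the text just comes out wrong, which is why this kind of bug tends to show up only with non-English book titles and file names.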

I will take a look at it.

KevinH

bravosx · 01-15-2017, 09:53 AM

@Doitsu...
I took the liberty of adding a section with Polish characters to your test.epub file for the tests.

Regards bravosx

st_albert · 01-15-2017, 11:28 AM

Quote:

Originally Posted by slowsmile

@st_albert...

If that doesn't cure your problem, could you please attach your html file and the equivalent ODT file in your next post, showing just the problem area (one complete chapter with its heading), so I can have a look at the formatting?

Still no joy. I will attach the files below, in a zip archive.

By the way, the novel is copyrighted, but I have permission from the publisher to upload this sample, which contains only the first three chapters. The .html and .odt files contain the frontmatter, which is properly included in the .epub, and the first three chapters, which are not included (although they appear in the toc.ncx and contents.xhtml as they should).

Hope this helps.

Albert

KevinH · 01-15-2017, 02:01 PM

@slowsmile

Quick question .. why do you need to use bs4 to convert to utf8 here?

Code:

def convertFile2UTF8(wdir, file, encoder):
    """ Converts input file to utf-8 format
    """
    print(' -- Convert input file to utf-8 if required')
    
    original_filename = file
    output = wdir + os.sep + 'fix_encoding.htm'
    outfp = open(output, 'wt', encoding=('utf-8'))
    html = open(file, 'rt', encoding=encoder).read()  
    
    # safely convert to unicode utf-8 using bs4
    soup = BeautifulSoup(html, 'html.parser')
    outfp.writelines(str(soup))
    
    outfp.close()          
    os.remove(file)
    shutil.copy(output, file)        
    os.remove(output)
    
    return(file)

It seems a strange way to do the conversion when you know the encoding.

A short way to handle this might be to use the built-in text encoding conversion when reading from and writing to files, like so:

Code:

    with open(file, 'rt', encoding=encoder) as f1:
        htmldat = f1.read()
    with open(wdir + os.sep + 'fix_encoding.htm', 'wt', encoding='utf-8') as f2:
        f2.write(htmldat)

Or you can read the file in as bytes in binary mode and write it back as UTF-8, using Python's built-in bytes.decode() and str.encode() methods:

Code:

    # read the raw bytes; the with block closes the file for us
    with open(file, 'rb') as f:
        htmldat = f.read()
    # decode converts bytes to a string
    htmlstr = htmldat.decode(encoder)
    # encode converts a string to bytes in that encoding
    with open(file, 'wb') as f:
        f.write(htmlstr.encode('utf-8'))

Either would work, unless there is something else specific you are trying to achieve by having bs4 parse the entire thing and then convert it all back to unicode?
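For anyone who wants to try this outside the plugin, here is a self-contained sketch of the bytes-based approach with a round-trip check. The function name `convert_to_utf8`, the sample text and the temporary file are assumptions made for this example; none of them come from the plugin source.

```python
import os
import tempfile


def convert_to_utf8(path, source_encoding):
    """Rewrite the text file at `path` as UTF-8, given its known encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    text = raw.decode(source_encoding)      # bytes -> str
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))       # str -> UTF-8 bytes
    return path


# Usage: write a small Windows-1250 file, convert it, read it back.
sample = "<p>Zażółć gęślą jaźń</p>"         # made-up test content
fd, tmp_path = tempfile.mkstemp(suffix=".htm")
os.close(fd)
with open(tmp_path, "w", encoding="cp1250") as f:
    f.write(sample)

convert_to_utf8(tmp_path, "cp1250")

with open(tmp_path, "r", encoding="utf-8") as f:
    result = f.read()
os.remove(tmp_path)

print("Round trip preserved the text:", result == sample)
```

Note that this only works when the source encoding is known and correct; if it is only a guess, decode() can raise UnicodeDecodeError or silently produce the wrong characters.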

Just wondering?

KevinH

slowsmile · 01-15-2017, 05:54 PM

@KevinH...I use BS to assure conversion of the html file to Unicode UTF8.

I have also just found the problem that was preventing Polish from displaying properly in the Book Browser and in the Table of Contents. The plugin now displays Polish properly for all headings both in the Book Browser and in the contents.xhtml. The problem was actually caused by one regex function that I use to clean up heading names. When I removed the regex function everything came right. I'm still testing the plugin now with different European language ebooks just to make sure. I will probably release the new version (v0.2.8) sometime today.

And thanks also for your advice above. I will store those code bits for later use in my utils library.

KevinH · 01-15-2017, 06:24 PM

FWIW... The encode or decode routines or file io approaches will accomplish exactly that without requiring a full parse cycle.

Glad to hear you tracked down and fixed the bug. Nicely done!

Thanks!

slowsmile · 01-15-2017, 06:36 PM

@st_albert...I've just had a look at the html for your ebook. The formatting is fine and, as I've said, because you've linked all your text styles to "Text Body", all your own style names are also being displayed in the epub.

When I converted your ebook to epub using the plugin it converted without any problems at all. I did find a problem with the begin read location in the guide section of the content.opf file -- my code could not find "Chapter One". The problem arose because you used chapter headings of the form "Chapter One", "Chapter Two" etc. rather than "Chapter 1", "Chapter 2". When I changed your headings to the latter form your Galactic Frontiers epub also passed EpubCheck.

I will try and put in a fix to accommodate chapter headings like "Chapter One", "Chapter Two". This will hopefully be done today. I'm currently testing another problem which has also been fixed. The fix for your problem will probably be in v0.2.8.

I've also sent you the epub version of your ebook that was converted using my plugin. See below.

slowsmile · 01-15-2017, 06:59 PM

@Doitsu & @Kevin...Also grateful for your explanations concerning NTFS UTF16 etc. That one had me gnawing my ankles with frustration...

slowsmile · 01-15-2017, 08:06 PM

To avoid any confusion about how my plugin converts or manipulates the user's inline styles and named text styles in html, I thought that I should explain it more for some clarity.

My plugin is probably unique as a converter in that it reformats or manipulates all html text styles (classes) and in-tag styling on 3 levels:

* If you have linked all your text styles to "Text Body" in OO or LO then all your named styles will show as classes in the html file. These user named styles will also therefore be ported to and will show in the epub as well.

* If you have not used named styles in OO or LO or if your text styles do not inherit "Text Body" then the plugin will use a complex algorithm (yes, the function code is a bit horrifying but it nevertheless works well) and do its best to determine what your inline text styling does and then it will convert your inline styling to a suitably named text class. There are four core text styles that are used for this in the epub CSS: ebk-centered-text, ebk-blocktext, ebk-text-with-indent and ebk-text-no-indent. This feature also helps to reduce the number of meaningless prefixed/indexed classes (which I have always disliked) in the epub html.

* Any in-tag text styling that cannot be determined will be converted to prefixed/indexed named classes of the form: ebk-5, ebk-12, ebk-23 etc.
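The three levels above could be sketched roughly as follows. This is only an illustration of the idea: the plugin's real algorithm is not shown in this thread, and the function names, the heuristics for picking a core class, and the way indexed classes are numbered are all assumptions. Only the class names (ebk-centered-text, ebk-blocktext, ebk-text-with-indent, ebk-text-no-indent, ebk-N) come from the description.

```python
def parse_style(style):
    """Turn 'text-align: center; text-indent: 0' into a dict of properties."""
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props


def _is_zero(value):
    """Hypothetical helper: '0', '0in' and '0.0cm' all count as zero."""
    number = value.rstrip("abcdefghijklmnopqrstuvwxyz%")
    try:
        return float(number) == 0
    except ValueError:
        return False


_indexed_classes = {}   # maps an unrecognised style to its 'ebk-N' class


def class_for(named_class, style):
    # Level 1: a user-named style is carried over unchanged.
    if named_class:
        return named_class

    props = parse_style(style or "")

    # Level 2: guess one of the four core classes (assumed heuristics).
    if props.get("text-align") == "center":
        return "ebk-centered-text"
    if "margin-left" in props and "margin-right" in props:
        return "ebk-blocktext"
    if "text-indent" in props:
        if _is_zero(props["text-indent"]):
            return "ebk-text-no-indent"
        return "ebk-text-with-indent"

    # Level 3: anything else gets a prefixed/indexed class, reused per style.
    key = ";".join("%s:%s" % item for item in sorted(props.items()))
    if key not in _indexed_classes:
        _indexed_classes[key] = "ebk-%d" % (len(_indexed_classes) + 1)
    return _indexed_classes[key]
```

For example, under these assumptions `class_for(None, "text-align: center")` gives ebk-centered-text, while an unrecognised style such as `"color: red"` falls through to an indexed class.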

In other words, from the above, my plugin app will try to adjust to the way you have styled your ODT doc and will give you the epub that you deserve. So if you've used named text styles linked with "Text Body" throughout your doc, then your epub html will look good and will be easy to work with. But if you style your doc without using "Text Body" or named styles -- your epub html won't look so good. Like I said, you get what you deserve with this plugin according to your own styling efforts within the ODT doc.

I also have to say that I couldn't have achieved the above without bs4 and pytidylib. And regarding html manipulations -- I'm now convinced that you can do anything you like in html using bs4. Anything.

slowsmile · 01-15-2017, 11:00 PM

@bravosx...Your problem has now been fixed in v0.2.8 which has just been released. Your ebook now displays correct Polish in both the Book Browser and in the Table of Contents(content.xhtml).

When you run EpubCheck you might also get this Warning:

WARNING(PKG-012): File name contains the following non-ascii characters: ?. Consider changing the filename.

This is only a warning -- not an error. The above warning should be ignored and will not stop you from uploading your ebook to Kindle or other epub vendors (I have also tested this). This incorrect warning is probably because EpubCheck does not use UTF-8 to check internal epub file names.

slowsmile · 01-15-2017, 11:06 PM

@st_albert...I've also fixed the problem concerning your use of "Chapter One", "Chapter Two" etc and the begin read guides problem. Begin read location will now accept: "Chapter 1" or "Chapter One" or "1". This fix is in v0.2.8 which has just been released. See Changes in the release notes for more details.
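As a rough sketch of what such a check could look like (this is not the plugin's code; the pattern and the function name `is_first_chapter` are assumptions for illustration), a single regular expression can accept "Chapter 1", "Chapter One" or a bare "1" while rejecting "Chapter 10" or "Chapter Two":

```python
import re

# Assumed pattern: optional surrounding text after the match is allowed,
# but the number itself must end at a word boundary so "10" is not taken as "1".
_FIRST_CHAPTER = re.compile(r"^\s*(?:chapter\s+(?:1|one)|1)\b", re.IGNORECASE)


def is_first_chapter(heading):
    """Return True if the heading looks like the first chapter."""
    return bool(_FIRST_CHAPTER.match(heading))


for heading in ("Chapter 1", "Chapter One", "1", "Chapter 10", "Chapter Two"):
    print(heading, "->", is_first_chapter(heading))
```

A sketch like this only covers English headings; headings in other languages would need their own patterns.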

bravosx · 01-16-2017, 08:26 AM

@slowsmile...

Quote:

Originally Posted by slowsmile

@bravosx...Your problem has now been fixed in v0.2.8 which has just been released...

A heartfelt thank you for the great work that makes the plugin work properly in my mother tongue, Polish. Once again, great respect to you and to all those who have contributed in any way. I will mention only a few: @KevinH, @Doitsu...

Now to the heart of the matter.

I ran a set of trials with a larger text, saving it from LibreOffice as HTML Document (Writer) with the character set chosen under Tools > Options > Load/Save > HTML Compatibility. In the Character set dropdown I selected, in turn:

1) Central European (Windows-1250/WinLatin2)
2) Western European (Windows-1252/WinLatin1)
3) UNICODE (UTF-8)

The text contains:
- a prologue (prolog. text with Polish characters),
- twelve chapters (each named according to the pattern: 1. text with Polish characters, 2. text with Polish characters, etc.),
- an epilogue (epilog. text with Polish characters).

In the Book Browser (Text folder), Polish characters are displayed correctly in all sections.

However, the Table Of Contents window displays only the words: prolog, the numbers from 1 to 12, and epilog (all without the additional text). I then added the word Rozdział to the chapter names in the .odt file (for example, "Rozdział 1. additional text"), saved it as html, and imported it into Sigil using the plugin; the Table Of Contents then displays the words: Prolog, Rozdział 1-12 and Epilog (again without the additional text).
But this is not a problem, because after pressing Ctrl+T and confirming with OK, the TOC entries are fixed and display properly, with correct Polish characters.

After running EpubCheck I received this warning:
WARNING(PKG-012): File name contains the following non-ascii characters: ?. Consider changing the filename.

But as you wrote in post #70, this is not a problem. You can get rid of this message by renaming the sections in the Book Browser (Text folder) so that the names contain no Polish characters.
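For anyone who prefers to generate ASCII-only section names up front, here is a small sketch (not part of the plugin; the function names are made up for this example). It lists the non-ASCII characters in a file name and proposes an ASCII-only alternative. Most Polish letters decompose into a base letter plus an accent under Unicode NFKD normalization, but 'ł' and 'Ł' do not, so they are mapped by hand.

```python
import unicodedata

# 'ł' and 'Ł' have no Unicode decomposition, so map them explicitly.
_EXTRA = str.maketrans({"ł": "l", "Ł": "L"})


def non_ascii_chars(name):
    """Return the distinct non-ASCII characters in a file name."""
    return sorted({ch for ch in name if ord(ch) > 127})


def ascii_safe(name):
    """Propose an ASCII-only version of a file name."""
    decomposed = unicodedata.normalize("NFKD", name.translate(_EXTRA))
    # Dropping everything outside ASCII removes the combining accents.
    return decomposed.encode("ascii", "ignore").decode("ascii")


name = "Rozdział_1.xhtml"
print(non_ascii_chars(name))
print(ascii_safe(name))
```

This only computes a candidate name; the file inside the epub still has to be renamed, for instance in the Book Browser as described above.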

I can't post the source material, or the html and epub files converted from it, on the forum, because I'm not sure about the copyright status of the text. Alternatively, I can send them by e-mail for inspection.

To summarize my lengthy text, I can confirm that the plugin works properly for my Polish language.

Once again sorry for my very poor knowledge of English and a big thanks. I greet all the members of this forum.

bravosx

st_albert · 01-16-2017, 01:09 PM

@slowsmile

Thanks for your efforts. However, on my Linux (Kubuntu, xenial, 16.04.1) box I'm still getting the same problem. Note that I am not using "bundled python" on this OS. The Python version on this machine is 3.5.2 (default, Nov 17 2016, 17:05:23).

Thinking it might be due to an OS problem, I installed sigil 0.9.7 on a Win-10 x64 machine, using bundled Python. The testplugin ran successfully, but the OpenDoc import plugin failed with the following log:

Spoiler:

The tidy.dll library exists in the path shown above.

This happened with both versions 0.2.7 and 0.2.8 of the plugin.
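A library file can exist and still fail to load, for example when its bitness does not match the interpreter or when a library it depends on is missing; this thread does not establish which, if either, applies here. As a hedged diagnostic sketch (the function name `diagnose_library` and the report layout are made up for this example), loading the file directly with ctypes shows the underlying OS error:

```python
import ctypes
import os
import struct


def diagnose_library(path):
    """Return a short report on an attempt to load a shared library."""
    report = {
        "exists": os.path.isfile(path),
        "python_bits": struct.calcsize("P") * 8,   # 32 or 64
        "loaded": False,
        "error": None,
    }
    try:
        ctypes.CDLL(path)
        report["loaded"] = True
    except OSError as exc:
        # The OS-level message is usually more specific than a wrapper's.
        report["error"] = str(exc)
    return report
```

Running it from the same Python that Sigil uses, with the full path to tidy.dll as the argument, would show whether the load fails there too and what the operating system reports.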

What OS are you using?

Albert

Doitsu · 01-16-2017, 01:48 PM

Quote:

Originally Posted by st_albert

Thanks for your efforts. However, on my Linux (Kubuntu, xenial, 16.04.1) box I'm still getting the same problem. Note that I am not using "bundled python" on this OS. The Python version on this machine is 3.5.2 (default, Nov 17 2016, 17:05:23).

I tested the new 0.2.8 version on my 64bit Arch Linux machine and your sample file imported without error messages, however, all text after the title page was skipped and the TOC is broken.
OTOH, I successfully tested the new 0.2.8 version on my Windows machine. Did you install the official Sigil release or a portable version?

@slowsmile: Unless you have access to a Linux machine, you might want to remove Linux from the list of supported operating systems in plugin.xml.

st_albert · 01-16-2017, 02:15 PM

Quote:

Originally Posted by Doitsu

I tested the new 0.2.8 version on my 64bit Arch Linux machine and your sample file imported without error messages, however, all text after the title page was skipped and the TOC is broken.
OTOH, I successfully tested the new 0.2.8 version on my Windows machine. Did you install the official Sigil release or a portable version?

Yes, that's what I'm seeing on (K)ubuntu.

As for the Win-10 version, I got the install file directly from Sigil-Ebook on Github. It installed smoothly, and even installed the MS runtime stuff with no problem. And, as I said, it passed the testplugin (ver 0.13 IIRC).

Albert


st_albert · 01-16-2017, 01:09 PM · Spoiler log

Status: failed
Python version: 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]
Running OpenDocHTMLImport...
-- User input validation checks...
-- Main html file found...PASS
-- eBook cover file found...PASS
-- Input file validation checks...
-- Input html file is in OpenDoc HTML format...PASS
-- "Heading 1" style is used in the input html file....PASS
-- Start conversion to epub...
-- Gathering metadata...
-- Input file name = X:/commons/scratch-epub/GalacticFrontiers_sample.html
-- Author name = Darrell Bain
-- Title = Galactic Frontiers
-- Cover file name = 9781606193709.jpg
-- Found 1 ebook images in your local dir
-- Input file encoding is: UTF-8
-- Convert input file to utf-8 if required
Traceback (most recent call last):
  File "C:\Program Files\Sigil\plugin_launchers\python\launcher.py", line 135, in launch
    self.exitcode = target_script.run(container)
  File "C:\Users\u838190\AppData\Local\sigil-ebook\sigil\plugins\OpenDocHTMLImport\plugin.py", line 86, in run
    epub_path = convert2Epub(html_file_path)
  File "C:\Users\u838190\AppData\Local\sigil-ebook\sigil\plugins\OpenDocHTMLImport\convert.py", line 75, in convert2Epub
    docTidyNoWrap(WDIR, file)
  File "C:\Users\u838190\AppData\Local\sigil-ebook\sigil\plugins\OpenDocHTMLImport\doc_tidy.py", line 120, in docTidyNoWrap
    html, errors = tidy_document(xhtml, options=base_options)
  File "C:\Users\u838190\AppData\Local\sigil-ebook\sigil\plugins\OpenDocHTMLImport\tidylib\tidy.py", line 293, in tidy_document
    return get_module_tidy().tidy_document(text, options)
  File "C:\Users\u838190\AppData\Local\sigil-ebook\sigil\plugins\OpenDocHTMLImport\tidylib\tidy.py", line 305, in get_module_tidy
    _tidy = Tidy()
  File "C:\Users\u838190\AppData\Local\sigil-ebook\sigil\plugins\OpenDocHTMLImport\tidylib\tidy.py", line 160, in __init__
    "\nCould not load library: " + self.libpath)
OSError: Could not load library: C:\Users\u838190\AppData\Local\sigil-ebook\sigil\plugins\OpenDocHTMLImport\tidylib\win64\tidy.dll
Error: Could not load library: C:\Users\u838190\AppData\Local\sigil-ebook\sigil\plugins\OpenDocHTMLImport\tidylib\win64\tidy.dll
