01-15-2017, 09:42 AM | #61 | |
Sigil Developer
Posts: 7,669
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Quote:
No worries about utf-8 vs utf-16, as both encodings can encode every codepoint in the full Unicode range. That is simply not true of any of the single-byte encodings. So somehow you are reading or writing filenames/paths in a Latin-1 encoding. I will take a look at it. KevinH |
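To make the point above concrete, here is a minimal sketch. The sample string is my own and is not taken from the plugin or from anyone's ebook. It shows that utf-8 and utf-16 round-trip Polish text, that Latin-1 cannot encode it, and what utf-8 bytes look like when they are misread as Latin-1.

```python
# Polish sample text (an assumption for illustration; any text outside Latin-1 would do).
text = "Żory, rozdział, źdźbło"

# utf-8 and utf-16 can both represent every Unicode code point,
# so encoding and decoding again is lossless.
for codec in ("utf-8", "utf-16"):
    assert text.encode(codec).decode(codec) == text


def can_encode(s, codec):
    """Return True if every character of s is representable in codec."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


latin1_ok = can_encode(text, "latin-1")  # Latin-1 has no 'Ż', 'ł' or 'ź'
cp1250_ok = can_encode(text, "cp1250")   # Windows-1250 covers Polish

# Typical symptom of the bug described above: utf-8 bytes decoded as Latin-1.
mojibake = "Żory".encode("utf-8").decode("latin-1")
```

Decoding arbitrary bytes as Latin-1 never raises an error, which is why this kind of mix-up tends to show up as garbled names rather than as an exception.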
|
01-15-2017, 09:53 AM | #62 |
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
|
@Doitsu...
I took the liberty of adding a section with Polish characters to your test.epub file for testing. Regards bravosx Last edited by bravosx; 01-15-2017 at 10:15 AM. |
|
01-15-2017, 11:28 AM | #63 | |
Guru
Posts: 696
Karma: 150000
Join Date: Feb 2010
Device: none
|
Quote:
By the way, the novel is copyrighted, but I have permission from the publisher to upload this sample, which contains only the first three chapters. The .html and .odt files contain the frontmatter, which is properly included in the .epub, and the first three chapters, which are not included (although they appear in the toc.ncx and contents.xhtml as they should). Hope this helps. Albert |
|
01-15-2017, 02:01 PM | #64 |
Sigil Developer
Posts: 7,669
Karma: 5433388
Join Date: Nov 2009
Device: many
|
@slowsmile
Quick question: why do you need to use bs4 to convert to utf8 here? Code:
    def convertFile2UTF8(wdir, file, encoder):
        """ Converts input file to utf-8 format """
        print(' -- Convert input file to utf-8 if required')
        original_filename = file
        output = wdir + os.sep + 'fix_encoding.htm'
        outfp = open(output, 'wt', encoding=('utf-8'))
        html = open(file, 'rt', encoding=encoder).read()

        # safely convert to unicode utf-8 using bs4
        soup = BeautifulSoup(html, 'html.parser')
        outfp.writelines(str(soup))
        outfp.close()
        os.remove(file)
        shutil.copy(output, file)
        os.remove(output)
        return(file)
A shorter way to handle this might be to use the built-in text encoding conversion when reading from and writing to files, like so. Code:
    with open(file, 'rt', encoding=encoder) as f1:
        htmldat = f1.read()
    with open(wdir + os.sep + 'fix_encoding.htm', 'wt', encoding='utf-8') as f2:
        f2.write(htmldat)
Or decode and encode the bytes explicitly. Code:
    with open(file, 'rb') as f:
        htmldat = f.read()
    # decode converts bytes to string
    htmlstr = htmldat.decode(encoder)
    # encode converts a string to bytes in that encoding
    with open(file, 'wb') as f:
        f.write(htmlstr.encode('utf-8'))
Just wondering? KevinH |
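For anyone who wants to try the decode/encode route on a real file, here is a small self-contained sketch. The helper name `convert_file_to_utf8`, the sample text and the temporary file are all my own assumptions for illustration and are not part of the plugin.

```python
import os
import tempfile


def convert_file_to_utf8(path, source_encoding):
    """Rewrite the file at path in place as utf-8 (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read()
    text = raw.decode(source_encoding)  # bytes -> str
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))   # str -> utf-8 bytes
    return path


# Usage: write a Windows-1250 file, convert it, then read it back as utf-8.
sample = "<p>Rozdział 1. Żółć</p>"
fd, tmp_path = tempfile.mkstemp(suffix=".htm")
os.close(fd)
with open(tmp_path, "wb") as f:
    f.write(sample.encode("cp1250"))

convert_file_to_utf8(tmp_path, "cp1250")

with open(tmp_path, "rt", encoding="utf-8") as f:
    converted = f.read()
os.remove(tmp_path)
```

Note that this only changes the bytes on disk. Any charset declaration inside the html itself is left untouched and would need to be updated separately.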
01-15-2017, 05:54 PM | #65 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@KevinH...I use BS to ensure that the html file is converted to Unicode UTF8.
I have also just found the problem that was preventing Polish text from displaying properly in the Book Browser and in the Table of Contents. The plugin now displays Polish properly for all headings, both in the Book Browser and in the contents.xhtml. The problem was actually caused by one regex function that I use to clean up heading names. When I removed the regex function everything came right. I'm still testing the plugin with different European language ebooks just to make sure. I will probably release the new version (v0.2.8) sometime today. And thanks also for your advice above. I will store those code bits for later use in my utils library. Last edited by slowsmile; 01-15-2017 at 05:59 PM. |
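The thread does not show the regex that was removed, so the patterns below are purely a hypothetical illustration of how a heading clean-up regex can damage non-ASCII text. The heading and both patterns are assumptions and are not the plugin's code.

```python
import re

# Hypothetical heading with Polish characters.
heading = "Rozdział 1. Żółta łódź!"

# ASCII-only character class: every Polish letter is treated as junk and removed.
ascii_only = re.sub(r"[^A-Za-z0-9 .]", "", heading)

# Unicode-aware alternative: for str patterns in Python 3, \w matches
# letters from any script, so only the punctuation is stripped.
unicode_aware = re.sub(r"[^\w .]", "", heading)
```

With an ASCII-only class the heading loses its accented letters, while the `\w` version keeps them and drops only the exclamation mark.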
|
01-15-2017, 06:24 PM | #66 |
Sigil Developer
Posts: 7,669
Karma: 5433388
Join Date: Nov 2009
Device: many
|
FWIW... The encode or decode routines or file io approaches will accomplish exactly that without requiring a full parse cycle.
Glad to hear you tracked down and fixed the bug. Nicely done! Thanks! |
01-15-2017, 06:36 PM | #67 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@st_albert...I've just had a look at the html for your ebook. The formatting is fine and, as I've said, because you've linked all your text styles to "Text Body" this is why all your own style names are also being displayed in the epub.
When I converted your ebook to epub using the plugin it converted without any problems at all. I did find a problem with the begin read location in the guide section of the content.opf file -- my code could not find "Chapter One". The problem arose because you used chapter headings of the form "Chapter One", "Chapter Two" etc rather than "Chapter 1", "Chapter 2". When I changed your headings to the latter form your Galactic Frontiers epub also passed EpubCheck. I will try to put in a fix to accommodate chapter headings like "Chapter One", "Chapter Two". This will hopefully be done today. I'm currently testing the fix for another problem. The fix for your problem will probably be in v0.2.8. I've also sent you the epub version of your ebook that was converted using my plugin. See below. Last edited by slowsmile; 01-15-2017 at 07:33 PM. |
01-15-2017, 06:59 PM | #68 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@Doitsu & @Kevin...Also grateful for your explanations concerning NTFS UTF16 etc. That one had me gnawing my ankles with frustration...
|
01-15-2017, 08:06 PM | #69 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
To avoid any confusion about how my plugin converts or manipulates the user's inline styles and named text styles in html, I thought I should explain it in more detail.
My plugin is probably unique as a converter in that it reformats or manipulates all html text styles (classes) and in-tag styling on 3 levels:
* If you have linked all your text styles to "Text Body" in OO or LO, then all your named styles will show as classes in the html file. These user-named styles will therefore also be ported to, and show up in, the epub.
* If you have not used named styles in OO or LO, or if your text styles do not inherit "Text Body", then the plugin will use a complex algorithm (yes, the function code is a bit horrifying, but it nevertheless works well) and do its best to determine what your inline text styling does. It will then convert your inline styling to a suitably named text class. There are four core text styles that are used for this in the epub CSS: ebk-centered-text, ebk-blocktext, ebk-text-with-indent and ebk-text-no-indent. This feature also helps to reduce the number of meaningless prefixed/indexed classes (which I have always disliked) in the epub html.
* Any in-tag text styling that cannot be determined will be converted to prefixed/indexed named classes of the form: ebk-5, ebk-12, ebk-23 etc.
In other words, my plugin will try to adjust to the way you have styled your ODT doc and will give you the epub that you deserve. So if you've used named text styles linked with "Text Body" throughout your doc, then your epub html will look good and will be easy to work with. But if you style your doc without using "Text Body" or named styles -- your epub html won't look so good. Like I said, you get what you deserve with this plugin according to your own styling efforts within the ODT doc. I also have to say that I couldn't have achieved the above without bs4 and pytidylib. And regarding html manipulations -- I'm now convinced that you can do anything you like in html using bs4. Anything. Last edited by slowsmile; 01-16-2017 at 01:11 AM. |
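As a rough illustration of the second and third levels described above, here is a sketch of how inline styling might be mapped to the four core class names, with an indexed fallback. The plugin's real algorithm is not shown in this thread, so every rule below (which CSS properties are inspected, the order of the checks, how the index is assigned) is an assumption. Only the class names come from the post.

```python
def parse_inline_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} with lowercased names and values."""
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props


def is_nonzero(value):
    """True for lengths such as '0.5in' or '1em'; False for None, '0in' or 'auto'."""
    if not value:
        return False
    number = "".join(ch for ch in value if ch.isdigit() or ch in ".-")
    try:
        return float(number) != 0
    except ValueError:
        return False


class StyleNamer:
    # Layout properties this sketch knows how to interpret (an assumption).
    CORE_PROPS = {"text-align", "text-indent", "margin-left", "margin-right"}

    def __init__(self):
        self.indexed = {}  # normalised style string -> 'ebk-N'

    def class_for(self, style):
        props = parse_inline_style(style)
        # Styling that cannot be interpreted gets an indexed fallback class.
        if set(props) - self.CORE_PROPS:
            key = ";".join("{}:{}".format(k, v) for k, v in sorted(props.items()))
            if key not in self.indexed:
                self.indexed[key] = "ebk-{}".format(len(self.indexed) + 1)
            return self.indexed[key]
        if props.get("text-align") == "center":
            return "ebk-centered-text"
        if is_nonzero(props.get("margin-left")) and is_nonzero(props.get("margin-right")):
            return "ebk-blocktext"
        if is_nonzero(props.get("text-indent")):
            return "ebk-text-with-indent"
        return "ebk-text-no-indent"


namer = StyleNamer()
centered = namer.class_for("text-align: center")
fallback = namer.class_for("color: red; font-size: 12pt")
```

Normalising the style string before indexing means that two tags with the same declarations in a different order share one `ebk-N` class, which keeps the number of indexed classes down.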
01-15-2017, 11:00 PM | #70 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@bravosx...Your problem has now been fixed in v0.2.8, which has just been released. Your ebook now displays correct Polish in both the Book Browser and in the Table of Contents (contents.xhtml).
When you run EpubCheck you might also get this warning:
WARNING(PKG-012): File name contains the following non-ascii characters: ?. Consider changing the filename.
This is only a warning -- not an error. It can be ignored and will not stop you uploading your ebook to Kindle or other epub vendors (I have also tested this). The warning probably appears because EpubCheck does not use utf8 to check internal epub file names. Last edited by slowsmile; 01-16-2017 at 02:23 AM. |
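If you want to see in advance which file names are likely to draw that warning, a rough check is to look for any character outside the ASCII range. This is only an approximation written for illustration. It does not reproduce EpubCheck's own logic, and the file names below are made up.

```python
def non_ascii_chars(filename):
    """Return the non-ASCII characters in filename, in order, without duplicates."""
    seen = []
    for ch in filename:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return seen


# Hypothetical names as they might appear in the Book Browser's Text folder.
names = ["Text/prolog.xhtml", "Text/Rozdział_1.xhtml", "Text/Żółw.xhtml"]

# Keep only the names that contain at least one non-ASCII character.
flagged = {name: non_ascii_chars(name) for name in names if non_ascii_chars(name)}
```

Renaming the flagged sections so that they contain only ASCII characters should make the warning go away.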
01-15-2017, 11:06 PM | #71 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@st_albert...I've also fixed the problem concerning your use of "Chapter One", "Chapter Two" etc and the begin read guides problem. Begin read location will now accept: "Chapter 1" or "Chapter One" or "1". This fix is in v0.2.8 which has just been released. See Changes in the release notes for more details.
Last edited by slowsmile; 01-15-2017 at 11:10 PM. |
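For the curious, matching the three accepted heading forms could look something like the sketch below. This is not the plugin's actual code. The regex, the function names and the sample headings are assumptions, and only the accepted forms "Chapter 1", "Chapter One" and "1" come from the post.

```python
import re

# Accept "Chapter 1", "Chapter One" or a bare "1", ignoring case and outer whitespace.
FIRST_CHAPTER = re.compile(r"^\s*(?:chapter\s+(?:1|one)|1)\s*$", re.IGNORECASE)


def is_first_chapter(heading):
    """True if heading looks like the first chapter heading."""
    return bool(FIRST_CHAPTER.match(heading))


def find_begin_read(headings):
    """Return the index of the first chapter heading, or None if there is none."""
    for index, heading in enumerate(headings):
        if is_first_chapter(heading):
            return index
    return None


# Hypothetical list of headings in reading order.
toc = ["Title Page", "Copyright", "Chapter One", "Chapter Two"]
begin_read_index = find_begin_read(toc)
```

Anchoring the pattern at both ends keeps "Chapter 10" or "Chapter Two" from being mistaken for the begin read location.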
01-16-2017, 08:26 AM | #72 | |
Connoisseur
Posts: 99
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
|
@slowsmile...
Quote:
Now to the heart of the matter. I ran a set of trials with a larger text, saving it in LibreOffice as HTML Document (Writer) with each of the character sets available in the Character set dropdown under Tools > Options > Load/Save > HTML Compatibility:
1) Central European (Windows-1250/WinLatin2)
2) Western European (Windows-1252/WinLatin1)
3) UNICODE (UTF-8)
The text contains:
- a prologue (prolog. text with Polish characters),
- twelve chapters (each named according to the pattern: 1. text with Polish characters, 2. text with Polish characters, etc.),
- an epilogue (epilog. text with Polish characters).
In the Text folder of the Book Browser, Polish characters are displayed correctly in all sections. However, the Table Of Contents window displays only the word prolog, the numbers 1 to 12, and the word epilog (all without the additional text). I then added the word Rozdział to the chapter names in the .odt file (for example, Rozdział 1. additional text), saved it as html, and imported it into Sigil using the plugin. The Table Of Contents then displayed the words Prolog, Rozdział 1-12 and Epilog (again without the additional text). But this is not a problem, because after pressing Ctrl+T and confirming with OK, the TOC entries are fixed and saved, and display the Polish characters correctly.
After running EpubCheck I received this warning:
WARNING(PKG-012): File name contains the following non-ascii characters: ?. Consider changing the filename.
But as you write in post #70, this is not a problem. You can get rid of this message by renaming the sections in the Text folder of the Book Browser so that the names contain no Polish characters.
I can't post the source material or the converted html and epub files on the forum, because I'm not sure about the copyright status of the text. Alternatively, I can send them by e-mail for inspection.
To summarize my lengthy text, I can confirm that the plugin works properly for my Polish-language documents. Once again, sorry for my very poor English, and a big thanks.
Greetings to all the members of this forum. bravosx |
|
01-16-2017, 01:09 PM | #73 |
Guru
Posts: 696
Karma: 150000
Join Date: Feb 2010
Device: none
|
@slowsmile
Thanks for your efforts. However, on my Linux box (Kubuntu 16.04.1, xenial) I'm still getting the same problem. Note that I am not using "bundled python" on this OS. The Python version on this machine is 3.5.2 (default, Nov 17 2016, 17:05:23). Thinking it might be due to an OS problem, I installed Sigil 0.9.7 on a Win-10 x64 machine, using bundled Python. The testplugin ran successfully, but the OpenDoc import plugin failed with the following log: Spoiler:
The tidy.dll library exists in the path shown above. This happened with both versions 0.2.7 and 0.2.8 of the plugin. What OS are you using? Albert |
01-16-2017, 01:48 PM | #74 | |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
OTOH, I successfully tested the new 0.2.8 version on my Windows machine. Did you install the official Sigil release or a portable version? @slowsmile: Unless you have access to a Linux machine, you might want to remove Linux from the list of supported operating systems in plugin.xml. |
|
01-16-2017, 02:15 PM | #75 | |
Guru
Posts: 696
Karma: 150000
Join Date: Feb 2010
Device: none
|
Quote:
As for the Win-10 version, I got the install file directly from Sigil-Ebook on Github. It installed smoothly, and even installed the MS runtime stuff with no problem. And, as I said, it passed the testplugin (ver 0.13 IIRC). Albert |
|