10-26-2009, 06:35 AM | #1 |
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
|
'utf8' codec can't decode bytes error (HTML to EPUB conversion)
Hello,
I'm using Calibre 0.6.19 to convert a bunch of HTML (a CHM originally) to EPUB and I'm getting this error:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in ch14lev1sec2.html...
No large trees found
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/P\xf7/\x04:h2', 9, 13, 'invalid data') (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
  File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
  File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
  File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-12: invalid data

Any idea how I can fix this? I can't even figure out where to look for this invalid data. Would it be somewhere in ch14lev1sec2.html, since that's the last file listed? The header in that file says it's 8859-1, not utf-8... |
10-26-2009, 11:02 AM | #2 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
yeah that would be the first place to look.
|
10-26-2009, 01:20 PM | #3 |
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
|
Hello Kovid,
I appreciate your response, and all the work you put into this application. I did a couple of things to determine if ch14lev1sec2.html was the culprit. I converted it to ASCII and compared it with the original (no difference between the converted file and the original), and I did the same with the two CSS files it refers to (no difference). So I'm pretty sure this file is ISO-8859-1 (in fact, it's ASCII). I then replaced it with an empty HTML file (nothing but an empty head and body), and sure enough the problem occurred elsewhere:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in ch14lev1sec2.html...
No large trees found
Looking for large trees in ch11lev1sec8.html...
No large trees found
Looking for large trees in ch04lev1sec4.html...
No large trees found
Looking for large trees in app03lev1sec3.html...
No large trees found
Splitting on page-break
Looking for large trees in F.html...
No large trees found
Split into 2 parts
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/\x18\x81\x8d\x03:h2', 9, 10, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
  File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
  File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
  File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 9: unexpected code byte

Then I thought I would do a "binary search" on the file: I removed the first half and ran the conversion, then removed the second half and ran the conversion. However, this strategy failed, because the conversion failed with the above error in both cases (that is, it failed while processing a different file, F.html). At this point I put back the original HTML and reran the conversion, which failed again at the above location, that is, at F.html! Then I deleted all the HTML files, re-extracted them from the CHM and ran the conversion again without changing anything, just to see what would happen, and now it failed at yet another location:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in fm01lev1sec7.html...
No large trees found
Looking for large trees in ch11lev1sec8.html...
No large trees found
Looking for large trees in ch07lev1sec2.html...
No large trees found
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/\x90\x1fl\x04:h2', 8, 9, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
  File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
  File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
  File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 8: unexpected code byte

I can see it hadn't processed F.html or ch14lev1sec2.html this time, which suggests that the order in which these files are processed is somewhat random, or that it depends on the order in which they appear in the directory on the file system (effectively random). That makes troubleshooting or pinpointing the error a bit more complicated.
So I thought I'd ask whether you have any suggestion for what I should try next. |
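For reference, the "binary search" idea described in this post can be sketched as below. This is only an illustration: `fails` is a hypothetical callback standing in for "run the conversion on this subset and see whether it crashes", and the sketch assumes a single culprit and a failure that is reproducible from run to run, which is exactly the assumption that did not hold in the runs reported above.

```python
def bisect_culprit(items, fails):
    """Narrow `items` down to the one element that makes `fails(subset)` true.

    Assumes exactly one culprit and a deterministic `fails` callback
    (both are assumptions of this sketch, not facts from the thread).
    Returns None when the search cannot converge.
    """
    candidates = list(items)
    if not fails(candidates):
        return None  # nothing fails, so there is nothing to find
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        if fails(first):
            candidates = first
        elif fails(second):
            candidates = second
        else:
            # Neither half fails on its own: the failure needs items from
            # both halves, or it is not deterministic.
            return None
    return candidates[0]


if __name__ == "__main__":
    files = ["ch06lev1sec8.html", "fm01lev1sec1.html", "F.html", "ch14lev1sec2.html"]
    # Stand-in predicate for the demo; a real one would invoke the converter.
    print(bisect_culprit(files, lambda subset: "F.html" in subset))
```

If both halves fail, as happened here, the first failing half is followed, so a failure that moves between runs will lead the search to an arbitrary file rather than a real cause.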
10-26-2009, 01:27 PM | #4 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The clue is this line:
('utf8', '/*/*[2]/\x90\x1fl\x04:h2', 8, 9, 'unexpected code byte') (Error Code: 1)
That means that there is some file where the tags belong to a non-ASCII XML namespace. |
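The tuple in that line has the shape of the arguments of Python's `UnicodeDecodeError`: codec name, the byte string being decoded, start and end offsets of the undecodable bytes, and a reason. A minimal sketch of how to read it, assuming a current Python 3 interpreter rather than the Python bundled with Calibre 0.6.19 (the reason text and, for some inputs, the reported span differ between Python versions); `describe_decode_error` is a helper name made up for this example:

```python
# Byte string copied from the error tuple quoted above.
PATH_BYTES = b'/*/*[2]/\x90\x1fl\x04:h2'


def describe_decode_error(data, codec='utf-8'):
    """Return (start, end, bad_bytes) for the first undecodable run, or None."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as err:
        # err.object is the full input; start/end delimit the offending bytes.
        return err.start, err.end, err.object[err.start:err.end]
    return None


if __name__ == "__main__":
    print(describe_decode_error(PATH_BYTES))
```

The offsets point into the path string itself, at the segment just before `:h2`, which is where a namespace prefix would appear in the path returned by `getpath`.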
10-26-2009, 03:03 PM | #5 |
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
|
I verified the following:
- all files are 7-bit ASCII (there isn't a single byte > 0x7f anywhere in these files), except the images
- the following strings cannot be found anywhere in these files: "xmlns", ":h2"
Any other suggestion perhaps? |
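A check along these lines can be scripted. The following is a rough sketch, not the procedure actually used in this post: the function names and the list of image extensions are assumptions made for the example, and it simply reports any non-image file that contains a byte above 0x7f or one of the two search strings.

```python
import os

# Extensions treated as images and skipped (an assumption; adjust as needed).
IMAGE_EXTS = ('.png', '.gif', '.jpg', '.jpeg', '.bmp')


def first_non_ascii(data):
    """Return the offset of the first byte > 0x7f in `data`, or None."""
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            return offset
    return None


def scan_tree(root, needles=(b'xmlns', b':h2')):
    """Map each suspicious file path to (first_non_ascii_offset, needles_found)."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(IMAGE_EXTS):
                continue
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as fh:
                data = fh.read()
            offset = first_non_ascii(data)
            hits = [n for n in needles if n in data]
            if offset is not None or hits:
                report[path] = (offset, hits)
    return report


if __name__ == "__main__":
    # "." is a placeholder for the directory holding the extracted HTML.
    for path, (offset, hits) in sorted(scan_tree(".").items()):
        print(path, offset, hits)
```

An empty report corresponds to the result described above: pure 7-bit ASCII input with no `xmlns` or `:h2` anywhere.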
10-26-2009, 03:14 PM | #6 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
search for the string xmlns
|
10-26-2009, 03:18 PM | #7 |
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
|
As I mentioned above, I have already done that and there was no match.
|
10-26-2009, 03:20 PM | #8 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
sorry missed that. I'm at a loss, open a ticket and attach your html.
|
10-26-2009, 04:32 PM | #9 |
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
|
I spent a bit more time narrowing this down and here's what I came up with:
The entire content is a single HTML file (removed everything else):
--------------
<html>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<head>
<title>Chapter 1. Introduction</title>
</head>
<body>
<table width="100%" border="0" cellspacing="0" cellpadding="0"><TR><td valign="top"><a name="ch01"></a><h2>Chapter 1. Introduction</H2>
</td></tr></table>
</body></html>
--------------
If I try to convert this with "Linearize tables" turned on, it fails:
--------------
Creating EPUB Output...
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/ha\xb6\x02:h2', 10, 11, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
  File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
  File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
  File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
  File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb6 in position 10: unexpected code byte
--------------
If I turn "Linearize tables" off, it works! I hope this helps... |
10-26-2009, 04:42 PM | #10 |
creator of calibre
Posts: 44,356
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That fragment works for me with or without linearize tables. I'm guessing this is a Windows-specific issue. Open a ticket, and the next time I'm dealing with Windows problems, I'll have a look.
|
10-26-2009, 06:29 PM | #11 |
Junior Member
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
|
I opened a ticket. I also noticed that conversion to lrf works OK.
Thanks for your help. |
|