'utf8' codec can't decode bytes error (HTML to EPUB conversion)

gsz · 10-26-2009, 06:35 AM

Hello,

I'm using Calibre 0.6.19 to convert a bunch of HTML (a CHM originally) to EPUB and I'm getting this error:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in ch14lev1sec2.html...
No large trees found
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/P\xf7/\x04:h2', 9, 13, 'invalid data') (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-12: invalid data

Any idea how I can fix this? I can't even figure out where to look for this invalid data. Would it be somewhere in ch14lev1sec2.html, since that's the last file listed? The header in that file says it's 8859-1, not utf-8...

kovidgoyal · 10-26-2009, 11:02 AM

yeah that would be the first place to look.

gsz · 10-26-2009, 01:20 PM

Hello Kovid,

I appreciate your response, and all the work you put into this application.

I did a couple of things to determine whether ch14lev1sec2.html was the culprit. I converted it to ASCII and compared the result with the original (no difference), and I did the same with the two CSS files it refers to (no difference). So I'm pretty sure this file is valid ISO-8859-1 (in fact, it's plain ASCII).

I then replaced it with an empty html file (nothing but an empty head and body), and sure enough the problem occurred elsewhere:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in ch14lev1sec2.html...
No large trees found
Looking for large trees in ch11lev1sec8.html...
No large trees found
Looking for large trees in ch04lev1sec4.html...
No large trees found
Looking for large trees in app03lev1sec3.html...
No large trees found
Splitting on page-break
Looking for large trees in F.html...
No large trees found
Split into 2 parts
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/\x18\x81\x8d\x03:h2', 9, 10, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 9: unexpected code byte

Then I thought I would do a "binary search" on the file: I removed the first half and ran the conversion, then removed the second half and ran it again. However, this strategy failed, because the conversion failed with the above error in both cases (that is, it failed while processing a different file, F.html).

At this point I put back the original HTML and reran the conversion, which failed again at the same location, that is, at F.html!

Then I deleted all the HTML files, re-extracted them from the CHM, and ran the conversion again without changing anything, just to see what would happen. Now it failed at yet another location:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in fm01lev1sec7.html...
No large trees found
Looking for large trees in ch11lev1sec8.html...
No large trees found
Looking for large trees in ch07lev1sec2.html...
No large trees found
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/\x90\x1fl\x04:h2', 8, 9, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 8: unexpected code byte

I can see it hadn't processed F.html or ch14lev1sec2.html this time, which suggests that the order in which these files are processed is somewhat random, or that it depends on the order in which they appear in the directory on the file system (effectively random). That makes pinpointing the error a bit more complicated.

So I thought I'd ask: do you have any suggestions for what I should try next?

kovidgoyal · 10-26-2009, 01:27 PM

The clue is this line

('utf8', '/*/*[2]/\x90\x1fl\x04:h2', 8, 9, 'unexpected code byte') (Error Code: 1)

That means that there is some file where the tags belong to a non-ASCII XML namespace.
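For readers following along: the tuple in that line looks like the argument list of the UnicodeDecodeError itself, that is, the codec, the byte string being decoded, the start and end offsets, and the reason. Below is a minimal sketch in Python 3 (the thread used the Python 2 bundled with Calibre 0.6.19, so the reason text differs) that recovers the same offsets from the byte string quoted above. The helper name `locate_bad_bytes` is made up for illustration.

```python
# The byte string copied from the error line quoted above.
path = b'/*/*[2]/\x90\x1fl\x04:h2'

def locate_bad_bytes(data):
    """Return (start, end, reason) of the first UTF-8 decoding failure, or None."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        return err.start, err.end, err.reason
    return None

start, end, reason = locate_bad_bytes(path)
print(start, end, reason)
print(path[:start])   # the part that decoded cleanly
print(path[start:])   # the undecodable remainder
```

The clean part is the element path `/*/*[2]/`, and the undecodable bytes sit where a prefix would appear in front of `:h2`, which seems to be why the namespace is the suspect.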

gsz · 10-26-2009, 03:03 PM

I verified the following:
- all files are 7-bit ASCII (there isn't a single byte anywhere in these files > 0x7f), except the images
- the following strings cannot be found anywhere in these files: "xmlns", ":h2"
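The two checks above could be scripted roughly as follows. This is only a sketch, not the method actually used in the thread: the function name `scan_tree`, the extension list, and the result format are assumptions for illustration.

```python
import os

def scan_tree(root, needles=(b'xmlns', b':h2'), extensions=('.html', '.htm', '.css')):
    """Report non-ASCII bytes and occurrences of the given byte strings under root."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue  # skip images and other binary files
            full = os.path.join(dirpath, name)
            with open(full, 'rb') as fh:
                data = fh.read()
            # Check 1: any byte above 0x7f means the file is not 7-bit ASCII.
            for offset, byte in enumerate(data):
                if byte > 0x7F:
                    findings.append((full, 'non-ascii', offset))
                    break  # the first offending byte per file is enough
            # Check 2: look for the suspicious strings as raw bytes.
            for needle in needles:
                offset = data.find(needle)
                if offset != -1:
                    findings.append((full, needle.decode('ascii'), offset))
    return findings
```

Calling `scan_tree` on the directory holding the extracted HTML returns a list of `(path, kind, offset)` tuples; an empty list corresponds to the result reported above.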

Any other suggestion perhaps?

kovidgoyal · 10-26-2009, 03:14 PM

search for the string xmlns

gsz · 10-26-2009, 03:18 PM

As I mentioned above, I have already done that and there was no match.

kovidgoyal · 10-26-2009, 03:20 PM

Sorry, I missed that. I'm at a loss; open a ticket and attach your HTML.

gsz · 10-26-2009, 04:32 PM

I spent a bit more time narrowing this down and here's what I came up with:

The entire input is now a single HTML file (I removed everything else):

--------------
<html>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<head>
<title>Chapter 1. Introduction</title>
</head>
<body>
<table width="100%" border="0" cellspacing="0" cellpadding="0"><TR><td valign="top"><a name="ch01"></a><h2>Chapter 1. Introduction</H2>
</td></tr></table>
</body></html>
--------------

If I try to convert this with "Linearize tables" turned on, it fails:

--------------
Creating EPUB Output...
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/ha\xb6\x02:h2', 10, 11, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb6 in position 10: unexpected code byte
--------------

If I turn "Linearize tables" off, it works!
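For anyone who wants to repeat this comparison outside the GUI, here is a rough sketch that drives calibre's `ebook-convert` command-line tool from Python. It assumes `ebook-convert` is on the PATH and that the installed version accepts `--linearize-tables` as the equivalent of the GUI checkbox; the helper names and output file names are made up for illustration.

```python
import subprocess

def build_commands(source, out_prefix):
    """Return the two ebook-convert command lines: tables linearized, and not."""
    base = ['ebook-convert', source]
    return {
        'linearized': base + [out_prefix + '-linearized.epub', '--linearize-tables'],
        'plain': base + [out_prefix + '-plain.epub'],
    }

def run_comparison(source, out_prefix):
    """Run both conversions and return each one's exit code (0 means success)."""
    results = {}
    for label, cmd in build_commands(source, out_prefix).items():
        results[label] = subprocess.run(cmd).returncode
    return results
```

If the behaviour matches the report above, `run_comparison` on the test file would give a non-zero exit code for `linearized` and zero for `plain` on an affected machine.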

I hope this helps...

kovidgoyal · 10-26-2009, 04:42 PM

That fragment works for me with or without linearize tables. I'm guessing this is a Windows-specific issue. Open a ticket; the next time I'm dealing with Windows problems, I'll have a look.

gsz · 10-26-2009, 06:29 PM

I opened a ticket. I also noticed that conversion to LRF works OK.

Thanks for your help.
