Old 10-26-2009, 06:35 AM   #1
gsz
Junior Member
gsz began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
'utf8' codec can't decode bytes error (HTML to EPUB conversion)

Hello,

I'm using Calibre 0.6.19 to convert a bunch of HTML (a CHM originally) to EPUB and I'm getting this error:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in ch14lev1sec2.html...
No large trees found
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/P\xf7/\x04:h2', 9, 13, 'invalid data') (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-12: invalid data

Any idea how I can fix this? I can't even figure out where to look for the invalid data. Would it be somewhere in ch14lev1sec2.html, since that's the last file listed? The header in that file declares ISO-8859-1, not UTF-8...
Old 10-26-2009, 11:02 AM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Yes, that would be the first place to look.
Old 10-26-2009, 01:20 PM   #3
gsz
Junior Member
gsz began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
Hello Kovid,

I appreciate your response, and all the work you put into this application.

I did a couple of things to determine whether ch14lev1sec2.html was the culprit. I converted it to ASCII and compared the result with the original: no difference. I did the same with the two CSS files it refers to: again no difference. So I'm pretty sure this file is ISO-8859-1 (in fact, it's plain ASCII).
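The reasoning behind that check can be sketched in a few lines of Python: a file that contains only 7-bit ASCII bytes decodes to the same text under ASCII, ISO-8859-1 and UTF-8, so the declared charset cannot matter for such a file. This is only an illustration; the helper name `decodes_identically` is made up here and is not part of calibre.

```python
def decodes_identically(data, encodings=("ascii", "iso-8859-1", "utf-8")):
    """Return True if `data` decodes to the same text under every encoding.

    `decodes_identically` is a hypothetical helper written for this
    illustration; it is not a calibre function.
    """
    results = set()
    for enc in encodings:
        try:
            results.add(data.decode(enc))
        except UnicodeDecodeError:
            # At least one encoding rejects the bytes, so they are not
            # interchangeable.
            return False
    return len(results) == 1


# Pure ASCII markup: the declared charset makes no difference.
print(decodes_identically(b"<h2>Chapter 1. Introduction</h2>"))
# A Latin-1 byte (0xE9) is neither ASCII nor valid UTF-8 on its own.
print(decodes_identically(b"caf\xe9"))
```

If this returns True for every HTML and CSS file, the invalid bytes in the error are unlikely to come straight from the file contents.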

I then replaced it with an empty html file (nothing but an empty head and body), and sure enough the problem occurred elsewhere:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in ch14lev1sec2.html...
No large trees found
Looking for large trees in ch11lev1sec8.html...
No large trees found
Looking for large trees in ch04lev1sec4.html...
No large trees found
Looking for large trees in app03lev1sec3.html...
No large trees found
Splitting on page-break
Looking for large trees in F.html...
No large trees found
Split into 2 parts
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/\x18\x81\x8d\x03:h2', 9, 10, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 9: unexpected code byte

I then tried a "binary search" on the file: I removed the first half and ran the conversion, then removed the second half and ran it again. This strategy failed, because the conversion stopped with the above error in both cases (that is, it failed while processing a different file, F.html).
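For reference, the halving strategy can be written down as a small generic routine. This is a sketch under two assumptions that did not hold here: the failure is deterministic, and it is caused by exactly one item. The names `bisect_culprit` and `fails` are hypothetical; `fails` stands for "run the conversion on this subset and report whether it crashed".

```python
from typing import Callable, List, Sequence


def bisect_culprit(items: Sequence[str],
                   fails: Callable[[List[str]], bool]) -> str:
    """Narrow a failing set down to the single item that triggers the failure.

    Assumes one culprit and a deterministic `fails` predicate (both are
    assumptions of this sketch, not facts about calibre).
    """
    candidates = list(items)
    if not candidates or not fails(candidates):
        raise ValueError("the full set does not reproduce the failure")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        first_fails, second_fails = fails(first), fails(second)
        if first_fails == second_fails:
            # Both halves fail, or neither does: the failure is not tied
            # to one item, so halving cannot localize it.
            raise RuntimeError("failure is not localized to a single item")
        candidates = first if first_fails else second
    return candidates[0]


# Toy predicate: the "conversion" fails whenever d.html is in the subset.
files = ["a.html", "b.html", "c.html", "d.html", "e.html"]
print(bisect_culprit(files, lambda subset: "d.html" in subset))
```

When both halves fail, as happened above, the routine raises instead of guessing, which is itself a hint that the problem is not in any one piece of the input.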

At this point I put back the original HTML and reran the conversion, which failed again at the same location, that is, at F.html!

Then I deleted all the HTML files, re-extracted them from the CHM, and ran the conversion again without changing anything, just to see what would happen. This time it failed at yet another location:

Creating EPUB Output...
Looking for large trees in ch06lev1sec8.html...
No large trees found
Looking for large trees in fm01lev1sec1.html...
No large trees found
Looking for large trees in fm01lev1sec7.html...
No large trees found
Looking for large trees in ch11lev1sec8.html...
No large trees found
Looking for large trees in ch07lev1sec2.html...
No large trees found
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/\x90\x1fl\x04:h2', 8, 9, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 8: unexpected code byte

I can see it hadn't processed F.html or ch14lev1sec2.html this time, which suggests that the order in which these files are processed is somewhat random, or that it depends on the order in which they appear in the directory on the file system (effectively random). That makes pinpointing the error more complicated.

Do you have any suggestions for what I should try next?
Old 10-26-2009, 01:27 PM   #4
kovidgoyal
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The clue is this line

('utf8', '/*/*[2]/\x90\x1fl\x04:h2', 8, 9, 'unexpected code byte') (Error Code: 1)

That means there is some file in which the tags belong to a non-ASCII XML namespace.
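To see what the tuple in that line encodes: its second element is the byte string that could not be decoded, here an element path ending in `:h2`, and the numbers are the offsets of the offending bytes. As an illustration, assuming Python 3 and only the standard library, the three path strings reported in this thread can be fed back to the UTF-8 decoder. The helper name `first_invalid_utf8` is made up for this sketch, and the wording of the error message differs between Python versions.

```python
from typing import Optional


def first_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Byte strings copied from the error lines in the posts above.
reported_paths = [
    b"/*/*[2]/P\xf7/\x04:h2",
    b"/*/*[2]/\x18\x81\x8d\x03:h2",
    b"/*/*[2]/\x90\x1fl\x04:h2",
]

for path in reported_paths:
    print(first_invalid_utf8(path), repr(path))
```

The offsets come out as 9, 9 and 8, matching the positions in the three tracebacks. In each case the bad bytes sit where a namespace prefix would be, directly before `:h2`.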
Old 10-26-2009, 03:03 PM   #5
gsz
Junior Member
gsz began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
I verified the following:
- all files are 7-bit ASCII (there isn't a single byte anywhere in these files > 0x7f), except the images
- the following strings cannot be found anywhere in these files: "xmlns", ":h2"
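Both checks can be automated along these lines. This is a rough sketch rather than the exact procedure used: the function names, the list of file extensions, and the directory name in the commented usage line are all assumptions.

```python
import os
from typing import Iterator, Optional, Tuple


def first_non_ascii(data: bytes) -> Optional[int]:
    """Return the offset of the first byte above 0x7F, or None if pure ASCII."""
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            return offset
    return None


def scan_tree(root: str,
              needles=(b"xmlns", b":h2"),
              extensions=(".html", ".htm", ".css")) -> Iterator[Tuple[str, str]]:
    """Yield (path, finding) for every text file under `root` that contains
    a non-ASCII byte or one of the byte strings in `needles`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue  # skip images and other binary files
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()
            offset = first_non_ascii(data)
            if offset is not None:
                yield path, "non-ASCII byte at offset %d" % offset
            for needle in needles:
                if needle in data:
                    yield path, "contains %r" % needle


# Hypothetical usage on a directory of extracted CHM files:
# for path, finding in scan_tree("extracted_chm"):
#     print(path, finding)
```

An empty result from `scan_tree` corresponds to the two findings listed above.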

Any other suggestions?
Old 10-26-2009, 03:14 PM   #6
kovidgoyal
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Search for the string xmlns.
Old 10-26-2009, 03:18 PM   #7
gsz
Junior Member
gsz began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
As I mentioned above, I have already done that and there was no match.
Old 10-26-2009, 03:20 PM   #8
kovidgoyal
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Sorry, I missed that. I'm at a loss; please open a ticket and attach your HTML.
Old 10-26-2009, 04:32 PM   #9
gsz
Junior Member
gsz began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
I spent a bit more time narrowing this down and here's what I came up with:

The entire content is a single HTML file (removed everything else):

--------------
<html>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<head>
<title>Chapter 1. Introduction</title>
</head>
<body>
<table width="100%" border="0" cellspacing="0" cellpadding="0"><TR><td valign="top"><a name="ch01"></a><h2>Chapter 1. Introduction</H2>
</td></tr></table>
</body></html>
--------------

If I try to convert this with "Linearize tables" turned on, it fails:

--------------
Creating EPUB Output...
Splitting on page-break
Splitting on page-break
Python function terminated unexpectedly
('utf8', '/*/*[2]/ha\xb6\x02:h2', 10, 11, 'unexpected code byte') (Error Code: 1)
Traceback (most recent call last):
File "site.py", line 103, in main
File "site.py", line 85, in run_entry_point
File "site-packages\calibre\utils\ipc\worker.py", line 90, in main
File "site-packages\calibre\gui2\convert\gui_conversion.py", line 19, in gui_convert
File "site-packages\calibre\ebooks\conversion\plumber.py", line 827, in run
File "site-packages\calibre\ebooks\epub\output.py", line 162, in convert
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 56, in __call__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 66, in split_item
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 175, in __init__
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 215, in split_on_page_breaks
File "site-packages\calibre\ebooks\oeb\transforms\split.py", line 283, in do_split
File "lxml.etree.pyx", line 1621, in lxml.etree._ElementTree.getpath (src/lxml/lxml.etree.c:17041)
File "apihelpers.pxi", line 1130, in lxml.etree.funicode (src/lxml/lxml.etree.c:36925)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb6 in position 10: unexpected code byte
--------------

If I turn "Linearize tables" off, it works!

I hope this helps...
Old 10-26-2009, 04:42 PM   #10
kovidgoyal
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
That fragment works for me with or without Linearize tables. I'm guessing this is a Windows-specific issue. Open a ticket, and the next time I'm dealing with Windows problems, I'll have a look.
Old 10-26-2009, 06:29 PM   #11
gsz
Junior Member
gsz began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Oct 2009
Device: Sony PRS-505
I opened a ticket. I also noticed that conversion to LRF works OK.

Thanks for your help.