07-31-2012, 02:43 PM | #1 |
onlinenewsreader.net
Posts: 324
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
|
lxml.etree._utf8 crash
This is probably a question for Kovid.
I'm getting a trap in lxml.etree._utf8 with the message "ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters". With recursions=0 and simultaneous downloads=1 this crashes ebook-convert with the following traceback:
Code:
Python function terminated unexpectedly
All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 132, in main
  File "site.py", line 109, in run_entry_point
  File "site-packages\calibre\ebooks\conversion\cli.py", line 325, in main
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 979, in run
  File "site-packages\calibre\customize\conversion.py", line 208, in __call__
  File "site-packages\calibre\ebooks\conversion\plugins\recipe_input.py", line 105, in convert
  File "site-packages\calibre\web\feeds\news.py", line 881, in download
  File "site-packages\calibre\web\feeds\news.py", line 1130, in build_index
  File "site-packages\calibre\web\feeds\news.py", line 974, in feed2index
  File "site-packages\calibre\web\feeds\templates.py", line 43, in generate
  File "site-packages\calibre\web\feeds\templates.py", line 177, in _generate
  File "site-packages\lxml\builder.py", line 222, in __call__
  File "site-packages\lxml\builder.py", line 185, in add_text
  File "lxml.etree.pyx", line 916, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:36134)
  File "apihelpers.pxi", line 721, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:17141)
  File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22211)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
Code:
Parsing feed_1/article_4/index.html as HTML
HTML 5 parsing failed, falling back to older parsers
Traceback (most recent call last):
  File "site-packages\calibre\ebooks\oeb\parse_utils.py", line 259, in parse_html
  File "site-packages\calibre\ebooks\oeb\parse_utils.py", line 86, in html5_parse
  File "site-packages\html5lib\html5parser.py", line 38, in parse
  File "site-packages\html5lib\html5parser.py", line 211, in parse
  File "site-packages\html5lib\html5parser.py", line 111, in _parse
  File "site-packages\html5lib\html5parser.py", line 179, in mainLoop
  File "site-packages\html5lib\html5parser.py", line 447, in processStartTag
  File "site-packages\html5lib\html5parser.py", line 725, in startTagMeta
  File "site-packages\html5lib\treebuilders\_base.py", line 259, in insertElementNormal
  File "site-packages\html5lib\treebuilders\etree_lxml.py", line 219, in _setAttributes
  File "site-packages\html5lib\treebuilders\etree_lxml.py", line 189, in __init__
  File "lxml.etree.pyx", line 2145, in lxml.etree._Attrib.__setitem__ (src/lxml/lxml.etree.c:46818)
  File "apihelpers.pxi", line 563, in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:15781)
  File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22211)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I've looked at the calibre source at http://bazaar.launchpad.net/~kovid/calibre/trunk/files and the line numbers in the tracebacks don't seem to line up so I'm at a loss here. My question: what is causing this and could calibre be made a little more bulletproof here? |
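For anyone trying to track down which string triggers this error, here is a minimal diagnostic sketch. It assumes the rule lxml enforces is close to the XML 1.0 character range; the exact check inside lxml.etree._utf8 may differ between versions. The function name find_xml_incompatible is made up for illustration and is not part of calibre or lxml.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Assumption: this approximates what lxml rejects.
_XML_INCOMPATIBLE = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_xml_incompatible(text):
    """Return (index, code point) pairs for characters XML 1.0 disallows."""
    return [
        (match.start(), "U+%04X" % ord(match.group()))
        for match in _XML_INCOMPATIBLE.finditer(text)
    ]


# A NUL byte and a vertical tab hidden in an otherwise ordinary string.
print(find_xml_incompatible("Rain\x00 in Spain\x0b"))
```

Running this over an article's title and description before they reach the index template should show whether stray control characters are the culprit.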
07-31-2012, 03:22 PM | #2 |
creator of calibre
Posts: 43,975
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Your second traceback does not indicate that anything crashed, just that parsing with the HTML 5 parser failed, in which case calibre falls back to using other parsers.
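The fallback behaviour described here can be sketched roughly as follows. This is not calibre's actual parse_html code; parse_with_fallback and the two stand-in parsers are hypothetical names used only to illustrate the try-the-next-parser pattern.

```python
def parse_with_fallback(data, parsers, log=print):
    """Try each (name, parser) pair in order; return the first success."""
    last_error = None
    for name, parser in parsers:
        try:
            return name, parser(data)
        except Exception as err:  # a failed parser is logged, not fatal
            log("%s failed (%s), falling back" % (name, err))
            last_error = err
    raise ValueError("all parsers failed") from last_error


# Hypothetical stand-ins for a strict parser and a more forgiving one.
def strict_parser(data):
    raise ValueError("All strings must be XML compatible")


def lenient_parser(data):
    return data.strip()


name, tree = parse_with_fallback(
    "  <p>hi</p> ",
    [("strict", strict_parser), ("lenient", lenient_parser)],
)
```

The logged traceback from the first parser is informational in this pattern; only the final ValueError, raised when every parser has failed, would stop the run.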
|
07-31-2012, 03:44 PM | #3 |
onlinenewsreader.net
|
What about the first traceback? It's the same error and traceback and ebook-convert crashes. I use recursions=0 and simultaneous_downloads=1 for recipe debugging purposes and this crash makes things very difficult.
|
07-31-2012, 06:15 PM | #4 |
onlinenewsreader.net
|
OK I ran this from source and the problem is some garbage characters in an article description. I think calibre should "fail softly" when encountering invalid character codes since recipes aren't able to control that and it happens from time to time on periodical websites--crashing isn't a good response.
Ignoring the illegal characters and issuing a warning message would be a much better response. Unfortunately this is something that should be done at the lxml level so making calibre more robust in this case is probably a task for Kovid rather than someone like me (I don't think I have the source for lxml as part of the bazaar download of calibre). I realize that this is only an issue when calibre is running single-threaded but still--it's a limitation for people who want to debug recipes single-threaded! |
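A rough sketch of the "ignore and warn" behaviour suggested above, done on the string before it is handed to lxml rather than inside lxml itself. The helper name strip_xml_incompatible and the logger name are assumptions for illustration; the character class is the XML 1.0 range, which may not match lxml's internal check exactly.

```python
import logging
import re

log = logging.getLogger("recipe_debug")  # hypothetical logger name

# Characters outside the XML 1.0 "Char" production (assumed to approximate
# what lxml rejects).
_XML_INCOMPATIBLE = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_xml_incompatible(text, field="text"):
    """Drop disallowed characters and warn instead of raising."""
    cleaned, count = _XML_INCOMPATIBLE.subn("", text)
    if count:
        log.warning(
            "Removed %d XML-incompatible character(s) from %s", count, field
        )
    return cleaned


summary = strip_xml_incompatible("Head\x00line\x1f", field="description")
```

Tabs, newlines and carriage returns are kept, since XML allows them; only the characters that would trip the ValueError are removed.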
08-01-2012, 12:30 AM | #5 |
creator of calibre
|
Why is fixing lxml a task for me? I don't maintain lxml. You can get access to the lxml source at https://launchpad.net/lxml
Though you will find that creating a parser that never fails, no matter what garbage you feed it, is well-nigh impossible. |
08-01-2012, 05:09 PM | #6 |
onlinenewsreader.net
|
I would have thought ensuring calibre is robust would be a task for you since you are making a living off of it, but I can tell from your snarky attitude that discussing this further is pointless.
|
08-02-2012, 12:00 AM | #7 |
creator of calibre
|
I love it when people make the assumption that because I maintain calibre I am somehow obligated to drop everything and rush off to fix whatever they think needs to be fixed. You are asking me to spend time on an issue that is important to you. Do not assume that just because you think it is important, everyone else must share your opinion.
|
08-02-2012, 12:26 AM | #8 |
onlinenewsreader.net
|
If you cast your eyes back over this (very short) thread you won't find any suggestion that you should "drop everything and rush off" to fix this. I merely alerted you to the problem.
Your responses have been quite rude and inappropriate. If you don't care about an issue there is no need to get nasty about it. |
08-02-2012, 12:45 AM | #9 |
creator of calibre
|
Let's see. Quoting you:
"I would have thought ensuring calibre is robust would be a task for you since you are making a living off of it" "Unfortunately this is something that should be done at the lxml level so making calibre more robust in this case is probably a task for Kovid" |
08-02-2012, 06:40 AM | #10 | |
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
|
Quote:
I don't think that attempting to bully the developer with side-swipe comments is a good way to enlist his help in fixing a problem in a third party component. Perhaps you could make a non-monetary contribution by having a go at fixing the lxml problem yourself, now that Kovid has pointed you to the source. |
|
08-02-2012, 08:46 AM | #11 | |
onlinenewsreader.net
|
Quote:
|
|
08-02-2012, 08:47 AM | #12 | |
onlinenewsreader.net
|
Quote:
|
|
08-02-2012, 08:49 AM | #13 |
creator of calibre
|
|
08-02-2012, 09:25 AM | #14 |
onlinenewsreader.net
|
OK kettle, in the loop at line 160 in templates.py, why not just wrap the li=LI(... and li.append(... statements in try:/except:, put a message in the log, and continue on the exception? Chances are the only reason those two statements would fail is illegal character codes, which has nothing to do with parsing structure, so falling back to another parser wouldn't fix that.
I'm not suggesting you "rush off" and do that immediately though! |
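The try/except idea above might look something like this in outline. It is only a sketch: build_items and strict_item are hypothetical stand-ins, not the real loop or the LI(...) builder call in calibre's templates.py, and strict_item merely imitates the ValueError that lxml raises.

```python
import logging

log = logging.getLogger("feed_index")  # hypothetical logger name


def build_items(articles, make_item):
    """Build one list item per article, skipping any that raise ValueError."""
    items = []
    for index, article in enumerate(articles):
        try:
            items.append(make_item(article))
        except ValueError as err:
            # Log and carry on instead of aborting the whole feed index.
            log.warning("Skipping article %d: %s", index, err)
    return items


def strict_item(article):
    """Stand-in for a builder call that rejects control characters."""
    if any(ord(ch) < 32 and ch not in "\t\n\r" for ch in article):
        raise ValueError("All strings must be XML compatible")
    return "<li>%s</li>" % article


items = build_items(["ok", "bad\x00", "fine"], strict_item)
```

Catching ValueError specifically, rather than using a bare except:, keeps unrelated bugs visible while still letting a single bad article be skipped.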
08-02-2012, 09:34 AM | #15 |
creator of calibre
Posts: 43,975
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Because, pot, illegal characters are already stripped from both the title and the text_summary in the __init__ method of the Article class. So your exception isn't because of illegal characters.
|
|