01-14-2011, 11:29 AM | #1 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Debugging intermittent failures, how to?
Dear all,
I am trying to debug intermittent failures in the download and conversion of Globe&Mail articles. Roughly 5% of articles end up with missing text, and I am at a loss to find the root cause. I cannot reproduce the problem with --test, yet a full paper download always leaves me with a couple of empty articles. Is there a setting or option to download and preserve all the unprocessed source HTML files when using a class derived from BasicNewsRecipe, and then rerun the processing on those pre-downloaded files? That would let me tell whether it is a download or a processing issue and make debugging easier. Thanks guys! /guterm |
01-14-2011, 11:35 AM | #2 |
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
--debug-pipeline
|
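For reference, a minimal sketch of how the option might be driven from a script. `--debug-pipeline` is an `ebook-convert` option that takes a directory; the file names `globe.recipe` and `globe.epub` and the `debug` directory are placeholders for illustration, not names from this thread.

```python
import subprocess


def build_debug_command(recipe_path, output_path, debug_dir):
    """Build an ebook-convert command line that keeps intermediate files.

    All three arguments are placeholder paths chosen for illustration.
    """
    return [
        'ebook-convert',      # calibre's command-line converter
        recipe_path,          # e.g. a .recipe file derived from BasicNewsRecipe
        output_path,          # the e-book to produce
        '--debug-pipeline',   # save the output of each conversion stage...
        debug_dir,            # ...into this directory
    ]


def run_debug_conversion(recipe_path, output_path, debug_dir):
    """Run the conversion; requires calibre to be installed and on PATH."""
    subprocess.check_call(
        build_debug_command(recipe_path, output_path, debug_dir))


if __name__ == '__main__':
    # Print the command instead of running it, so this works without calibre.
    print(' '.join(build_debug_command('globe.recipe', 'globe.epub', 'debug')))
```

After a run, the downloaded articles can be inspected under the "input" subdirectory of the chosen debug directory.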
01-14-2011, 11:53 AM | #3 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Thank you! Managed to progress further.
The failing articles in the "input" directory are already empty. For each failing article I see the following error message: Code:
Parsing feed_4/article_0/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages\calibre\ebooks\oeb\base.py", line 857, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: xmlParseEntityRef: no name, line 4, column 17
|
01-14-2011, 12:00 PM | #4 |
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If they are empty in the input directory, then the recipe is downloading them empty. That usually means your remove_tags and similar settings are too aggressive.
|
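To illustrate what less aggressive cleanup settings could look like, here is a minimal sketch. `keep_only_tags` and `remove_tags` are standard BasicNewsRecipe attributes, but every class name below is a made-up placeholder, not the Globe and Mail's actual markup. Listing more than one entry in `keep_only_tags` keeps a tag that matches any of them, which may help when a site serves the same article under different container classes.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be checked outside calibre.
    class BasicNewsRecipe(object):
        pass


class CleanupSketch(BasicNewsRecipe):
    title = 'Cleanup settings (sketch)'

    # A tag is kept if it matches ANY entry, so both layout variants are
    # listed. The class names are hypothetical placeholders.
    keep_only_tags = [
        dict(name='div', attrs={'class': 'article-body'}),
        dict(name='div', attrs={'class': 'article-copy'}),
    ]

    # Remove only clearly unwanted blocks; an overly broad entry here can
    # strip the article text itself. Also a hypothetical class name.
    remove_tags = [
        dict(name='div', attrs={'class': 'related-links'}),
    ]
```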
01-14-2011, 12:08 PM | #5 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Thank you, let me play with that.
Is there a way to preserve the EXACT HTML that got downloaded? I have already confirmed that Globe&Mail actively varies the output/structure of the same article from one download to another, so it would be good to pin down the exact problem. /guterm |
01-14-2011, 12:45 PM | #6 |
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use the preprocess_html function and save the HTML to a file.
|
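A minimal sketch of that suggestion, assuming a recipe class derived from BasicNewsRecipe. The class name, the dump directory, and the file naming scheme are illustrative choices, not part of calibre. Note that `preprocess_html` receives a parsed soup object, so what gets written is the HTML as serialized from that object.

```python
import itertools
import os
import tempfile

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be checked outside calibre.
    class BasicNewsRecipe(object):
        pass


class DumpHtmlSketch(BasicNewsRecipe):
    title = 'Dump downloaded HTML (sketch)'

    # Hypothetical dump location; any writable directory would do.
    dump_dir = os.path.join(tempfile.gettempdir(), 'recipe_html_dump')
    # Shared counter used to give each dumped article a distinct file name.
    _dump_ids = itertools.count()

    def preprocess_html(self, soup):
        os.makedirs(self.dump_dir, exist_ok=True)
        name = 'article_%04d.html' % next(self._dump_ids)
        path = os.path.join(self.dump_dir, name)
        # Serialize the soup and write it out for later inspection.
        with open(path, 'w', encoding='utf-8') as f:
            f.write(str(soup))
        # Return the soup unchanged so processing continues as usual.
        return soup
```

Comparing the dumped files for a good and a failing download of the same article should show whether the markup itself changed between runs.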
01-14-2011, 05:21 PM | #7 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Thank you, that's most helpful. The server was switching div classes back and forth.
Somewhat related question: is there a way to completely delete an article after looking at its content? E.g., if I parse an article and find that it's nothing but a slideshow or a video, is there a way to remove it from the feed index without corrupting processing? Unfortunately I cannot determine that from the URLs, as I see some other scripts doing. Thanks again! /guterm |
01-14-2011, 07:02 PM | #8 |
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Just return None from preprocess_html
|
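A minimal sketch of that suggestion, under stated assumptions: the class names in `MEDIA_ONLY_CLASSES` and the helper `is_media_only` are hypothetical, and the check is a crude heuristic that would also drop an article that merely contains such a container alongside real text.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be checked outside calibre.
    class BasicNewsRecipe(object):
        pass


# Hypothetical class names marking slideshow or video pages.
MEDIA_ONLY_CLASSES = ('slideshow', 'video-player')


def is_media_only(soup):
    """Return True if the parsed page contains one of the marker divs."""
    return any(
        soup.find('div', attrs={'class': cls}) is not None
        for cls in MEDIA_ONLY_CLASSES
    )


class SkipMediaSketch(BasicNewsRecipe):
    title = 'Skip slideshows and videos (sketch)'

    def preprocess_html(self, soup):
        if is_media_only(soup):
            # Per the reply above, returning None discards the article.
            return None
        return soup
```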
01-15-2011, 10:45 PM | #9 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Fantastic, I just posted an updated Globe recipe; it may make sense to update the built-in one. Do you want me to submit it as an attachment to that bug?
|
01-16-2011, 01:20 AM | #10 |
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Either a bug or a message in this forum saying the recipe is ready to be updated.
|