#1 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Debugging intermittent failures, how to?
Dear all,
I am trying to debug intermittent failures with the download and conversion of Globe & Mail articles. About 5% of articles end up with missing text, and I am at a loss to understand the root cause. I cannot reproduce the problem with --test, yet on a full paper download I always end up with a couple of empty articles.
Is there a way, setting, or option to download and preserve all the unprocessed source HTML files when using a class derived from BasicNewsRecipe, and then rerun processing on the pre-downloaded files? This would hopefully let me tell whether it is a download issue or a processing issue, and ease the debugging.
Thanks guys! /guterm
#2 |
creator of calibre
Posts: 45,295
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
--debug-pipeline
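For readers unfamiliar with the option: --debug-pipeline is an ebook-convert option that takes a folder name and saves the output of the conversion stages there. The following is a minimal sketch, in Python, of launching a full (non --test) download with the option enabled. The recipe filename, output filename, and folder name are placeholders, not values from this thread.

```python
import subprocess


def build_debug_command(recipe_path, debug_dir, output='globe.epub'):
    """Build an ebook-convert command line that keeps intermediate files.

    recipe_path, debug_dir and output are placeholder names for illustration.
    """
    return [
        'ebook-convert', recipe_path, output,
        '-vv',                          # verbose log, so parse errors are visible
        '--debug-pipeline', debug_dir,  # folder that receives the stage output
    ]


def run_debug_download(recipe_path, debug_dir):
    """Run the full download; requires calibre's ebook-convert on the PATH."""
    subprocess.check_call(build_debug_command(recipe_path, debug_dir))
```

Calling run_debug_download('globe.recipe', 'debug') would then leave the per-stage files under the debug folder for inspection.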
#3 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Thank you! Managed to progress further.
The failing articles in the "input" directory are already empty. For each failing article I see the following error message:
Code:
    Parsing feed_4/article_0/index.html ...
    Initial parse failed:
    Traceback (most recent call last):
      File "site-packages\calibre\ebooks\oeb\base.py", line 857, in first_pass
      File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
      File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
      File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
      File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
      File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
      File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
      File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
    XMLSyntaxError: xmlParseEntityRef: no name, line 4, column 17
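A note on reading that message: "xmlParseEntityRef: no name" is the error libxml2 reports when a strict XML parse meets an "&" that is not followed by an entity name, for example a bare ampersand in text. The sketch below reproduces the same class of failure with Python's standard-library XML parser rather than lxml, so the exception type and wording differ from the traceback above. The sample markup is made up for illustration. Whether this parse error is related to the missing text cannot be determined from the log alone.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A bare ampersand is fine for lenient HTML parsers but not for strict XML.
bare = '<p>Globe & Mail</p>'
escaped = '<p>Globe &amp; Mail</p>'

print(parses_as_xml(bare))
print(parses_as_xml(escaped))
```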
#4 |
creator of calibre
Posts: 45,295
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If they are empty in the input directory, then the recipe is downloading them empty. That usually means your remove_tags and similar settings are too aggressive.
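As an illustration of what "too aggressive" can look like: to my understanding, keep_only_tags discards everything except the elements it matches, so a pattern that matches only one of several page layouts leaves an empty article whenever the site serves another layout. Below is a minimal sketch of more tolerant filters. All class names are hypothetical, the feed definitions are omitted, and the fallback base class exists only so the snippet can be loaded outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class BasicNewsRecipe(object):
        """Stand-in so the sketch can be imported outside calibre."""


class TolerantFiltersRecipe(BasicNewsRecipe):
    title = 'Tolerant filters (sketch)'

    # Hypothetical class names: list every variant the site is seen to use
    # for the article container, so that at least one of them matches.
    keep_only_tags = [
        dict(name='div', attrs={'class': ['article-body', 'articleBody']}),
    ]

    # Remove only narrowly targeted elements (hypothetical names), never a
    # wrapper that might also contain the article text.
    remove_tags = [
        dict(name='div', attrs={'class': ['ad-box', 'related-links']}),
    ]
```

A quick way to test the hypothesis is to empty both lists for one run and check whether the articles that used to come out empty now contain text.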
#5 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Thank you, let me play with that.
Is there a way to preserve the EXACT HTML that got downloaded? I have already confirmed that the Globe & Mail site actively varies the output/structure of the same article from one download to another, so it would be good to find the exact problem. /guterm
#6 |
creator of calibre
Posts: 45,295
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use the preprocess_html function and save the HTML to a file.
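A minimal sketch of that suggestion, assuming Python 3 and a hypothetical dump folder name. Files are named by a hash of their content, so two differing downloads of the same article end up in separate files. To my understanding, calibre applies keep_only_tags and remove_tags before it calls preprocess_html, so the sketch leaves both empty while debugging; otherwise the saved file would already be filtered. Newer calibre releases also document a preprocess_raw_html(raw_html, url) hook that runs before parsing, which may be closer to the exact bytes. The fallback base class is only there so the snippet can be exercised outside calibre.

```python
import hashlib
import io
import os

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class BasicNewsRecipe(object):
        """Stand-in so the sketch can be exercised outside calibre."""


class DumpHtmlRecipe(BasicNewsRecipe):
    title = 'Dump downloaded HTML (sketch)'
    dump_dir = 'html_dump'  # hypothetical folder name

    # Left empty on purpose while debugging (see the note above).
    keep_only_tags = []
    remove_tags = []

    def preprocess_html(self, soup):
        if not os.path.isdir(self.dump_dir):
            os.makedirs(self.dump_dir)
        markup = str(soup)
        # Content-based name: a changed page produces a new file.
        name = hashlib.sha1(markup.encode('utf-8')).hexdigest() + '.html'
        with io.open(os.path.join(self.dump_dir, name), 'w', encoding='utf-8') as f:
            f.write(markup)
        return soup  # hand the page on unchanged
```

Comparing a saved file from a good article with one from an empty article should show which elements differ between the two layouts.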
#7 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Thank you, that's most helpful. The server was switching div classes back and forth.
A somewhat related question: is there a way to completely delete an article after looking at its content? E.g., if I parse an article and find that it's nothing but a slideshow or a video, is there a way to remove it from the feed index without corrupting processing? Unfortunately I cannot determine that based on URLs, as I see some other scripts doing. Thanks again! /guterm
#8 |
creator of calibre
Posts: 45,295
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Just return None from preprocess_html
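A minimal sketch of that approach, deciding from the page content rather than the URL. The class names used to detect a slideshow, a video, or real article text are hypothetical and would need to be replaced with whatever the site actually uses. The fallback base class is only there so the snippet can be exercised outside calibre. Newer calibre releases also document an abort_article() method for skipping an article from inside the preprocess methods.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    class BasicNewsRecipe(object):
        """Stand-in so the sketch can be exercised outside calibre."""


class SkipMediaOnlyRecipe(BasicNewsRecipe):
    title = 'Skip media-only articles (sketch)'

    # Hypothetical class names, for illustration only.
    media_classes = ('slideshow', 'video-player')
    body_class = 'article-body'

    def is_media_only(self, soup):
        """True if the page has a media container but no article text."""
        has_media = any(
            soup.find('div', attrs={'class': c}) is not None
            for c in self.media_classes
        )
        has_body = soup.find('div', attrs={'class': self.body_class}) is not None
        return has_media and not has_body

    def preprocess_html(self, soup):
        if self.is_media_only(soup):
            return None  # as suggested above: drop the article
        return soup
```

Keeping the test in a separate is_media_only method makes it easy to tighten the rule later, for example by also requiring a minimum amount of text before an article is kept.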
#9 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
|
Fantastic. I just posted an updated Globe recipe; it may make sense to update the built-in one. Do you want me to submit it as an attachment to that bug?
#10 |
creator of calibre
Posts: 45,295
Karma: 27111240
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Either a bug or a message in this forum saying the recipe is ready to be updated.