Old 01-14-2011, 11:29 AM   #1
guterm
Junior Member
Posts: 9
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
Debugging intermittent failures, how to?

Dear all,

I am trying to debug intermittent failures in the download and conversion of Globe and Mail articles. Roughly 5% of articles end up with missing text, and I am at a loss to understand the root cause. I cannot reproduce the problem with --test, yet on a full paper download I always end up with a couple of empty articles.

Is there a setting or option to download and preserve all the unprocessed source HTML files when using a class derived from BasicNewsRecipe, and then rerun processing on the pre-downloaded files? That would hopefully let me tell whether it is a download issue or a processing issue, and make debugging easier.

Thanks guys!

/guterm
Old 01-14-2011, 11:35 AM   #2
kovidgoyal
creator of calibre
Posts: 34,554
Karma: 11409410
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
--debug-pipeline
Old 01-14-2011, 11:53 AM   #3
guterm
Thank you! Managed to progress further.
The failing articles in the "input" directory are already empty. For each failing article I see the following error message:
Code:
Parsing feed_4/article_0/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages\calibre\ebooks\oeb\base.py", line 857, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: xmlParseEntityRef: no name, line 4, column 17
Have you seen anything like this? I am running the latest version of calibre.
Old 01-14-2011, 12:00 PM   #4
kovidgoyal
If they are empty in the input directory, then the recipe is downloading them empty. That usually means your remove_tags and similar settings are too aggressive.
Old 01-14-2011, 12:08 PM   #5
guterm
Thank you, let me play with that.

Is there a way to preserve the EXACT HTML that got downloaded?
I have already confirmed that the Globe and Mail actively varies the structure of the same article from one download to the next, so it would be good to pin down the exact problem.

/guterm
Old 01-14-2011, 12:45 PM   #6
kovidgoyal
Use the preprocess_html function and save the html to a file
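A minimal sketch of that suggestion, assuming Python 3 and a reasonably current calibre. Only preprocess_html(self, soup) is the real recipe hook; the mixin name, dump_dir and the file naming are made up for illustration. One caveat, as far as I read the calibre API docs: preprocess_html is called after the remove_tags-style cleanup, so the saved file is the cleaned page. Newer calibre versions also document preprocess_raw_html(raw_html, url), which should be closer to the EXACT download; check the docs for your version.

```python
import os
import tempfile
from itertools import count


class SaveHtmlMixin:
    """Hypothetical helper: writes every page handed to preprocess_html to disk.

    In a real recipe you would list it before the calibre base class, e.g.
        class GlobeDebug(SaveHtmlMixin, BasicNewsRecipe): ...
    """

    # Assumed location; point this wherever is convenient.
    dump_dir = os.path.join(tempfile.gettempdir(), 'recipe_html_dump')
    _dump_counter = count()

    def preprocess_html(self, soup):
        os.makedirs(self.dump_dir, exist_ok=True)
        name = 'article_%04d.html' % next(self._dump_counter)
        with open(os.path.join(self.dump_dir, name), 'w', encoding='utf-8') as f:
            f.write(str(soup))  # str() of a soup object is its markup
        return soup  # hand the page back unchanged so processing continues
```

Comparing a dumped file from a good run with one from a failing run should show which element or class name changed.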
Old 01-14-2011, 05:21 PM   #7
guterm
Thank you, that's most helpful. The server was switching div classes back and forth.
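One way to cope with a server that alternates between class names is to match all known variants in the recipe's tag filters. This is only a sketch: the two class names below are hypothetical placeholders, not the ones the Globe and Mail actually served, and the thread does not say how the recipe was finally fixed.

```python
import re

# Hypothetical class names: substitute the variants seen in the saved HTML.
ARTICLE_BODY_CLASS = re.compile(r'^(articlecopy|article-body)$')

# Recipe attribute: keep only the div that holds the article text,
# whichever of the known class names the server used this time.
keep_only_tags = [
    dict(name='div', attrs={'class': ARTICLE_BODY_CLASS}),
]
```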

A somewhat related question: is there a way to completely delete an article after looking at its content? For example, if I parse an article and find that it is nothing but a slideshow or a video, can I remove it from the feed index without corrupting processing?

Unfortunately I cannot determine that from the URLs, as I see some other scripts doing.

Thanks again!

/guterm
Old 01-14-2011, 07:02 PM   #8
kovidgoyal
Just return None from preprocess_html
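A rough sketch of what that could look like, assuming Python 3. The "too little paragraph text" heuristic, the 200-character threshold and all helper names are assumptions for illustration; the thread does not say how slideshow or video pages were actually detected. Newer calibre versions also appear to document an abort_article() method for this purpose, so check the API docs for your version.

```python
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def paragraph_text_length(html):
    """Length of the paragraph text, with runs of whitespace collapsed."""
    parser = _ParagraphText()
    parser.feed(html)
    parser.close()
    return len(' '.join(''.join(parser.chunks).split()))


class SkipThinArticlesMixin:
    """Hypothetical mixin: list it before BasicNewsRecipe in a real recipe."""

    min_text_chars = 200  # assumed threshold, tune per site

    def preprocess_html(self, soup):
        if paragraph_text_length(str(soup)) < self.min_text_chars:
            return None  # per the advice above, this drops the article
        return soup
```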
Old 01-15-2011, 10:45 PM   #9
guterm
Fantastic. I just posted an updated Globe recipe; it may make sense to update the built-in one. Do you want me to submit it as an attachment to that bug?
Old 01-16-2011, 01:20 AM   #10
kovidgoyal
Either a bug or a message in this forum saying the recipe is ready to be updated.
All times are GMT -4.