03-12-2011, 10:58 AM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2011
Device: Nook
|
Politifact Recipe Problems
I'm having problems with the Politifact recipe. It posts the short descriptions fine in the section listing, but some of the actual articles are just a mess of symbols and special characters. Many of the articles come out fine. I've tried to work out what the recipe's Python code is misinterpreting so that I can have it strip the offending tags, but without success. I'm new at this, so I'm probably missing something. Ideas?
|
03-13-2011, 11:11 AM | #2 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I've seen this when ads are being randomly inserted. See if that's happening. I've also seen it when redirects occur and the processing isn't following quickly enough. Try adding a delay and running a single-threaded download:
Code:
simultaneous_downloads = 1
delay = 5
|
03-13-2011, 09:34 PM | #3 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2011
Device: Nook
|
Quote:
|
|
03-13-2011, 09:47 PM | #4 |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2011
Device: Nook
|
Just found the problem after searching further and following one of the ideas in the reusable code section. It was all about the links in certain stories. I used the code segment that converts links to text, and now there are no more problems. Code reprinted here for the next person.
Spoiler:
|
08-23-2011, 09:49 PM | #5 |
Member
Posts: 12
Karma: 10
Join Date: Aug 2011
Device: Nook
|
Is there a way to get this into the Calibre release? I'm seeing the same issue with the latest version.
|
08-24-2011, 09:40 AM | #6 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
I checked it out by adding that code to the Politifact recipe, and it doesn't solve the problem. |
|
08-24-2011, 04:36 PM | #7 | |
Member
Posts: 12
Karma: 10
Join Date: Aug 2011
Device: Nook
|
Quote:
I'm thinking it's a download problem. I copied the script off and ran:
ebook-convert PolitifactKJN.recipe .epub -vv --debug-pipeline debug
Then I found a bad section and hunted it down in debug\input. The index.html there is just garbage. Isn't that the raw stuff downloaded before the recipe kicks in? If not, how do I get to the raw stuff? |
|
08-24-2011, 05:00 PM | #8 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
def preprocess_html(self, soup):
    print 'The raw stuff is: ', soup
    return soup
|
|
08-25-2011, 10:18 AM | #9 |
Member
Posts: 12
Karma: 10
Join Date: Aug 2011
Device: Nook
|
It's not clear. Yesterday, I was looking at a check on Krugman, and in several runs it was always bad, but then it was ok. Today that one is still ok, but I've had four runs where a check on farm tractors is garbage.
It is the raw data, though: the debug snippet you gave me shows the crud. I also see that all of the crud shows "WARNING: Encoding detection confidence 0%". I captured the complete fetch with Wireshark, and I can't find any garbage in the capture. I did find at least one reply that came in gzip'd, though; I don't know if Calibre can handle a gzip'd response. |
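One way to test the gzip theory on a suspect page is to look at its first two bytes, since every gzip stream starts with 0x1f 0x8b. The sketch below is plain Python 3 and is not part of calibre; the helper names looks_gzipped and describe are made up for illustration, and it assumes you already have the raw bytes of a response in hand (for example, read from a file opened in binary mode).

```python
import gzip

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(data):
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def describe(data):
    """Give a one-word guess about what a downloaded body contains."""
    if looks_gzipped(data):
        return "gzip"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"
    return "text"


if __name__ == "__main__":
    page = "<html><body>Truth-O-Meter</body></html>".encode("utf-8")
    print(describe(page))                 # text
    print(describe(gzip.compress(page)))  # gzip
```

If a page that renders as garbage comes back as "gzip" here, that would be consistent with a compressed body being treated as text, which might also explain a 0% encoding detection confidence.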
08-25-2011, 10:40 AM | #10 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
A server should never send gzip if the client doesn't say it accepts it. But you can add gzip support to a particular recipe by adding:
Code:
def get_browser(self):
    br = BasicNewsRecipe.get_browser(self)
    br.set_handle_gzip(True)
    return br
|
08-25-2011, 10:57 AM | #11 | |
Member
Posts: 12
Karma: 10
Join Date: Aug 2011
Device: Nook
|
Quote:
Given that gzip is possible, is there any reason to not decode gzip even if it wasn't requested? |
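To make the question concrete, decoding gzip regardless of what was requested could look roughly like the sketch below. This is a standalone Python 3 illustration, not how calibre or its browser object actually behaves; decode_body is a hypothetical helper name, and the approach (sniffing the magic bytes instead of trusting the response headers) is an assumption made for the example.

```python
import gzip
import zlib


def decode_body(data):
    """Gunzip data if it starts with the gzip magic bytes, else return it unchanged."""
    if data[:2] != b"\x1f\x8b":
        return data
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        # It looked like gzip but was not a valid stream; hand back the original.
        return data


if __name__ == "__main__":
    page = b"<html><body>Truth-O-Meter</body></html>"
    assert decode_body(gzip.compress(page)) == page  # compressed body is unpacked
    assert decode_body(page) == page                 # plain body passes through
```

One possible reason for caution with this kind of sniffing is that a response whose payload really is meant to be a gzip file would be unpacked as well, so it is probably safer as a per-recipe workaround than as a blanket default.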
|
08-25-2011, 11:26 AM | #12 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
If it continues to work, give us an update. I can't recall seeing anything quite like this before, but it's a handy tool to know about.
|
08-26-2011, 11:33 AM | #13 | |
Member
Posts: 12
Karma: 10
Join Date: Aug 2011
Device: Nook
|
Quote:
I also see that the Obamameter feed isn't right; it needs to somehow follow another link in, but that's not a very interesting feed to me. Anyone know what procedural hoops I need to go through to get this into the official release? (Yes, I should just RTM.) |
|
08-26-2011, 02:25 PM | #14 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Kovid will usually pick it up here. He probably prefers a complete tested recipe, rather than a code chunk to add in, which may require more testing from him, but this one's pretty simple.
|
08-26-2011, 02:57 PM | #15 |
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
This is already in 0.8.16
|