#1
Camper65
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
Information Week still not fully working
I've been trying to fix this recipe of mine for a while now, but I still can't get it to pull multipage articles. At least now it's pulling the first page and letting me go to the other pages.
My recipe:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds import Feed

class InformationWeek(BasicNewsRecipe):
    title = u'InformationWeek'
    oldest_article = 6
    max_articles_per_feed = 150
    auto_cleanup = True
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds = True
    remove_javascript = False
    use_embedded_content = True
    recursions = 1
    match_regexps = [r'page_number=[0-9]+']

    feeds = [
        (u'InformationWeek - Stories', u'www.informationweek.com/rss_feeds.asp'),
        (u'InformationWeek - Software', u'http://www.informationweek.com/rss_simple.asp?f_n=476&f_ln=Software'),
    ]

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            for article in feed.articles[:]:
                print 'article.title is: ', article.title
                if 'healthcare' in article.title or 'healthcare' in article.url:
                    feed.articles.remove(article)
        return feeds

Spoiler:
Sample article that has multiple pages: http://www.informationweek.com/mobil...ek_sitedefault

What changes do I need to make to my recipe to get this to work right?
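As I understand it, recursions sets how many levels of links calibre follows out from each article page, and match_regexps limits which of those links actually get followed. A bare sketch of just that mechanism as I read it (placeholder class name; and note that setting use_embedded_content = True means calibre builds each article from the feed's embedded description and never downloads the article page, so no links get followed at all):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class MultiPageSketch(BasicNewsRecipe):
    title = u'Multi-page sketch'
    # Follow links one level deep from each article page...
    recursions = 1
    # ...but only those that look like pagination links.
    match_regexps = [r'page_number=[0-9]+']
    # With True, the article comes from the feed description and the
    # article page is never downloaded, so recursion never runs.
    use_embedded_content = False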
#2
kovidgoyal
creator of calibre
Posts: 45,339
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You cannot use auto_cleanup and match_regexps together, since it is quite likely that auto_cleanup is removing the links that you intend to follow.
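Something along these lines, keeping the recursion settings and dropping the automatic cleanup (you will then likely need keep_only_tags/remove_tags worked out from the site's markup, which are not shown here):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class InformationWeek(BasicNewsRecipe):
    title = u'InformationWeek'
    # auto_cleanup runs on each downloaded page and can strip the very
    # pagination links that match_regexps is meant to catch, so it has
    # to be off when following links.
    auto_cleanup = False
    recursions = 1
    match_regexps = [r'page_number=[0-9]+']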
#3
Camper65
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
Okay, I removed that, but now it's not even giving me the full first page. It's only giving me the title and the description under the title (possibly from the feed itself and not the article?). I also had to change the first feed, since that had changed on their site (and I didn't realize it until I did the testing again).

(I'm using the debug bat file for testing purposes.) Here is what the input of the feed comes up with: Spoiler:

This is the feed I'm using from their website: http://www.informationweek.com/rss_simple.asp. Using the batch file to create myrecipe.txt gives me this: Spoiler:
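For reference, the bat file is basically a wrapper around the command calibre's docs give for testing a recipe, with the output redirected to a text file (myrecipe.recipe here is just a stand-in for wherever the recipe is saved):

Code:
ebook-convert myrecipe.recipe .epub --test -vv > myrecipe.txt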
Any ideas?

Last edited by Camper65; 12-01-2013 at 10:16 AM.
#4
Camper65
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
Still trying to fix this recipe
Still having trouble getting page 2+ of articles from InformationWeek.
Here is the recipe I'm trying (with two different ways I thought of for getting more than one page): Spoiler:

Here is a link to an article with more than one page: http://www.informationweek.com/softw...d/d-id/1141628, and here is the text for the next-page area of the first page: Spoiler:

What am I doing wrong in getting the next page (and more, if there are more than 2 pages)?
#5
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
It is maybe better to look for

<p align="right">... <b><a href>... Next Page

instead of

<div class="divsplitter...>
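A rough sketch of that lookup in recipe code, assuming the markup is as quoted (a right-aligned <p> whose anchor text says Next Page; the method name is invented for illustration):

Code:
def find_next_page_url(self, soup):
    # Scan the right-aligned paragraphs that hold the pager links...
    for p in soup.findAll('p', attrs={'align': 'right'}):
        a = p.find('a', href=True)
        # ...and take the anchor whose text mentions the next page.
        if a is not None and 'Next' in self.tag_to_string(a):
            return a['href']
    return None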
#6
Camper65
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
Since this is the first time I'm dealing with multiple pages, I figured I'd post the changes and see if you think I got it right. I'm fine already with feeds whose articles are a single page, but these multipage ones...
So I have it finding the <p> with align="right" on the web page. Unfortunately, tonight's download was all one-page articles, so I can't tell whether it worked right or not... Thanks for your help on this recipe.

Code:
#another attempt at pulling more than one page
def append_page(self, soup, appendtag, position, surl):
    pager = soup.find('p', attrs={'align':'right'})
    if pager:
        nextpages = soup.findAll('p', attrs={'align':'right'})
        nextpage = nextpages[1]
        if nextpage and (nextpage['href'] != surl):
            nexturl = nextpage['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('p', attrs={'align':'right'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos, nexturl)
            texttag.extract()
            pager.extract()
            appendtag.insert(position, texttag)

def preprocess_html(self, soup):
    self.append_page(soup, soup.body, 3, '')
    pager = soup.find('div', attrs={'id':'pages'})
    if pager:
        pager.extract()
    return self.adeify_images(soup)
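One thing I'm not sure about: in the markup quoted earlier, the href sits on the <a> inside the right-aligned <p>, not on the <p> tag itself, so indexing nextpage['href'] would fail with a KeyError. Maybe the lookup part needs to be more like this (the article-body selector for the next page is site-specific and not filled in here):

Code:
pager = soup.find('p', attrs={'align': 'right'})
if pager:
    # The link is the anchor inside the paragraph, not the <p> itself.
    link = pager.find('a', href=True)
    if link is not None and link['href'] != surl:
        nexturl = link['href']
        soup2 = self.index_to_soup(nexturl)
        # texttag should then target the article body container on the
        # next page, rather than the pager paragraph again.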