Information Week still not fully working

Camper65 · 11-30-2013, 11:20 PM

I've been trying to fix this recipe of mine for a while now, but still can't get it to pull multipage articles. At least now it's pulling first page and letting me go to the other pages.

My recipe

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds import Feed

class InformationWeek(BasicNewsRecipe):
    title          = u'InformationWeek'
    oldest_article = 6
    max_articles_per_feed = 150
    auto_cleanup = True
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds = True
    remove_javascript = False
    use_embedded_content   = True
    recursions = 1
    match_regexps = [r'page_number=[0-9]+']
    

    feeds          = [
                          (u'InformationWeek - Stories', u'www.informationweek.com/rss_feeds.asp'),
                          (u'InformationWeek - Software', u'http://www.informationweek.com/rss_simple.asp?f_n=476&f_ln=Software'),
                     ]

    def parse_feeds (self): 
      feeds = BasicNewsRecipe.parse_feeds(self) 
      for feed in feeds:
        for article in feed.articles[:]:
          print 'article.title is: ', article.title
          if 'healthcare' in article.title or 'healthcare' in article.url:
            feed.articles.remove(article)
      return feeds

the area of an article that shows (in bold) the language for the next page

Spoiler:

<div class="divsplitter" style="height: 1.25em;"></div><div style="height:
1.666em;"><div style="float: right;"><span class="smaller blue"><img src="
http://img.deusm.com/informationweek/slideshow-arrow-gray-left.png"
alt="Previous" style="width: 1.666em; height: 1.666em; border: 0; float: left;
margin-right: 0.666em;" /><div style="float: left; height: 1.416666em; padding-
top: .25em;">1 of 3</div><a href="
http://www.informationweek.com/mobil...pping-guide-8-
tips/d/d-id/1112842?page_number=2" title="Next" ><img src="
http://img.deusm.com/informationweek/slideshow-arrow-black-right.png" alt="Next"
style="width: 1.666em; height: 1.666em; border: 0; float: right; margin-left:
0.666em;" /></a></span></div></div><div class="divsplitter" style="height:
.666em;"></div><div style="float: left; margin-right: 2px;"><span class="smaller
blue allcaps"><a href="#msgs">Comment</a>  | </span></div><div
style="float: left; margin-right: 2px;"><span class="smaller blue allcaps"><a
href="email.asp"
onclick="window.open('/email.asp?url='+encodeURIComponent(thispage_sharel ink)
+'&title='+encodeURIComponent(document.title),'',' '); return false;">Email
This</a>  | </span></div><div style="float: left; margin-right: 2px;">
<span class="smaller blue allcaps"><a href="/mobile/mobile-devices/tablet-
shopping-guide-8-tips/d/d-id/1112842?print=yes">Print</a>  | </span>
</div><div style="float: left; margin-right: 2px;"><span class="smaller blue
allcaps"><a href="http://www.informationweek.com/rss_simple.asp">RSS</a></span>
</div><div class="divsplitter" style="height: .666em;"></div><div
class="divsplitter" style="height: 4px; background: #aaa;"></div><div
class="divsplitter" style="height: .666em;"></div><div id="more-insights"><span
class="smaller strong red allcaps">More Insights</span></div><div
class="divsplitter" style="height: 0.25em;"></div><div class="more-insights-
item"><span class="small strong darkgray">Webcasts</span><div
class="divsplitter" style="height: 0.25em;"></div><div xmlns:a10="
http://www.w3.org/2005/Atom">

sample article that has multiple pages:
http://www.informationweek.com/mobil...ek_sitedefault

what changes to I need to make to my recipe to get this to work right?

kovidgoyal · 11-30-2013, 11:29 PM

You cannot use auto_cleanup and match_regexps, since it is quite likely that auto_cleanup is removing the links that you intend to follow.

Camper65 · 12-01-2013, 10:10 AM

okay removed that but now it's not even giving me the full first page. It's only giving me the title and the description under the title (possibly from the feed itself and not the article?). I also had to change the first feed since that was changed (and I didn't realize it until I did the testing again).

(using the debug for testing purposes bat file)

Here is what the input of the feed comes up with

Spoiler:

this is the feed I'm using from their website.
http://www.informationweek.com/rss_simple.asp

using the batch file to create myrecipe.txt gives me this:

Spoiler:

Any ideas?

Camper65 · 04-01-2014, 11:30 PM

Still having trouble getting page 2+ of articles from InformationWeek.

Here is the recipe I'm trying (with two different ways I think of getting more than one page)

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds import Feed

class InformationWeek(BasicNewsRecipe):
    title          = u'InformationWeek'
    oldest_article = 3
    max_articles_per_feed = 150
    auto_cleanup = True
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds = True
    remove_javascript = True
    use_embedded_content   = False
    recursions = 0

    feeds          = [
                          (u'InformationWeek - Stories', u'http://www.informationweek.com/rss_simple.asp'),
                          (u'InformationWeek - Software', u'http://www.informationweek.com/rss_simple.asp?f_n=476&f_ln=Software'),
                          (u'InformationWeek - Mobile', u'http://www.informationweek.com/rss_simple.asp?f_n=457&f_ln=Mobile')
                     ]

    def parse_feeds (self): 
      feeds = BasicNewsRecipe.parse_feeds(self) 
      for feed in feeds:
        for article in feed.articles[:]:
          print 'article.title is: ', article.title
          if 'healthcare' in article.title or 'healthcare' in article.url:
            feed.articles.remove(article)
      return feeds

#    def is_link_wanted(self, url, tag):
#            ans = re.match(r'href://.*/[2-9]/', url) is not None
#            if ans:
#                self.log('Following multipage link: %s'%url)
#            return ans
    
#    def postprocess_html(self, soup, first_fetch):
#            for pag in soup.findAll(True, 'pagination'):
#                pag.extract()
#            if not first_fetch:
#                h1 = soup.find('h1')
#                if h1 is not None:
#                    h1.extract()
#            return soup

#another attempt at pulling more than one page

    def append_page(self, soup, appendtag, position, surl):
        pager = soup.find('div', attrs={'class':'pages'})
        if pager:
          nextpages = soup.findAll('a', attrs={'class':'a1'})
          nextpage = nextpages[1]
          if nextpage and (nextpage['href'] != surl):
              nexturl = nextpage['href']
              soup2 = self.index_to_soup(nexturl)
              texttag = soup2.find('div', attrs={'class':'content_left_5'})
              for it in texttag.findAll(style=True):
                  del it['style']
              newpos = len(texttag.contents)
              self.append_page(soup2,texttag,newpos,nexturl)
              texttag.extract()
              pager.extract()
              appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3, '')
        pager = soup.find('div', attrs={'id':'pages'})
        if pager:
          pager.extract()
        return self.adeify_images(soup)

here is a link to an article with more than one page
http://www.informationweek.com/softw...d/d-id/1141628

and the text for the next page area of the first page

Spoiler:

what am I doing wrong to get the next page (and more if more than 2 pages)?

Divingduck · 04-02-2014, 07:19 AM

It is maybe better to look on
<p align="right">... <b><a href>... Next Page

instead of
<div class="divsplitter...>

Camper65 · 04-02-2014, 10:00 PM

Since this is the first time I'm dealing with multiple pages, figured I'd post the changes and see if you think I got it. I'm good with feeds with single page articles already but these multipage.....

So I have it finding the p with align right from the web page, unfortunately tonights download was all one page documents so I can't tell if it worked right or not... thanks for your help on this recipe

Code:

#another attempt at pulling more than one page

    def append_page(self, soup, appendtag, position, surl):
        pager = soup.find('p', attrs={'align':'right'})
        if pager:
          nextpages = soup.findAll('p', attrs={'align':'right'})
          nextpage = nextpages[1]
          if nextpage and (nextpage['href'] != surl):
              nexturl = nextpage['href']
              soup2 = self.index_to_soup(nexturl)
              texttag = soup2.find('p', attrs={'align':'right'})
              for it in texttag.findAll(style=True):
                  del it['style']
              newpos = len(texttag.contents)
              self.append_page(soup2,texttag,newpos,nexturl)
              texttag.extract()
              pager.extract()
              appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3, '')
        pager = soup.find('div', attrs={'id':'pages'})
        if pager:
          pager.extract()
        return self.adeify_images(soup)

11-30-2013, 11:20 PM	#1
Camper65 Enthusiast Posts: 32 Karma: 10 Join Date: Apr 2011 Device: Kindle wifi; Dell 2in1	Information Week still not fully working I've been trying to fix this recipe of mine for a while now, but still can't get it to pull multipage articles. At least now it's pulling first page and letting me go to the other pages. My recipe Code: from calibre.web.feeds.news import BasicNewsRecipe from calibre.web.feeds import Feed class InformationWeek(BasicNewsRecipe): title = u'InformationWeek' oldest_article = 6 max_articles_per_feed = 150 auto_cleanup = True ignore_duplicate_articles = {'title', 'url'} remove_empty_feeds = True remove_javascript = False use_embedded_content = True recursions = 1 match_regexps = [r'page_number=[0-9]+'] feeds = [ (u'InformationWeek - Stories', u'www.informationweek.com/rss_feeds.asp'), (u'InformationWeek - Software', u'http://www.informationweek.com/rss_simple.asp?f_n=476&f_ln=Software'), ] def parse_feeds (self): feeds = BasicNewsRecipe.parse_feeds(self) for feed in feeds: for article in feed.articles[:]: print 'article.title is: ', article.title if 'healthcare' in article.title or 'healthcare' in article.url: feed.articles.remove(article) return feeds the area of an article that shows (in bold) the language for the next page Spoiler: <div class="divsplitter" style="height: 1.25em;"></div><div style="height: 1.666em;"><div style="float: right;"><span class="smaller blue"><img src=" http://img.deusm.com/informationweek/slideshow-arrow-gray-left.png" alt="Previous" style="width: 1.666em; height: 1.666em; border: 0; float: left; margin-right: 0.666em;" /><div style="float: left; height: 1.416666em; padding- top: .25em;">1 of 3</div><a href=" http://www.informationweek.com/mobil...pping-guide-8- tips/d/d-id/1112842?page_number=2" title="Next" ><img src=" http://img.deusm.com/informationweek/slideshow-arrow-black-right.png" alt="Next" style="width: 1.666em; height: 1.666em; border: 0; float: right; margin-left: 0.666em;" /></a></span></div></div><div class="divsplitter" style="height: .666em;"></div><div style="float: left; margin-right: 2px;"><span class="smaller blue allcaps"><a href="#msgs">Comment</a>  \| </span></div><div style="float: left; margin-right: 2px;"><span class="smaller blue allcaps"><a href="email.asp" onclick="window.open('/email.asp?url='+encodeURIComponent(thispage_sharel ink) +'&title='+encodeURIComponent(document.title),'',' '); return false;">Email This</a>  \| </span></div><div style="float: left; margin-right: 2px;"> <span class="smaller blue allcaps"><a href="/mobile/mobile-devices/tablet- shopping-guide-8-tips/d/d-id/1112842?print=yes">Print</a>  \| </span> </div><div style="float: left; margin-right: 2px;"><span class="smaller blue allcaps"><a href="http://www.informationweek.com/rss_simple.asp">RSS</a></span> </div><div class="divsplitter" style="height: .666em;"></div><div class="divsplitter" style="height: 4px; background: #aaa;"></div><div class="divsplitter" style="height: .666em;"></div><div id="more-insights"><span class="smaller strong red allcaps">More Insights</span></div><div class="divsplitter" style="height: 0.25em;"></div><div class="more-insights- item"><span class="small strong darkgray">Webcasts</span><div class="divsplitter" style="height: 0.25em;"></div><div xmlns:a10=" http://www.w3.org/2005/Atom"> sample article that has multiple pages: http://www.informationweek.com/mobil...ek_sitedefault what changes to I need to make to my recipe to get this to work right?

12-01-2013, 10:10 AM	#3
Camper65 Enthusiast Posts: 32 Karma: 10 Join Date: Apr 2011 Device: Kindle wifi; Dell 2in1	okay removed that but now it's not even giving me the full first page. It's only giving me the title and the description under the title (possibly from the feed itself and not the article?). I also had to change the first feed since that was changed (and I didn't realize it until I did the testing again). (using the debug for testing purposes bat file) Here is what the input of the feed comes up with Spoiler: <html> <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>EU Tells US: End Mass Spying</title><style type="text/css" title="override_css"> .article_date { color: gray; font-family: monospace; } .article_description { text-indent: 0pt; } a.article { font-weight: bold; text-align:left; } a.feed { font-weight: bold; } .calibre_navbar { font-family:monospace; } </style></head> <body><div style="text-align:center" class="calibre_navbar calibre_rescale_70">\| <a href="../../feed_1/index.html">Next</a> \| <a href="../index.html#article_0">Section Menu</a> \| <a href="../../index.html#feed_0">Main Menu</a> \| <hr /> </div> <h2>EU Tells US: End Mass Spying</h2> <div><span>Responding to surveillance revelations, EU officials seek changes in commercial and law enforcement data sharing arrangements with the US.</span></div> <div style="text-align:center" class="calibre_navbar calibre_rescale_70"> <hr /> <p style="text-align:left; max-width: 100%; overflow: hidden;">This article was downloaded by <strong>calibre</strong> from <a href="http://www.informationweek.com/mobile/mobile-business/eu-tells-us-end-mass-spying/d/d-id/1112841?f_src=informationweek_sitedefault">http://www.informationweek.com/mobile/mobile-business/eu-tells-us-end-mass-spying/d/d-id/1112841?f_src=informationweek_sitedefault</a></p> <br /><br /> \| <a href="../index.html#article_0">Section Menu</a> \| <a href="../../index.html#feed_0">Main Menu</a> \| </div></body> </html> this is the feed I'm using from their website. http://www.informationweek.com/rss_simple.asp using the batch file to create myrecipe.txt gives me this: Spoiler: Resolved conversion options calibre version: 1.13.0 {'asciiize': False, 'author_sort': None, 'authors': None, 'base_font_size': 0, 'book_producer': None, 'change_justification': 'original', 'chapter': None, 'chapter_mark': 'pagebreak', 'comments': None, 'cover': None, 'debug_pipeline': None, 'dehyphenate': True, 'delete_blank_paragraphs': True, 'disable_font_rescaling': False, 'dont_download_recipe': False, 'duplicate_links_in_toc': False, 'embed_all_fonts': False, 'embed_font_family': None, 'enable_heuristics': False, 'expand_css': False, 'extra_css': None, 'filter_css': None, 'fix_indents': True, 'font_size_mapping': None, 'format_scene_breaks': True, 'html_unwrap_factor': 0.4, 'input_encoding': None, 'input_profile': <calibre.customize.profiles.InputProfile object at 0x0000000003359FD0>, 'insert_blank_line': False, 'insert_blank_line_size': 0.5, 'insert_metadata': False, 'isbn': None, 'italicize_common_cases': True, 'keep_ligatures': False, 'language': None, 'level1_toc': None, 'level2_toc': None, 'level3_toc': None, 'line_height': 0, 'linearize_tables': False, 'lrf': False, 'margin_bottom': 5.0, 'margin_left': 5.0, 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': True, 'max_toc_links': 50, 'minimum_line_height': 120.0, 'no_chapters_in_toc': False, 'no_inline_navbars': False, 'output_profile': <calibre.customize.profiles.OutputProfile object at 0x0000000003355358>, 'page_breaks_before': None, 'prefer_metadata_cover': False, 'pretty_print': True, 'pubdate': None, 'publisher': None, 'rating': None, 'read_metadata_from_opf': None, 'remove_fake_margins': True, 'remove_first_image': False, 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5, 'renumber_headings': True, 'replace_scene_breaks': '', 'search_replace': None, 'series': None, 'series_index': None, 'smarten_punctuation': False, 'sr1_replace': '', 'sr1_search': '', 'sr2_replace': '', 'sr2_search': '', 'sr3_replace': '', 'sr3_search': '', 'start_reading_at': None, 'subset_embedded_fonts': False, 'tags': None, 'test': (2, 2), 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6, 'unsmarten_punctuation': False, 'unwrap_lines': True, 'use_auto_toc': False, 'verbose': 2} 1% Converting input to HTML... InputFormatPlugin: Recipe Input running Using custom recipe 1% Fetching feeds... 1% Fetching feed InformationWeek - Stories... 1% Fetching feed InformationWeek - Software... article.title is: Google Glass Enables Surgeons To Consult Remotely article.title is: EU Tells US: End Mass Spying article.title is: Google Glass Enables Surgeons To Consult Remotely article.title is: Tablet Shopping Guide: 8 Tips 1% Trying to download cover... 1% Generating masthead... Synthesizing mastheadImage 1% Starting download [4 thread(s)]... Downloading Downloading Fetching file:C:\Users\Camper\AppData\Local\Temp\calibre_j4 s_xv\cznsa__feeds2disk.html Fetching file:C:\Users\Camper\AppData\Local\Temp\calibre_j4 s_xv\e2vg4d_feeds2disk.html Processing images... Processing images... Processing links...Processing links... file:C:\Users\Camper\AppData\Local\Temp\calibre_j4 s_xv\e2vg4d_feeds2disk.html saved to C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\ cxstei_plumber\feed_0\article_0\e2vg4d_feeds2disk. xhtml file:C:\Users\Camper\AppData\Local\Temp\calibre_j4 s_xv\cznsa__feeds2disk.html saved to C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\ cxstei_plumber\feed_1\article_0\cznsa__feeds2disk. xhtml Downloaded article: EU Tells US: End Mass Spying from http://www.informationweek.com/mobil...ek_sitedefault 17% Article downloaded: EU Tells US: End Mass Spying Downloaded article: Tablet Shopping Guide: 8 Tips from http://www.informationweek.com/mobil...nweek_node_476 34% Article downloaded: Tablet Shopping Guide: 8 Tips 34% Feeds downloaded to C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\ cxstei_plumber\index.html 34% Download finished Parsing all content... Parsing feed_0/index.html ... Initial parse failed, using more forgiving parsers Parsing feed_0/index.html as HTML Parsing feed_1/index.html ... Initial parse failed, using more forgiving parsers Parsing feed_1/index.html as HTML Parsing feed_1/article_0/index.html ... Forcing feed_1/article_0/index.html into XHTML namespace Parsing feed_0/article_0/index.html ... Forcing feed_0/article_0/index.html into XHTML namespace Parsing index.html ... Forcing index.html into XHTML namespace Referenced file u'feed_2/index.html' not found Reading TOC from NCX... 34% Running transforms on ebook... Merging user specified metadata... Detecting structure... Flattening CSS and remapping font sizes... Source base font size is 12.00000pt Removing fake margins... Found 11 items of level: div_1 Found 4 items of level: div_2 Found 4 items of level: p_2 Found 2 items of level: div_4 Ignoring level p_2 Ignoring level div_4 div_1 left margin stats: Counter() div_1 right margin stats: Counter() div_2 left margin stats: Counter() div_2 right margin stats: Counter() Cleaning up manifest... Trimming unused files from manifest... Creating OEB Output... 67% Running OEB Output plugin The cover image has an id != "cover". Renaming to work around bug in Nook Color OEB output written to G:\Camper\Documents\Calibre Data\Testing news\myrecipe Output saved to G:\Camper\Documents\Calibre Data\Testing news\myrecipe Any ideas? Last edited by Camper65; 12-01-2013 at 10:16 AM.

04-02-2014, 07:19 AM	#5
Divingduck Wizard Posts: 1,166 Karma: 1410083 Join Date: Nov 2010 Location: Germany Device: Sony PRS-650	It is maybe better to look on <p align="right">... <b><a href>... Next Page instead of <div class="divsplitter...> Attached Thumbnails

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
64GB MicroSDXC fully working in PE and Leger Calibre/Dropsync discussion	helf	enTourage eDGe	33	01-18-2014 03:50 PM
Instapaper recipe not working as of this week	largeboulder	Recipes	7	09-04-2013 02:59 AM
Books not being fully deleted	bloodfyr	Devices	5	02-03-2012 05:41 AM
Touch How to tell when the Touch is fully charged?	TonyToews	Kobo Reader	0	06-27-2011 10:20 PM
Information Week: e-Book Readers Need To Get A Lot Cheaper	ekaser	News	7	09-08-2009 08:35 AM

11-30-2013, 11:29 PM	#2
kovidgoyal creator of calibre Posts: 45,339 Karma: 27182818 Join Date: Oct 2006 Location: Mumbai, India Device: Various	You cannot use auto_cleanup and match_regexps, since it is quite likely that auto_cleanup is removing the links that you intend to follow.

Advert

Advert