Old 12-01-2013, 12:20 AM   #1
Camper65
Enthusiast
Camper65 began at the beginning.
 
Posts: 28
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Toshiba Thrive
Information Week still not fully working

I've been trying to fix this recipe of mine for a while now, but I still can't get it to pull multipage articles. At least now it pulls the first page and lets me go to the other pages.

My recipe

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds import Feed

class InformationWeek(BasicNewsRecipe):
    title          = u'InformationWeek'
    oldest_article = 6
    max_articles_per_feed = 150
    auto_cleanup = True
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds = True
    remove_javascript = False
    use_embedded_content   = True
    recursions = 1
    match_regexps = [r'page_number=[0-9]+']
    

    feeds          = [
                          (u'InformationWeek - Stories', u'www.informationweek.com/rss_feeds.asp'),
                          (u'InformationWeek - Software', u'http://www.informationweek.com/rss_simple.asp?f_n=476&f_ln=Software'),
                     ]

    def parse_feeds (self): 
      feeds = BasicNewsRecipe.parse_feeds(self) 
      for feed in feeds:
        for article in feed.articles[:]:
          print 'article.title is: ', article.title
          if 'healthcare' in article.title or 'healthcare' in article.url:
            feed.articles.remove(article)
      return feeds

The part of an article's HTML that holds the next-page link:

Spoiler:
<div class="divsplitter" style="height: 1.25em;"></div><div style="height:
1.666em;"><div style="float: right;"><span class="smaller blue"><img src="
http://img.deusm.com/informationweek/slideshow-arrow-gray-left.png"
alt="Previous" style="width: 1.666em; height: 1.666em; border: 0; float: left;
margin-right: 0.666em;" /><div style="float: left; height: 1.416666em; padding-
top: .25em;">1 of 3</div><a href="
http://www.informationweek.com/mobil...pping-guide-8-
tips/d/d-id/1112842?page_number=2" title="Next" ><img src="
http://img.deusm.com/informationweek/slideshow-arrow-black-right.png" alt="Next"
style="width: 1.666em; height: 1.666em; border: 0; float: right; margin-left:
0.666em;" /></a></span>
</div></div><div class="divsplitter" style="height:
.666em;"></div><div style="float: left; margin-right: 2px;"><span class="smaller
blue allcaps"><a href="#msgs">Comment</a> &nbsp;|&nbsp;</span></div><div
style="float: left; margin-right: 2px;"><span class="smaller blue allcaps"><a
href="email.asp"
onclick="window.open('/email.asp?url='+encodeURIComponent(thispage_sharelink)
+'&title='+encodeURIComponent(document.title),'',' '); return false;">Email
This</a> &nbsp;|&nbsp;</span></div><div style="float: left; margin-right: 2px;">
<span class="smaller blue allcaps"><a href="/mobile/mobile-devices/tablet-
shopping-guide-8-tips/d/d-id/1112842?print=yes">Print</a> &nbsp;|&nbsp;</span>
</div><div style="float: left; margin-right: 2px;"><span class="smaller blue
allcaps"><a href="http://www.informationweek.com/rss_simple.asp">RSS</a></span>
</div><div class="divsplitter" style="height: .666em;"></div><div
class="divsplitter" style="height: 4px; background: #aaa;"></div><div
class="divsplitter" style="height: .666em;"></div><div id="more-insights"><span
class="smaller strong red allcaps">More Insights</span></div><div
class="divsplitter" style="height: 0.25em;"></div><div class="more-insights-
item"><span class="small strong darkgray">Webcasts</span><div
class="divsplitter" style="height: 0.25em;"></div><div xmlns:a10="
http://www.w3.org/2005/Atom">


Sample article that has multiple pages:
http://www.informationweek.com/mobil...ek_sitedefault

What changes do I need to make to my recipe to get this to work right?
Old 12-01-2013, 12:29 AM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 26,465
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You cannot use auto_cleanup and match_regexps together, since it is quite likely that auto_cleanup is removing the links that you intend to follow.
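
A minimal sketch of the link test this implies, not calibre's actual code. It assumes each pattern in match_regexps is tried as a regex search against a link's URL, and that a link stripped out by auto_cleanup never reaches that test. The helper name `link_matches` and the example.com URLs are hypothetical.

```python
import re

# Pattern taken from the recipe in the first post.
match_regexps = [r'page_number=[0-9]+']
compiled = [re.compile(p) for p in match_regexps]


def link_matches(url):
    """Return True if any pattern is found anywhere in the URL."""
    return any(p.search(url) for p in compiled)


# Hypothetical URLs, for illustration only.
links = [
    'http://example.com/story/d/d-id/1?page_number=2',
    'http://example.com/story/d/d-id/1?print=yes',
    'http://example.com/rss_simple.asp',
]

# Only the pager link survives the filter.
followed = [u for u in links if link_matches(u)]
print(followed)
```

If this assumption holds, the pattern itself is fine, and the pager link only has to still be present in the page when the links are examined.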
Old 12-01-2013, 11:10 AM   #3
Camper65
Okay, I removed that, but now it's not even giving me the full first page. It only gives me the title and the description under the title (possibly from the feed itself and not the article?). I also had to change the first feed's URL, since it had changed (I didn't realize that until I tested again).

(I'm using the debug batch file for testing.)

Here is the output for one article from the feed:

Spoiler:
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>EU Tells US: End Mass Spying</title><style type="text/css" title="override_css">
.article_date {
color: gray; font-family: monospace;
}

.article_description {
text-indent: 0pt;
}

a.article {
font-weight: bold; text-align:left;
}

a.feed {
font-weight: bold;
}

.calibre_navbar {
font-family:monospace;
}


</style></head>
<body><div style="text-align:center" class="calibre_navbar calibre_rescale_70">| <a href="../../feed_1/index.html">Next</a> | <a href="../index.html#article_0">Section Menu</a> | <a href="../../index.html#feed_0">Main Menu</a> | <hr />
</div>
<h2>EU Tells US: End Mass Spying</h2>
<div><span>Responding to surveillance revelations, EU officials seek changes in commercial and law enforcement data sharing arrangements with the US.</span></div>
<div style="text-align:center" class="calibre_navbar calibre_rescale_70">
<hr />
<p style="text-align:left; max-width: 100%; overflow: hidden;">This article was downloaded by <strong>calibre</strong> from <a href="http://www.informationweek.com/mobile/mobile-business/eu-tells-us-end-mass-spying/d/d-id/1112841?f_src=informationweek_sitedefault">http://www.informationweek.com/mobile/mobile-business/eu-tells-us-end-mass-spying/d/d-id/1112841?f_src=informationweek_sitedefault</a></p>
<br /><br /> | <a href="../index.html#article_0">Section Menu</a> | <a href="../../index.html#feed_0">Main Menu</a> | </div></body>
</html>


This is the feed I'm using from their website:
http://www.informationweek.com/rss_simple.asp

Using the batch file to create myrecipe.txt gives me this:

Spoiler:
Resolved conversion options
calibre version: 1.13.0
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_download_recipe': False,
'duplicate_links_in_toc': False,
'embed_all_fonts': False,
'embed_font_family': None,
'enable_heuristics': False,
'expand_css': False,
'extra_css': None,
'filter_css': None,
'fix_indents': True,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x0000000003359FD0>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'no_chapters_in_toc': False,
'no_inline_navbars': False,
'output_profile': <calibre.customize.profiles.OutputProfile object at 0x0000000003355358>,
'page_breaks_before': None,
'prefer_metadata_cover': False,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'test': (2, 2),
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
Using custom recipe
1% Fetching feeds...
1% Fetching feed InformationWeek - Stories...
1% Fetching feed InformationWeek - Software...
article.title is: Google Glass Enables Surgeons To Consult Remotely
article.title is: EU Tells US: End Mass Spying
article.title is: Google Glass Enables Surgeons To Consult Remotely
article.title is: Tablet Shopping Guide: 8 Tips
1% Trying to download cover...
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
Downloading
Downloading
Fetching file:C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\cznsa__feeds2disk.html
Fetching file:C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\e2vg4d_feeds2disk.html
Processing images...
Processing images...
Processing links...Processing links...

file:C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\e2vg4d_feeds2disk.html saved to C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\cxstei_plumber\feed_0\article_0\e2vg4d_feeds2disk.xhtml
file:C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\cznsa__feeds2disk.html saved to C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\cxstei_plumber\feed_1\article_0\cznsa__feeds2disk.xhtml
Downloaded article: EU Tells US: End Mass Spying from http://www.informationweek.com/mobil...ek_sitedefault
17% Article downloaded: EU Tells US: End Mass Spying
Downloaded article: Tablet Shopping Guide: 8 Tips from http://www.informationweek.com/mobil...nweek_node_476
34% Article downloaded: Tablet Shopping Guide: 8 Tips
34% Feeds downloaded to C:\Users\Camper\AppData\Local\Temp\calibre_j4s_xv\cxstei_plumber\index.html
34% Download finished
Parsing all content...
Parsing feed_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/index.html as HTML
Parsing feed_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_1/index.html as HTML
Parsing feed_1/article_0/index.html ...
Forcing feed_1/article_0/index.html into XHTML namespace
Parsing feed_0/article_0/index.html ...
Forcing feed_0/article_0/index.html into XHTML namespace
Parsing index.html ...
Forcing index.html into XHTML namespace
Referenced file u'feed_2/index.html' not found
Reading TOC from NCX...
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 11 items of level: div_1
Found 4 items of level: div_2
Found 4 items of level: p_2
Found 2 items of level: div_4
Ignoring level p_2
Ignoring level div_4
div_1 left margin stats: Counter()
div_1 right margin stats: Counter()
div_2 left margin stats: Counter()
div_2 right margin stats: Counter()
Cleaning up manifest...
Trimming unused files from manifest...
Creating OEB Output...
67% Running OEB Output plugin
The cover image has an id != "cover". Renaming to work around bug in Nook Color
OEB output written to G:\Camper\Documents\Calibre Data\Testing news\myrecipe
Output saved to G:\Camper\Documents\Calibre Data\Testing news\myrecipe


Any ideas?

Last edited by Camper65; 12-01-2013 at 11:16 AM.
Old 04-02-2014, 12:30 AM   #4
Camper65
Still trying to fix this recipe

I'm still having trouble getting page 2 and beyond of articles from InformationWeek.

Here is the recipe I'm trying (with two different approaches that I think might fetch more than one page):

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.feeds import Feed

class InformationWeek(BasicNewsRecipe):
    title          = u'InformationWeek'
    oldest_article = 3
    max_articles_per_feed = 150
    auto_cleanup = True
    ignore_duplicate_articles = {'title', 'url'}
    remove_empty_feeds = True
    remove_javascript = True
    use_embedded_content   = False
    recursions = 0

    feeds          = [
                          (u'InformationWeek - Stories', u'http://www.informationweek.com/rss_simple.asp'),
                          (u'InformationWeek - Software', u'http://www.informationweek.com/rss_simple.asp?f_n=476&f_ln=Software'),
                          (u'InformationWeek - Mobile', u'http://www.informationweek.com/rss_simple.asp?f_n=457&f_ln=Mobile')
                     ]

    def parse_feeds (self): 
      feeds = BasicNewsRecipe.parse_feeds(self) 
      for feed in feeds:
        for article in feed.articles[:]:
          print 'article.title is: ', article.title
          if 'healthcare' in article.title or 'healthcare' in article.url:
            feed.articles.remove(article)
      return feeds

#    def is_link_wanted(self, url, tag):
#            ans = re.match(r'href://.*/[2-9]/', url) is not None
#            if ans:
#                self.log('Following multipage link: %s'%url)
#            return ans
    
#    def postprocess_html(self, soup, first_fetch):
#            for pag in soup.findAll(True, 'pagination'):
#                pag.extract()
#            if not first_fetch:
#                h1 = soup.find('h1')
#                if h1 is not None:
#                    h1.extract()
#            return soup

#another attempt at pulling more than one page

    def append_page(self, soup, appendtag, position, surl):
        pager = soup.find('div', attrs={'class':'pages'})
        if pager:
          nextpages = soup.findAll('a', attrs={'class':'a1'})
          nextpage = nextpages[1]
          if nextpage and (nextpage['href'] != surl):
              nexturl = nextpage['href']
              soup2 = self.index_to_soup(nexturl)
              texttag = soup2.find('div', attrs={'class':'content_left_5'})
              for it in texttag.findAll(style=True):
                  del it['style']
              newpos = len(texttag.contents)
              self.append_page(soup2,texttag,newpos,nexturl)
              texttag.extract()
              pager.extract()
              appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3, '')
        pager = soup.find('div', attrs={'id':'pages'})
        if pager:
          pager.extract()
        return self.adeify_images(soup)


Here is a link to an article with more than one page:
http://www.informationweek.com/softw...d/d-id/1141628

And here is the markup for the next-page area of the first page:

Spoiler:
<div class="divsplitter" style="height: 1.25em;"></div><div style="height: 1.666em;"><div style="float: right;"><span class="smaller blue"><img src="http://img.deusm.com/informationweek/slideshow-arrow-gray-left.png" alt="Previous" style="width: 1.666em; height: 1.666em; border: 0; float: left; margin-right: 0.666em;" /><div style="float: left; height: 1.416666em; padding-top: .25em;">1 of 2</div><a href="http://www.informationweek.com/software/productivity-collaboration-apps/6-new-google-apps-tips-and-tricks/d/d-id/1141627?page_number=2" title="Next" ><img src="http://img.deusm.com/informationweek/slideshow-arrow-black-right.png" alt="Next" style="width: 1.666em; height: 1.666em; border: 0; float: right; margin-left: 0.666em;" /></a></span></div></div>


What am I doing wrong in getting the next page (and further pages, if there are more than two)?
Old 04-02-2014, 08:19 AM   #5
Divingduck
Fanatic
Divingduck can talk all four legs off a donkey... then persuade it to go for a walk.
Posts: 562
Karma: 124000
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
It may be better to look for
<p align="right">... <b><a href>... Next Page

instead of
<div class="divsplitter...>
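
A rough sketch of that idea, using Python 3's standard-library parser rather than the BeautifulSoup bundled with calibre, so it runs outside a recipe. The class name `NextPageFinder`, the helper `find_next_url`, and the sample markup are hypothetical; the real page structure should be checked against the site.

```python
from html.parser import HTMLParser


class NextPageFinder(HTMLParser):
    """Find the href of an <a> inside <p align="right"> whose text says 'Next Page'."""

    def __init__(self):
        super().__init__()
        self.in_right_p = False   # currently inside <p align="right">
        self.current_href = None  # href of the <a> being read, if any
        self.current_text = []    # text pieces seen inside that <a>
        self.next_url = None      # result: first matching href

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'p' and (attrs.get('align') or '').lower() == 'right':
            self.in_right_p = True
        elif tag == 'a' and self.in_right_p and attrs.get('href'):
            self.current_href = attrs['href']
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.current_href is not None:
            text = ''.join(self.current_text)
            if self.next_url is None and 'next page' in text.lower():
                self.next_url = self.current_href
            self.current_href = None
        elif tag == 'p':
            self.in_right_p = False


def find_next_url(html):
    """Return the next-page URL, or None when the page has no such link."""
    finder = NextPageFinder()
    finder.feed(html)
    return finder.next_url


# Hypothetical markup modelled on the description above.
sample = ('<div class="divsplitter"></div>'
          '<p align="right"><b>'
          '<a href="http://example.com/story?page_number=2">Next Page</a>'
          '</b></p>')
print(find_next_url(sample))
```

Returning None on the last page gives the caller a simple stop condition.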
Attached thumbnail: Aufzeichnen.JPG
Old 04-02-2014, 11:00 PM   #6
Camper65
Since this is the first time I'm dealing with multiple pages, I figured I'd post the changes and see if you think I've got it. I'm already fine with feeds that have single-page articles, but these multipage ones...

So I have it finding the p with align right from the web page. Unfortunately, tonight's download was all single-page articles, so I can't tell whether it worked. Thanks for your help on this recipe.

Code:
#another attempt at pulling more than one page

    def append_page(self, soup, appendtag, position, surl):
        pager = soup.find('p', attrs={'align':'right'})
        if pager:
          nextpages = soup.findAll('p', attrs={'align':'right'})
          nextpage = nextpages[1]
          if nextpage and (nextpage['href'] != surl):
              nexturl = nextpage['href']
              soup2 = self.index_to_soup(nexturl)
              texttag = soup2.find('p', attrs={'align':'right'})
              for it in texttag.findAll(style=True):
                  del it['style']
              newpos = len(texttag.contents)
              self.append_page(soup2,texttag,newpos,nexturl)
              texttag.extract()
              pager.extract()
              appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3, '')
        pager = soup.find('div', attrs={'id':'pages'})
        if pager:
          pager.extract()
        return self.adeify_images(soup)
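
For comparison, the page-following logic can also be written as a loop that stops when no next link is found. This is only a sketch under stated assumptions: `collect_pages`, `fetch`, and `find_next_url` are hypothetical names (`fetch` stands in for something like self.index_to_soup), and the pages are fake in-memory data. It may also be worth checking that the code above does not index nextpages[1] when fewer than two tags match, and that 'href' is read from the a tag rather than from the p tag that contains it.

```python
def collect_pages(first_url, fetch, find_next_url, max_pages=20):
    """Follow next-page links from first_url; return page bodies in reading order."""
    bodies = []
    seen = set()
    url = first_url
    # Stop on a missing link, a link already visited, or the page cap.
    while url and url not in seen and len(bodies) < max_pages:
        seen.add(url)
        page = fetch(url)
        bodies.append(page['body'])
        # find_next_url returns None on the last page, which ends the loop.
        url = find_next_url(page)
    return bodies


# Fake three-page article standing in for downloaded, parsed pages.
FAKE_SITE = {
    'http://example.com/a': {
        'body': 'part one',
        'next': 'http://example.com/a?page_number=2',
    },
    'http://example.com/a?page_number=2': {
        'body': 'part two',
        'next': 'http://example.com/a?page_number=3',
    },
    'http://example.com/a?page_number=3': {
        'body': 'part three',
        'next': None,
    },
}

pages = collect_pages(
    'http://example.com/a',
    fetch=FAKE_SITE.__getitem__,
    find_next_url=lambda page: page['next'],
)
print(pages)
```

The visited set and the page cap are there so that a pager that links back to itself cannot cause an endless download.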