Old 04-25-2013, 11:33 PM   #1
Camper65
Enthusiast
Camper65 began at the beginning.
 
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
Need help in trying to update .net recipe

I'm trying to rewrite the default .net recipe. It seems the way the feed URL is handled has changed, and the recipe no longer works.

I've been experimenting with it. I'm comfortable with HTML and CSS and know some PHP and Java; I never really studied Python, but I'm understanding more of it as I work through this.

The version below gets the article titles, and sometimes the descriptions, from the newsfeed, but it never passes the actual article URL along, so the full article is not downloaded.

Spoiler:
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from calibre.web.feeds.news import BasicNewsRecipe
import re

class NetMagazineRecipe (BasicNewsRecipe):
    __author__ = u'Marc Busqué <marc@lamarciana.com>'
    __url__ = 'http://www.lamarciana.com'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2012, Marc Busqué <marc@lamarciana.com>'
    title = u'.net magazine Custom'
    description = u'net is the world’s best-selling magazine for web designers and developers, featuring tutorials from leading agencies, interviews with the web’s biggest names, and agenda-setting features on the hottest issues affecting the internet today.'
    language = 'en'
    tags = 'web development, software'
    oldest_article = 7
    remove_empty_feeds = True
    no_stylesheets = True
    auto_cleanup = True
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'
    # remove_tags_above = dict(id='header')
    # remove_tags_below = [dict(name='footer')]

    # keep_only_tags = [
    #     dict(name='article', attrs={'class': re.compile('^node.*$', re.IGNORECASE)}),
    # ]
    # remove_tags = [
    #     dict(name='span', attrs={'class': 'comment-count'}),
    #     dict(name='div', attrs={'class': 'item-list share-links'}),
    #     dict(name='footer'),
    # ]
    # remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace', #'alt', 'width', 'height', 'style']
    # extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center; #font-style: italic;}'

    feeds = [
        (u'.net', u'http://feeds.feedburner.com/net/topstories?format=xml'),
    ]



(I have commented out the tag area until I can get it working then can modify it to what is needed and not needed).

In trying to get it to pass the url of the feedburner entry I'm trying the following:

Spoiler:
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
from calibre.web.feeds.news import BasicNewsRecipe
import re

class NetMagazineRecipe (BasicNewsRecipe):
    __author__ = u'Marc Busqué <marc@lamarciana.com>'
    __url__ = 'http://www.lamarciana.com'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2012, Marc Busqué <marc@lamarciana.com>'
    title = u'.net magazine Custom'
    description = u'net is the world’s best-selling magazine for web designers and developers, featuring tutorials from leading agencies, interviews with the web’s biggest names, and agenda-setting features on the hottest issues affecting the internet today.'
    language = 'en'
    tags = 'web development, software'
    oldest_article = 7
    remove_empty_feeds = True
    no_stylesheets = True
    auto_cleanup = True
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'
    # remove_tags_above = dict(id='header')
    # remove_tags_below = [dict(name='footer')]

    # keep_only_tags = [
    #     dict(name='article', attrs={'class': re.compile('^node.*$', re.IGNORECASE)}),
    # ]
    # remove_tags = [
    #     dict(name='span', attrs={'class': 'comment-count'}),
    #     dict(name='div', attrs={'class': 'item-list share-links'}),
    #     dict(name='footer'),
    # ]
    # remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace', #'alt', 'width', 'height', 'style']
    # extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center; #font-style: italic;}'

    feeds = [
        (u'.net', u'http://feeds.feedburner.com/net/topstories?format=xml'),
    ]

    def get_article_url(self, article):
        url = article.get('link', None)
        return url


Can anyone help me adjust how the URL is passed, so that the recipe turns each feed entry into an actual URL and can download the articles? Unfortunately there are no print versions of these articles, so the original pages must be used. Thanks.
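
For feeds proxied through FeedBurner, a parsed entry sometimes carries the publisher's original address in a separate field next to the plain link. Here is a minimal sketch of that lookup, written as a standalone function so it runs outside calibre. The name pick_article_url and the sample entry are hypothetical, and the feedburner_origlink key is an assumption about what the feed parser exposes; the thread does not confirm that this particular feed provides it.

```python
def pick_article_url(article):
    """Return the best article URL from a parsed feed entry (dict-like).

    Assumption: a FeedBurner-proxied entry may expose the publisher's
    original address under a key ending in '_origlink'.
    """
    for key in article:
        if key.endswith('_origlink'):
            url = article[key]
            if url and url.startswith(('http://', 'https://')):
                return url
    # Fall back to the plain link value, which may be a redirect URL.
    return article.get('link')


# Hypothetical entry, for illustration only.
entry = {
    'title': 'Example story',
    'link': 'http://feeds.example.com/~r/story01',
    'feedburner_origlink': 'http://www.example.com/news/story',
}
print(pick_article_url(entry))
```

Inside a recipe, the same logic would live in get_article_url(self, article), which is the hook being overridden in the second listing above.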
Old 04-29-2013, 07:58 PM   #2
Camper65

Found part of the solution: the documents are at least downloading now, and the next step is to clean them up before the ebook is created. The original recipe needed a complete rewrite, and since it is a rewrite, I'm putting my own info into it.

So far the code is as follows:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class dotnetMagazine (BasicNewsRecipe):
    __author__ = u'Bonni Salles'
    __version__ = '1.0'
    __license__   = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'
    title                 = '.net '
    oldest_article        = 7
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    language              = 'en'
    remove_empty_feeds    = True
    extra_css             = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

#   remove_tags_above = dict(id='header')
#   remove_tags_below = [dict(name='footer')]

#   keep_only_tags = [
#         dict(name='article', attrs={'class': re.compile('^node.*$', re.IGNORECASE)}),
#         ]
#   remove_tags = [
#         dict(name='span', attrs={'class': 'comment-count'}),
#         dict(name='div', attrs={'class': 'item-list share-links'}),
#         dict(name='footer'),
#         ]
#   remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace', #'alt', 'width', 'height', 'style']
#   extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center; #font-style: italic;}'


    feeds = [
               (u'net', u'http://feeds.feedburner.com/net/topstories')
            ]
Next I need to read up on how to remove tags before the HTML is processed; there's a lot on the page that is not needed. It took a week to figure out that the recipe needed a complete rewrite.
Old 05-10-2013, 12:01 AM   #3
Camper65
Got it, or at least it's almost a perfect recipe. Right now it still shows the comments when there are any, but most of the top and bottom of the page has been eliminated. I'd love to figure out how to use remove_tags_before when the tag has no id or class: the header is marked up as a bare <header>, I couldn't get it to work, and I ended up using remove_tags for almost everything.

Here is the recipe at this point in time.

Code:
import re  # needed for the re.compile() patterns in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup


class dotnetMagazine (BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.0'
    __license__   = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'
    title                 = '.net '
    oldest_article        = 7
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    language              = 'en'
    remove_empty_feeds    = True
    extra_css             = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    remove_tags_after = [
         dict(name='div', attrs={'class': 'footer-content'}),
          ]

    #remove_tags_before = [
    #     dict(name='div', attrs={'id': 'main-content'}),
    #     ]
          
    remove_tags = [
         dict(name='div', attrs={'class': 'item-list'}),
         dict(name=['header','footer']),
         dict(attrs={'class':re.compile('(^|| )menu($|| )', re.DOTALL)}),
         dict(name='h4', attrs={'class': 'std-hdr'}),
         dict(name=['script', 'noscript']),
         dict(name='div', attrs={'id': 'comments-form'}),
         dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
         dict(name='div', attrs={'id': 'right-col'}),

         ]
#   remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace', #'alt', 'width', 'height', 'style']
#   extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center; #font-style: italic;}'


    feeds = [
               (u'net', u'http://feeds.feedburner.com/net/topstories')
            ]

Here is the area of the article that I'm trying to work with

Code:
</ul>					</nav>
                </div>

                <div id="main-content">
                  <div id="content" >
                     
                     
                                                                  
                     
                     
                     <article class="node node-news sticky" >

   <header>
                           <h1 class="title"><span>GAAD 2013 wants accessibility on web devs' minds</span></h1>
               
      <div class="submitted" >
                     By <span class="author-name">Craig Grannell</span> on <time datetime="2013-05-09T11:35:55+00:00" >May 09, 2013</time>                             <div class="item-list share-links" ><h3>Share this article</h3><ul><li class="twitter-button first"><a href="http://twitter.com/share?url=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&amp;text=GAAD%202013%20wants%20accessibility%20on%20web%20devs%27%20minds | .net" class="twitter-share-button" data-count="none">Tweet</a></li>
<li class="facebook-button"><iframe src="http://www.facebook.com/plugins/like.php?href=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&amp;send=false&amp;layout=button_count&amp;width=47&amp;show_faces=false&amp;action=like&amp;colorscheme=light&amp;font=arial&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:47px; height:21px;" allowTransparency="true"></iframe></li>
<li class="googleplus-button"><g:plusone size="medium" count="false" ></g:plusone></li>
<li class="linkedin-button"><script type="in/share"  ></script> </li>
<li class="shorturl-button inactive last"><div class="shorturl-box"><input type="text" value="http://netm.ag/15MLtx1" /><div class="shorturl-close"></div></div><span class="shorturl-link">Short url</span></li>
</ul></div>              </div>
          
   </header>
   
   <div class="content">
Can someone please tell me how to get remove_tags_before to work? There is also a <header id="header"> near the beginning of the page, which is not where I want the article to start.
Old 05-10-2013, 02:04 AM   #5
kovidgoyal
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
remove_tags_before = dict(name='header', id=lambda x:not x)

will match a <header> tag with no id.
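
To see why the lambda picks the bare <header> and skips <header id="header">, here is a small stand-alone emulation of the matching rule using only the Python standard library. It is a simplified sketch, not calibre's or BeautifulSoup's actual code: tag_matches and first_match are hypothetical helpers, and the sketch assumes that a callable attribute matcher is called with the attribute's value, or None when the attribute is absent.

```python
from html.parser import HTMLParser


def tag_matches(name, attrs, spec):
    """Simplified check of a tag against a dict(name=..., attr=...) spec."""
    if spec.get('name') not in (None, name):
        return False
    for key, wanted in spec.items():
        if key == 'name':
            continue
        value = attrs.get(key)  # None when the attribute is missing
        if callable(wanted):
            if not wanted(value):
                return False
        elif value != wanted:
            return False
    return True


class FirstMatch(HTMLParser):
    """Remember the first start tag that satisfies the spec."""

    def __init__(self, spec):
        super().__init__()
        self.spec = spec
        self.found = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.found is None and tag_matches(tag, attrs, self.spec):
            self.found = (tag, attrs)


def first_match(html, spec):
    parser = FirstMatch(spec)
    parser.feed(html)
    return parser.found


page = ('<header id="header">site banner</header>'
        '<article><header><h1>Title</h1></header></article>')

# "not x" is True for None, so only the header without an id qualifies.
print(first_match(page, dict(name='header', id=lambda x: not x)))
# Without the id test, the site banner would be matched first.
print(first_match(page, dict(name='header')))
```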
Old 05-11-2013, 09:13 PM   #6
Camper65
Success

Kovid, thank you, that's what was needed.

The recipe is now fixed and works. Here is the final version if you want to use it in the program.


Spoiler:

import re  # needed for the re.compile() pattern in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup


class dotnetMagazine (BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'
    title = '.net '
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'

    remove_tags_after = dict(name='footer', id=lambda x:not x)
    remove_tags_before = dict(name='header', id=lambda x:not x)


    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name='div', attrs={'class': 'item-list share-links'}), #removes share links
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}), #comment these out if you want the comments to show
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
        dict(name='div', attrs={'id': 'comments'}), #comment these out if you want the comments to show
        dict(name='div', attrs={'class': 'item-list related-content'}),

    ]

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories')
    ]

Last edited by Camper65; 05-11-2013 at 11:39 PM.
Old 05-26-2013, 08:50 AM   #7
Camper65
New issue, and I'm not sure whether it's just a problem this week. I had to add recursions = 1 to force it to download the article. The feed site now shows an ad page first, and you have to follow "Click here to continue to article". How can I have the recipe skip that first page automatically, or in other words, get the right link from that ad page?

(here's the modified recipe to try)

Spoiler:
import re  # needed for the re.compile() pattern in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup


class dotnetMagazine (BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'
    title = '.net magazine'
    oldest_article = 7
    no_stylesheets = True
    recursions = 1
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'

    remove_tags_after = dict(name='footer', id=lambda x:not x)
    remove_tags_before = dict(name='header', id=lambda x:not x)


    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name='div', attrs={'class': 'item-list share-links'}),
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}),
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
        dict(name='div', attrs={'id': 'comments'}),
        dict(name='div', attrs={'class': 'item-list related-content'}),

    ]

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories')
    ]
Old 05-26-2013, 09:04 AM   #8
kovidgoyal
You need to use the obfuscated articles infrastructure in the recipe. Set articles_are_obfuscated = True and then implement get_obfuscated_article() in your recipe.
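
As a rough illustration of what such a method has to do: it receives the feed URL, obtains the real article HTML by whatever means, writes it to a file, and returns that file's path. The sketch below shows only that shape with the standard library, so it runs without calibre or a network connection. save_article_html and fake_fetch are hypothetical names, and fetch stands in for the recipe's browser.

```python
import os
import tempfile


def save_article_html(url, fetch):
    """Fetch `url` with the supplied `fetch` callable and return a file path.

    `fetch` is a hypothetical stand-in for the recipe's browser: it takes a
    URL and returns the page as bytes.
    """
    html = fetch(url)
    # delete=False keeps the file on disk after it is closed, so the caller
    # can open it later by path.
    with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as f:
        f.write(html)
        return f.name


def fake_fetch(url):
    # Stand-in for a real download, so the sketch runs offline.
    return b'<html><body><p>article body</p></body></html>'


path = save_article_html('http://example.com/story', fake_fetch)
with open(path, 'rb') as f:
    print(f.read())
os.remove(path)  # clean up the demo file
```

In a real recipe the body would sit inside get_obfuscated_article(self, url) and use the recipe's own browser and temporary-file helper instead of these stand-ins.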
Old 05-26-2013, 02:15 PM   #9
Camper65
Kovid,

I added the following to the test recipe and it's not working.

At the top, with the other True/False settings:
articles_are_obfuscated = True

Just before the feeds:
def get_obfuscated_article(self, url):
    raise NotImplementedError

and I get the following in the recipe.txt printout:

Spoiler:
Resolved conversion options
calibre version: 0.9.28
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_download_recipe': False,
'duplicate_links_in_toc': False,
'embed_font_family': None,
'enable_heuristics': False,
'extra_css': None,
'filter_css': None,
'fix_indents': True,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x021E59B0>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'no_chapters_in_toc': False,
'no_inline_navbars': False,
'output_profile': <calibre.customize.profiles.OutputProfile object at 0x021E5B90>,
'page_breaks_before': None,
'prefer_metadata_cover': False,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'test': True,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
Using custom recipe
1% Fetching feeds...
1% Fetching feed net...
1% Trying to download cover...
34% Downloading cover from http://media.netmagazine.futurecdn.n...etmag/logo.png
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 thread(s)]...
Failed to download article: Designer says 'stop using Helvetica and Arial' from http://rss.feedsportal.com/c/32632/f...69/story01.htm
Traceback (most recent call last):
File "site-packages\calibre\utils\threadpool.py", line 95, in run
File "site-packages\calibre\web\feeds\news.py", line 1107, in fetch_obfuscated_article
File "<string>", line 40, in get_obfuscated_article
NotImplementedError



17% Article download failed: Designer says 'stop using Helvetica and Arial'
Failed to download article: The .net strip #36: Roger Federer from http://rss.feedsportal.com/c/32632/f...er/story01.htm
Traceback (most recent call last):
File "site-packages\calibre\utils\threadpool.py", line 95, in run
File "site-packages\calibre\web\feeds\news.py", line 1107, in fetch_obfuscated_article
File "<string>", line 40, in get_obfuscated_article
NotImplementedError



34% Article download failed: The .net strip #36: Roger Federer
34% Feeds downloaded to G:\Users\Camper\AppData\Local\Temp\calibre_hm5czb\ m_zrcy_plumber\index.html
34% Download finished
Failed to download the following articles:
Designer says 'stop using Helvetica and Arial' from net
http://rss.feedsportal.com/c/32632/f...69/story01.htm
Traceback (most recent call last):
File "site-packages\calibre\utils\threadpool.py", line 95, in run
File "site-packages\calibre\web\feeds\news.py", line 1107, in fetch_obfuscated_article
File "<string>", line 40, in get_obfuscated_article
NotImplementedError

The .net strip #36: Roger Federer from net
http://rss.feedsportal.com/c/32632/f...er/story01.htm
Traceback (most recent call last):
File "site-packages\calibre\utils\threadpool.py", line 95, in run
File "site-packages\calibre\web\feeds\news.py", line 1107, in fetch_obfuscated_article
File "<string>", line 40, in get_obfuscated_article
NotImplementedError

Parsing all content...
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/index.html as HTML
Reading TOC from NCX...
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 3 items of level: div_1
Found 2 items of level: div_2
Found 2 items of level: p_2
Ignoring level p_2
div_1 left margin stats: Counter()
div_1 right margin stats: Counter()
div_2 left margin stats: Counter()
div_2 right margin stats: Counter()
Cleaning up manifest...
Trimming unused files from manifest...
Creating OEB Output...
67% Running OEB Output plugin
The cover image has an id != "cover". Renaming to work around bug in Nook Color
OEB output written to G:\Users\Camper\Documents\Calibre Library\Testing news\myrecipe
Output saved to G:\Users\Camper\Documents\Calibre Library\Testing news\myrecipe



What am I missing in the middle of these lines?
def get_obfuscated_article(self, url):
    raise NotImplementedError
Old 05-26-2013, 11:32 PM   #10
kovidgoyal
Look at a builtin recipe that uses get_obfuscated_article, or read the API documentation for that function.
Old 05-27-2013, 11:16 AM   #11
Camper65
Got it

Got a corrected version again. But give me a week or two to make sure that the changes feedsportal made are permanent and not just a fluke, since I now download this only once a week (they only publish articles five days a week, so it's better to download on Saturday or Sunday to get that week's articles).

I'll let you know next week, hopefully, in between installing components in the new tower I'm building.

Code:
import re  # needed for the re.compile() pattern in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup
from calibre.web.feeds import feed_from_xml, templates, feeds_from_index, Feed
from calibre.ptempfile import PersistentTemporaryFile

class dotnetMagazine (BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.1'
    __license__   = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'
    title                 = '.net '
    oldest_article        = 7
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    recursion = 1  # no effect as written: the BasicNewsRecipe option is spelled "recursions"
    articles_are_obfuscated = True
    language              = 'en'
    remove_empty_feeds    = True
    extra_css             = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'

    remove_tags_after = dict(name='footer', id=lambda x:not x)
    remove_tags_before = dict(name='header', id=lambda x:not x)


    remove_tags = [
         dict(name='div', attrs={'class': 'item-list'}),
         dict(name='h4', attrs={'class': 'std-hdr'}),
         dict(name='div', attrs={'class': 'item-list share-links'}), #removes share links
         dict(name=['script', 'noscript']),
         dict(name='div', attrs={'id': 'comments-form'}), #comment these out if you want the comments to show
         dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
         dict(name='div', attrs={'id': 'right-col'}),
         dict(name='div', attrs={'id': 'comments'}), #comment these out if you want the comments to show
         dict(name='div', attrs={'class': 'item-list related-content'}),

         ]

    feeds = [
               (u'net', u'http://feeds.feedburner.com/net/topstories?format=xml')
            ]

    temp_files = []

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print('THE CURRENT URL IS: %s' % url)
        # Open the URL once and read the page it resolves to.
        response = br.open(url)
        html = response.read()

        # Save the HTML to a temporary file and hand its path back to calibre.
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

Camper65
Old 05-27-2013, 11:38 AM   #12
kovidgoyal
Don't you need to actually parse the returned HTML to see if it contains an ad and find the correct article in that case? Then you don't need recursion = 1 any more.
Old 05-27-2013, 12:30 PM   #13
Camper65
Enthusiast
Camper65 began at the beginning.
 
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
Perhaps I didn't explain the initial change properly. When you click on a feed link (at least the first time I did it), it takes you to a page with an ad in the middle and, on the right-hand side, "Click here to continue to article" in multiple languages; clicking that takes you directly to the article. All I saw when this week's .net downloaded was a batch of those multi-language "Click here to continue to article" pages, with no article text.

When I added recursion = 1, it gave me the "Click here to continue to article" page first and then the actual article as a new entry, but the click-through page still appeared between each article. This now gets the articles only. That's why I'm waiting a week to see if the feed changes are permanent before I'm sure the changes are needed.

If you know of another way to get around this double clicking to get the article, please let me know. If you want me to send you an epub so you can see what it is doing, let me know.
Old 05-27-2013, 12:48 PM   #14
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Ah, I see, just add this to your recipe:

Code:
    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)
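
For anyone who wants to try the same idea outside calibre, here is a minimal standalone sketch of the lookup that the method above performs: find the link whose text is the "continue" prompt and return its href. It is not calibre code. It uses only Python 3's standard `html.parser`, and the names `ContinueLinkFinder`, `find_article_url`, `AD_PAGE` and `ARTICLE_PAGE`, the sample HTML and the example.com URL are all made up for illustration. It also assumes the prompt is the whole text of the `<a>` tag and compares it case-insensitively, which the real page may or may not require.

```python
from html.parser import HTMLParser

# Prompt text taken from the thread; compared in lower case (an assumption).
CONTINUE_TEXT = 'click here to continue to article'


class ContinueLinkFinder(HTMLParser):
    """Hypothetical helper: remember the href of the first <a> whose
    text is the continue prompt."""

    def __init__(self):
        super().__init__()
        self._href = None        # href of the <a> currently open, if any
        self._text = []          # text pieces seen inside that <a>
        self.article_url = None  # result, stays None on a normal page

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            text = ''.join(self._text).strip().lower()
            if self.article_url is None and text == CONTINUE_TEXT:
                self.article_url = self._href
            self._href = None


def find_article_url(html):
    """Return the real article URL if html is an ad page, else None."""
    finder = ContinueLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.article_url


# Invented sample pages, only to exercise the function.
AD_PAGE = ('<html><body><div>ad</div>'
           '<a href="http://example.com/article">'
           'Click here to continue to article</a></body></html>')
ARTICLE_PAGE = ('<html><body><article>Text '
                '<a href="/x">more</a></article></body></html>')

print(find_article_url(AD_PAGE))
print(find_article_url(ARTICLE_PAGE))
```

In the recipe itself none of this is needed, since the soup object passed to skip_ad_pages already does the searching; the sketch only makes the "is this an ad page, and where is the real article" check easy to experiment with on saved HTML.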
Old 05-27-2013, 04:09 PM   #15
Camper65
Enthusiast
Camper65 began at the beginning.
 
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1

Kovid,

Thank you!!!! That did it. The articles now download like normal and do not have that extra page in there.

Here is the updated recipe again so you can use it next time you do updates to Calibre.

Also, thank you for creating such a great ebook organizer/news download program.

Code:
import re  # needed for the re.compile() call in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup
from calibre.web.feeds import feed_from_xml, templates, feeds_from_index, Feed
from calibre.ptempfile import PersistentTemporaryFile

class dotnetMagazine (BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.1'
    __license__   = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'
    title                 = '.net magazine'
    oldest_article        = 7
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    #recursion = 1
    language              = 'en'
    remove_empty_feeds    = True
    extra_css             = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'

    remove_tags_after = dict(name='footer', id=lambda x:not x)     
    remove_tags_before = dict(name='header', id=lambda x:not x)


    remove_tags = [
         dict(name='div', attrs={'class': 'item-list'}),
         dict(name='h4', attrs={'class': 'std-hdr'}),
         dict(name='div', attrs={'class': 'item-list share-links'}), #removes share links
         dict(name=['script', 'noscript']),
         dict(name='div', attrs={'id': 'comments-form'}), #comment these out if you want the comments to show
         dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
         dict(name='div', attrs={'id': 'right-col'}),
         dict(name='div', attrs={'id': 'comments'}), #comment these out if you want the comments to show
         dict(name='div', attrs={'class': 'item-list related-content'}),

         ]
         
    feeds = [
               (u'net', u'http://feeds.feedburner.com/net/topstories?format=xml')
            ]
  
    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)
All times are GMT -4.