Old 08-24-2011, 04:18 AM   #1
rogerx
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
Fairbanks Daily News-miner News Recipe Submission

Here's the best I could do for the Fairbanks Daily News-miner newspaper.

I only know Bash and C well, and only a little Python.

I figure that when somebody else here in Alaska pulls this into Calibre, they'll at least have an outline of the known bugs/anomalies to work from, rather than starting from nothing and hunting down each one. (These are marked with inline TODO comments.)

Oh, I could pretty this up for you people taking submissions and hide the anomalies, but I'm more of an honest guy, and I know that leaving comments (even if sometimes too many) is a far better method! This way, someone else can easily fix things rather than struggling blind from the beginning.

This recipe is well commented for anybody willing to fix things more:
1) Article titles should likely be in a bold font?
2) Only story_item_date is needed; the number of views/posts and the pipe symbols should be omitted.
3) For some reason, a newline is needed after each index/TOC entry when pulling more than one RSS feed.

Sorry, I am not a fan of Python or QT!
(... and for some reason, I can't attach this file and can only post it inline. :-( )

#import re          # Provides preprocess_regexps re.compile
#import string      # Provides self.tag_to_string
#from calibre import strftime

from calibre.web.feeds.news import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, NavigableString   # Provides soup

class FairbanksDailyNewsminer(BasicNewsRecipe):
    title          = u'Fairbanks Daily News-miner'
    __author__ = 'Roger'
    oldest_article = 7
    max_articles_per_feed = 100

    description = '''The voice of interior Alaska since 1903'''
    publisher   = ''
    category    = 'news, Alaska, Fairbanks'
    language    = 'en'
    #extra_css   = '''
    #                p{font-weight: normal;text-align: justify}
    #              '''

    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True
    encoding = 'utf8'
    conversion_options = {'linearize_tables':True}
    # TODO: I don't see any photos in my Mobi file with this masthead_url!
    masthead_url = ''

    # In order to omit seeing number of views, number of posts and the pipe
    # symbol for divider after the title and date of the article, a regex or
    # manual processing is needed to get just the "story_item_date updated"
    # (which contains the date).  Everything else on this line is pretty much not needed.
    # HTML line containing story_item_date:
    # <div class="signature_line"><span title="2011-08-22T23:37:14Z" class="story_item_date updated">Aug 22, 2011</span>&nbsp;|&nbsp;2370&nbsp;views&nbsp;|&nbsp;52&nbsp;<a href="/pages/full_story/push?article-Officials+tout+new+South+Cushman+homeless+living+facility%20&id=15183753#comments_15183753"><img alt="52 comments" class="dont_touch_me" src="" title="52 comments" /></a>&nbsp;|&nbsp;<span id="number_recommendations_15183753" class="number_recommendations">9</span>&nbsp;<a href="#1" id="recommend_link_15183753" onclick="Element.remove('recommend_link_15183753'); new Ajax.Request('/community/content/recommend/15183753', {asynchronous:true, evalScripts:true}); return false;"><img alt="9 recommendations" class="dont_touch_me" src="" title="9 recommendations" /></a>&nbsp;|&nbsp;<a href="#1" onclick="$j.facebox({ajax: '/community/content/email_friend_pane/15183753'}); return false;"><span style="position: relative;"><img alt="email to a friend" class="dont_touch_me" src="" title="email to a friend" /></span></a>&nbsp;|&nbsp;<span><a href="/printer_friendly/15183753" target="_blank"><img alt="print" class="dont_touch_me" src="" title="print" /></a></span><span id="email_content_message_15183753" class="signature_email_message"></span></div>

    # The following was suggested, but it looks like I also need to define self & soup
    # (as well as bring in extra soup depends?)
    #date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'}))

    #preprocess_regexps = [(re.compile(r'<span[^>]*addthis_separator*>'), lambda match: '') ]
    #preprocess_regexps = [(re.compile(r'span class="addthis_separator">|</span>'), lambda match: '') ]
    #preprocess_regexps = [
    #           (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    #               ]
    #def get_browser(self):
    #def preprocess_html(self, soup):
    #    date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'}))
    #    return soup

    # Try to keep some tags - some might not be needed here
    keep_only_tags = [
                        #date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'})),
                        dict(name='div', attrs={'class':'hnews hentry item'}),
                        dict(name='div', attrs={'class':'story_item_headline entry-title'}),
                        #dict(name='span', attrs={'class':'story_item_date updated'}),
                        dict(name='div', attrs={'class':'full_story'})
                     ]

    #remove_tags = [
    #                dict(name='div', attrs={'class':'story_tools'}),
    #                dict(name='p', attrs={'class':'ad_label'}),
    #              ]

    # Try to remove some bothersome tags
    remove_tags = [
                    #dict(name='img', attrs={'alt'}),
                    dict(name='img', attrs={'class':'dont_touch_me'}),
                    dict(name='span', attrs={'class':'number_recommendations'}),
                    #dict(name='div', attrs={'class':'signature_line'}),
                    dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'}),
                    dict(name='div', attrs={'class':['addthis_toolbox','addthis_default_style']}),
                    dict(name='span', attrs={'class':'addthis_separator'}),
                    dict(name='div', attrs={'class':'related_content'}),
                    dict(name='div', attrs={'class':'comments_container'}),
                    dict(name='div', attrs={'id':'comments_container'})
                  ]

    # This one works but only gets title, date and clips article content!
    #remove_tags_after = [
    #                        dict(name='span', attrs={'class':'story_item_date updated'})
    #                    ]
    #remove_tags_after = [
    #                        dict(name='div', attrs={'class':'advertisement'}),
    #                    ]

    # Try clipping tags before and after to prevent pulling img views/posts numbers after date?
    #remove_tags_before = [
    #                        dict(name='span', attrs={'class':'story_item_date updated'})
    #                     ]

    #extra_css # tweak the appearance # TODO: Change article titles <h2?> to bold?

    # Comment-out or uncomment any of the following RSS feeds according to your
    # liking.
    # TODO: When more than one RSS feed is added, the newline is omitted between
    # entries within the Table of Contents or Index of Articles.
    # TODO: Some random bits of text trail the last page (or the TOC in MOBI
    # files); these are bits of public posts and comments and also need to be
    # removed.
    feeds = [
        (u'Alaska News', u''),
        (u'Local News', u''),
     #  (u'Business', u''),
     #  (u'Politics', u''),
     #  (u'Sports', u''),
     #  (u'Latitude 65 feed', u''),
        (u'Sundays', u''),
     #  (u'Outdoors', u''),
     #  (u'Fairbanks Grizzlies', u''),
        (u'Newsminer', u''),
     #  (u'Opinion', u''),
     #  (u'Youth', u''),
     #  (u'Dermot Cole Blog', u''),
     #  (u'Dermot Cole Column', u''),
     #  (u'Sarah Palin', u'')
    ]
Old 08-25-2011, 09:28 AM   #2
rogerx
I've finally figured out that the recipe file has to be renamed with a .txt extension to upload it.
Hopefully this attaches fine without line wrap issues.

I've also updated this recipe with bold font on article titles, as well as a few other font modifications.
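
For anyone reading along without the attachment, here is a minimal sketch of what bold article titles could look like. extra_css is the recipe attribute already stubbed out in the first post; the selectors are assumptions based on the class names used in keep_only_tags, and the attached recipe may well do it differently.

```python
# Sketch only: inside the recipe this would be a class attribute of
# FairbanksDailyNewsminer. The selectors are assumed from the class names
# in keep_only_tags and have not been checked against the live site.
extra_css = '''
    .story_item_headline { font-weight: bold; font-size: large }
    .story_item_date     { font-style: italic; font-size: small }
    p                    { font-weight: normal; text-align: justify }
'''
```

The p rule is the one that was commented out in the original post.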

It now has a masthead_url (header image for Kindle/MOBI reader devices).

There are quite a few comments embedded, but they're necessary if somebody wants to try editing the signature_line (|date line|num of views|num of comments||).
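
One way the signature_line could be trimmed down to just the date is a regular expression, sketched below. It is standalone Python using only the re module; SIGNATURE_LINE is an abbreviated copy of the markup quoted in the recipe comments, and the names SIGNATURE_RE and keep_only_date are made up for this example. It assumes the signature_line div never contains a nested div, which holds for the quoted sample but is untested against the live site.

```python
import re

# Abbreviated copy of the signature_line markup quoted in the recipe comments.
SIGNATURE_LINE = (
    '<div class="signature_line">'
    '<span title="2011-08-22T23:37:14Z" class="story_item_date updated">'
    'Aug 22, 2011</span>&nbsp;|&nbsp;2370&nbsp;views&nbsp;|&nbsp;52&nbsp;'
    '<a href="#comments"><img alt="52 comments" class="dont_touch_me" /></a>'
    '</div>'
)

# Match the whole signature_line div, capturing only the date span.
# Assumes there is no nested <div> inside the signature line.
SIGNATURE_RE = re.compile(
    r'<div class="signature_line">\s*'
    r'(<span[^>]*class="story_item_date updated"[^>]*>.*?</span>)'
    r'.*?</div>',
    re.DOTALL | re.IGNORECASE,
)


def keep_only_date(html):
    """Replace each signature_line div with a div holding just the date span."""
    return SIGNATURE_RE.sub(r'<div class="signature_line">\1</div>', html)


print(keep_only_date(SIGNATURE_LINE))
```

In the recipe itself, the same pattern could probably go into preprocess_regexps as a (compiled pattern, function) pair, for example (SIGNATURE_RE, lambda match: '<div class="signature_line">' + match.group(1) + '</div>'), with "import re" uncommented at the top.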

Actually, this is looking pretty good. Think I'll relax and read the newspaper now. As far as I'm concerned, go ahead and submit this recipe.
Attached Files
File Type: txt fairbanksdailynewsminer.txt (8.7 KB, 449 views)
Old 08-25-2011, 08:30 PM   #3
rogerx
Here's a newly updated recipe, with some feeds that were causing duplicate stories/articles commented out.

1) Commented out the Newsminer RSS feed - this feed combines all of the other RSS feeds into one URL.

2) Commented out the Sundays RSS feed - this feed is for readers who consistently miss Sunday's news.
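
An alternative to commenting feeds out would be to drop any article whose URL has already been seen in an earlier feed. The sketch below shows only the filtering idea in plain Python: the (feed title, article list) structure, the function name drop_duplicate_articles and the example.com URLs are placeholders for illustration, not what Calibre hands to a recipe.

```python
def drop_duplicate_articles(feeds):
    """Keep the first occurrence of each article URL across all feeds.

    `feeds` is a list of (feed_title, [(article_title, url), ...]) pairs,
    a stand-in for whatever structure the recipe actually works with.
    """
    seen = set()
    result = []
    for feed_title, articles in feeds:
        unique = []
        for article_title, url in articles:
            if url in seen:
                continue  # already listed under an earlier feed
            seen.add(url)
            unique.append((article_title, url))
        result.append((feed_title, unique))
    return result


# Placeholder data: 'Story A' appears in both feeds.
feeds = [
    ('Alaska News', [('Story A', 'http://example.com/a'),
                     ('Story B', 'http://example.com/b')]),
    ('Newsminer', [('Story A', 'http://example.com/a'),
                   ('Story C', 'http://example.com/c')]),
]

for feed_title, articles in drop_duplicate_articles(feeds):
    print(feed_title, [title for title, _ in articles])
```

Newer Calibre releases also document an ignore_duplicate_articles recipe attribute (for example {'title', 'url'}) that may make a hand-written filter unnecessary; whether it is available depends on the Calibre version in use.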
Attached Files
File Type: txt fairbanksdailynewsminer.txt (9.0 KB, 96 views)