Old 09-24-2010, 02:32 PM   #1
noah
The Bay Citizen - recipe help

Thanks to TonytheBookworm whose helpful post got me started with a recipe for The Bay Citizen.

I modified the recipe to extract content from the regular (non-print) story pages, because I wanted the pictures which aren't included in the print versions.

Spoiler:
# this block is pretty much standard on all recipes
#----------------------------------------------------------------------------------------------------------
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'The Bay Citizen'
    language = 'en'
    __author__ = 'TonytheBookworm and noah'
    description = 'The Bay Citizen'
    publisher = 'The Bay Citizen'
    category = 'news'
    oldest_article = 1  # how many days back to go in the feed
    max_articles_per_feed = 20  # maximum number of articles to download per feed
    no_stylesheets = True  # do not download the site's CSS stylesheets

    masthead_url = 'http://media.baycitizen.org/images/layout/logo1.png'  # puts the logo on the main menu page
    #---------------------------------------------------------------------------------------------------------

    # here we tell the recipe what feed(s) we wish to obtain
    #-----------------------------------------------------------------------------------------
    feeds = [
        ('Main Feed', 'http://www.baycitizen.org/feeds/stories/'),
    ]
    #------------------------------------------------------------------------------------------

    # keep only the story container from the regular (non-print) page
    keep_only_tags = [dict(name='div', attrs={'class':'story'})]

    # drop the social sharing bar
    remove_tags = [dict(name='div', attrs={'class':'socialBar'})]


It works pretty well, but I have two questions/problems:
  1. In both my version and Tony's, Calibre is forming the section menu using the <media:title> element from the feed, instead of the <title> element. How can I get it to use the <title> element, which is what it actually should be doing?
  2. In my version, certain stories appear as complete gobbledygook -- huge strings of strange characters. Help!?
Old 09-24-2010, 02:59 PM   #2
Starson17
Quote:
Originally Posted by noah
  1. In both my version and Tony's, Calibre is forming the section menu using the <media:title> element from the feed, instead of the <title> element. How can I get it to use the <title> element, which is what it actually should be doing?
  2. In my version, certain stories appear as complete gobbledygook -- huge strings of strange characters. Help!?
1) You're using the default feed and article parsing system. It parses what it parses, and you'll have to override it to get different behavior. You can use parse_index() to read the feed page and scrape any title you want.

2) I've no idea. Post some links and samples. Use "print" in your recipe to see what's happening.
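
For point 1, here is a minimal sketch of the idea, not a tested Bay Citizen recipe. It reads the feed XML with Python's standard library and takes the plain <title> of each item, ignoring <media:title>, then returns the list-of-tuples shape that parse_index() is expected to return. The helper names (articles_from_feed, build_index) and the sample feed are made up for illustration; inside a recipe you would fetch the real feed yourself, and the commented parse_index() at the bottom assumes index_to_soup() accepts raw=True in your calibre version.

```python
import xml.etree.ElementTree as ET

# Invented sample: one item carrying both a plain <title> and a <media:title>.
SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Real headline</title>
      <media:title>Photo caption</media:title>
      <link>http://example.com/story-1/</link>
      <description>Summary text</description>
      <pubDate>Wed, 06 Oct 2010 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def articles_from_feed(xml_text):
    """Build article dicts using only the un-namespaced <title> of each item."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter('item'):
        # findtext('title') only matches the plain <title>; ElementTree sees
        # <media:title> as '{http://search.yahoo.com/mrss/}title', so it is skipped.
        articles.append({
            'title': (item.findtext('title') or 'Untitled').strip(),
            'url': (item.findtext('link') or '').strip(),
            'description': (item.findtext('description') or '').strip(),
            'date': (item.findtext('pubDate') or '').strip(),
        })
    return articles


def build_index(section_title, xml_text):
    """Return [(section title, [article dicts])], the shape parse_index() returns."""
    return [(section_title, articles_from_feed(xml_text))]


# Hypothetical hook inside the recipe class (needs calibre, so left as a comment):
#
#     def parse_index(self):
#         raw = self.index_to_soup('http://www.baycitizen.org/feeds/stories/', raw=True)
#         return build_index('Main Feed', raw)

print(build_index('Main Feed', SAMPLE_FEED)[0][1][0]['title'])
```

Note that when a recipe defines parse_index(), the feeds list is no longer used to build the index, so date filtering such as oldest_article would have to be handled in your own code.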
Old 10-06-2010, 04:44 PM   #3
noah
Quote:
Originally Posted by Starson17
1) You're using the default feed and article parsing system. It parses what it parses, and you'll have to override it to get different behavior. You can use parse_index() to read the feed page and scrape any title you want.
OK, thanks. I don't think I have the skills for this, so if anyone else wants to take a stab at it, I'd be grateful!

Quote:
Originally Posted by Starson17
2) I've no idea. Post some links and samples. Use "print" in your recipe to see what's happening.
I'm attaching an example. Calibre generated this ebook based on the recipe mentioned above. See p. 43 of 63 ("Meg Whitman"). It's all garbage characters. Can you tell me why this is happening?

Also, where would I put "print" in the recipe? Sorry, I'm not very advanced and I don't see a "print" command mentioned in the documentation.
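
On the "print" question: it is just the ordinary Python print, not a calibre command, which is why it is not in the recipe documentation. A rough sketch of one way to use it follows. The helper summarize_for_debug is a made-up name, and the commented preprocess_html() shows where a call could go inside a recipe class; that part only runs inside calibre.

```python
def summarize_for_debug(label, text, width=80):
    """Return a one-line summary of some text, meant to be passed to print()."""
    # repr() makes odd bytes and control characters visible instead of
    # letting them scramble the console.
    snippet = repr(text[:width])
    return '%s: %d chars, starts with %s' % (label, len(text), snippet)


# Hypothetical placement inside the recipe class (requires calibre):
#
#     def preprocess_html(self, soup):
#         print(summarize_for_debug('article', str(soup)))
#         return soup

print(summarize_for_debug('demo', '<html><body>Hello</body></html>'))
```

The printed lines show up in the conversion log. Running the recipe from a command line with something like ebook-convert BayCitizen.recipe .epub --test -vv (the recipe file name here is only an example) makes that output easy to read.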
Attached Files
File Type: epub BayCitizen [Wed, 06 Oct 2010] - calibre.epub (212.7 KB, 192 views)
Old 03-11-2011, 04:21 AM   #4
noah
Bay Citizen: updated multipage recipe

The Bay Citizen recently added pagination to their site. Here's a new recipe that handles articles that span multiple pages. (Adapted from the Adventure Gamers recipe.)

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class TheBayCitizen(BasicNewsRecipe):
    title                 = 'The Bay Citizen'
    language              = 'en'
    __author__            = 'noah'
    description           = 'The Bay Citizen'
    publisher             = 'The Bay Citizen'
    INDEX                 = u'http://www.baycitizen.org'
    category              = 'news'
    oldest_article        = 2
    max_articles_per_feed = 20
    no_stylesheets        = True
    masthead_url          = 'http://media.baycitizen.org/images/layout/logo1.png'
    feeds                 = [('Main Feed', 'http://www.baycitizen.org/feeds/stories/')]
    keep_only_tags        = [dict(name='div', attrs={'class':'story'})]
    remove_tags           = [
                             dict(name='div', attrs={'class':'socialBar'}),
                             dict(name='div', attrs={'id':'text-resize'}),
                             dict(name='div', attrs={'class':'story relatedContent'}),
                             dict(name='div', attrs={'id':'comment_status_loading'}),
                            ]

    def append_page(self, soup, appendtag, position):
        # Follow the "next page" link, if there is one, and splice that
        # page's story body into appendtag at the given position.
        pager = soup.find('a', attrs={'class':'stry-next'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'body'})
            if texttag is None:
                # The next page has no story body, so there is nothing to append.
                return
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            # Recurse first so that any later pages are already attached to this one.
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        # Strip inline styles, then pull in any continuation pages.
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        # Remove the pagination block and the 'cont-from-prev' markers.
        for trash in soup.findAll(id='story-pagination'):
            trash.extract()
        for trash in soup.findAll('em', 'cont-from-prev'):
            trash.extract()
        return soup
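
For anyone adapting this to another paginated site, the "follow the next link until there isn't one" logic in append_page can be tried out without calibre. The sketch below is only an illustration under made-up assumptions: FAKE_SITE stands in for the real pages, collect_pages is a hypothetical helper, and it uses a loop with a visited set and a page cap instead of the recursion above, so that a site whose "next" link points back to an earlier page cannot loop forever.

```python
from urllib.parse import urljoin

# Invented stand-in for a site: URL -> (body text, href of the next page or None).
FAKE_SITE = {
    'http://example.com/story/': ('Part one.', '/story/?page=2'),
    'http://example.com/story/?page=2': ('Part two.', '/story/?page=3'),
    'http://example.com/story/?page=3': ('Part three.', None),
}


def collect_pages(fetch, start_url, max_pages=20):
    """Return the body of every page reachable by following 'next' links."""
    bodies = []
    seen = set()
    url = start_url
    while url and url not in seen and len(bodies) < max_pages:
        seen.add(url)
        body, next_href = fetch(url)
        bodies.append(body)
        # urljoin resolves both relative and absolute hrefs against the current page.
        url = urljoin(url, next_href) if next_href else None
    return bodies


print(collect_pages(FAKE_SITE.__getitem__, 'http://example.com/story/'))
```

In the recipe itself, fetch would be replaced by index_to_soup plus the soup.find calls shown above.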