Old 06-12-2012, 02:56 PM   #1
terminalveracity
Member
Posts: 18
Karma: 6000
Join Date: Jun 2012
Device: Kindle Keyboard 3G
Mod for Smithsonian to clean up--problem removing untagged text

This is my first time trying this out; hopefully I haven't butchered things too badly.

I've removed the comments and cleaned out most of the extraneous stuff using this:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
import re

class SmithsonianMagazine(BasicNewsRecipe):
    title          = u'Smithsonian Magazine'
    language       = 'en'
    __author__     = 'Krittika Goyal (mod by TerminalVeracity)'
    oldest_article = 31  # days
    max_articles_per_feed = 50
    use_embedded_content = False
    recursions = 1
    cover_url = 'http://sphotos.xx.fbcdn.net/hphotos-snc7/431147_10150602715983253_764313347_n.jpg'
    match_regexps = ['&page=[2-9]$']
    #preprocess_regexps = [
    #    (re.compile(r'<p style="font-family: Arial, Helvetica, sans-serif; font-weight:bold; color:#000;">&nbsp; &nbsp; <a href="/subArticleBottomWeb" style="color:#900">Subscribe now</a> for more of Smithsonian\'s coverage on history, science and nature. </p>', re.I|re.DOTALL), lambda match: ''),
    #    ]

    remove_stylesheets = True
    remove_tags_after  = dict(name='div', attrs={'class':['post','articlePaginationWrapper']})
    remove_tags = [
        dict(name='iframe'),
        dict(name='div', attrs={'class':'article_sidebar_border'}),
        dict(name='div', attrs={'id':['article_sidebar_border', 'most-popular_large', 'most-popular-body_large']}),
        dict(name='ul', attrs={'class':'cat-breadcrumb col three last'}),
        dict(name='div', attrs={'class':'addtoany_share_save_container'}),
        dict(name='div', attrs={'class':'meta'}),
        dict(name='div', attrs={'class':'social'}),
        dict(name='h4', attrs={'id':'related-topics'}),
        dict(name='table'),
        dict(name='div', attrs={'class':'OUTBRAIN'}),
        dict(name='div', attrs={'id':'comment_section'}),
        dict(name='div', attrs={'id':'article-related'}),
        dict(name='div', attrs={'class':'related-articles-inpage'}),
    ]


    feeds = [
        ('History and Archeology',
         'http://feeds.feedburner.com/smithsonianmag/history-archaeology'),
        ('People and Places',
         'http://feeds.feedburner.com/smithsonianmag/people-places'),
        ('Science and Nature',
         'http://feeds.feedburner.com/smithsonianmag/science-nature'),
        ('Arts and Culture',
         'http://feeds.feedburner.com/smithsonianmag/arts-culture'),
        ('Travel',
         'http://feeds.feedburner.com/smithsonianmag/travel'),
    ]

    def preprocess_html(self, soup):
        story = soup.find(name='div', attrs={'id':'article-body'})
        soup = BeautifulSoup('<html><head><title>t</title></head><body></body></html>')
        body = soup.find(name='body')
        body.insert(0, story)
        return soup
However, there's one short bit of text that shows up in some articles. It's untagged and I haven't been able to figure out how to remove it via regex:
Code:
<p style="font-family: Arial, Helvetica, sans-serif; font-weight:bold; color:#000;">&nbsp; &nbsp; <a href="/subArticleBottomMag" style="color:#900">Subscribe now</a> for more of Smithsonian's coverage on history, science and nature. </p>
Any hints on how to get rid of this last bit of cruft?

Also, thanks Kovid for an awesome program!
Old 06-12-2012, 11:52 PM   #2
kovidgoyal
creator of calibre
Posts: 25,296
Karma: 4961457
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Code:
(re.compile(r'<p.*?Subscribe now</a>.*?for more of Smithsonian.*?</p>', re.DOTALL), lambda m: '')
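That tuple goes inside the recipe's preprocess_regexps. As a minimal standalone check (the html string here is a trimmed sample of the markup quoted in the first post):

```python
import re

# Trimmed sample of the untagged subscription blurb from the article HTML
html = ('<p style="font-weight:bold;">&nbsp; &nbsp; '
        '<a href="/subArticleBottomMag" style="color:#900">Subscribe now</a> '
        "for more of Smithsonian's coverage on history, science and nature. </p>")

# Lazy wildcards plus re.DOTALL let the match span the whole paragraph,
# even if the markup contains newlines
pattern = re.compile(r'<p.*?Subscribe now</a>.*?for more of Smithsonian.*?</p>',
                     re.DOTALL)
cleaned = pattern.sub('', html)
assert cleaned == ''  # the entire paragraph is removed
```

In the recipe itself: preprocess_regexps = [(pattern, lambda m: '')].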
Old 06-14-2012, 01:47 PM   #3
terminalveracity
Member
Thanks for the hint, Kovid.

Fixes:
cover image
clean up unwanted text (comments, ads, menus)
better text formatting
remove text overlapping images on K3

Here's the updated recipe:

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
import re

class SmithsonianMagazine(BasicNewsRecipe):
    title          = u'Smithsonian Magazine'
    language       = 'en'
    __author__     = 'Krittika Goyal (mod by TerminalVeracity)'
    oldest_article = 31  # days
    max_articles_per_feed = 50
    use_embedded_content = False
    recursions = 1
    cover_url = 'http://sphotos.xx.fbcdn.net/hphotos-snc7/431147_10150602715983253_764313347_n.jpg'
    match_regexps = ['&page=[2-9]$']
    preprocess_regexps = [
        (re.compile(r'for more of Smithsonian\'s coverage on history, science and nature.', re.DOTALL), lambda m: ''),
    ]
    extra_css             = """
                               h1{font-size: large; margin: .2em 0}
                               h2{font-size: medium; margin: .2em 0}
                               h3{font-size: medium; margin: .2em 0}
                               #byLine{margin: .2em 0}
                               .articleImageCaptionwide{font-style: italic}
                               .wp-caption-text{font-style: italic}
                               img{display: block}
                            """


    remove_stylesheets = True
    remove_tags_after  = dict(name='div', attrs={'class':['post','articlePaginationWrapper']})
    remove_tags = [
        dict(name='iframe'),
        dict(name='div', attrs={'class':['article_sidebar_border','viewMorePhotos','addtoany_share_save_container','meta','social','OUTBRAIN','related-articles-inpage']}),
        dict(name='div', attrs={'id':['article_sidebar_border', 'most-popular_large', 'most-popular-body_large','comment_section','article-related']}),
        dict(name='ul', attrs={'class':'cat-breadcrumb col three last'}),
        dict(name='h4', attrs={'id':'related-topics'}),
        dict(name='table'),
        dict(name='a', attrs={'href':['/subArticleBottomWeb','/subArticleTopWeb','/subArticleTopMag','/subArticleBottomMag']}),
        dict(name='a', attrs={'name':'comments_shaded'}),
    ]


    feeds = [
        ('History and Archeology',
         'http://feeds.feedburner.com/smithsonianmag/history-archaeology'),
        ('People and Places',
         'http://feeds.feedburner.com/smithsonianmag/people-places'),
        ('Science and Nature',
         'http://feeds.feedburner.com/smithsonianmag/science-nature'),
        ('Arts and Culture',
         'http://feeds.feedburner.com/smithsonianmag/arts-culture'),
        ('Travel',
         'http://feeds.feedburner.com/smithsonianmag/travel'),
    ]

    def preprocess_html(self, soup):
        story = soup.find(name='div', attrs={'id':'article-body'})
        soup = BeautifulSoup('<html><head><title>t</title></head><body></body></html>')
        body = soup.find(name='body')
        body.insert(0, story)
        return soup
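For anyone who wants to try this without a full calibre GUI round trip: save the source above to a file (the smithsonian.recipe filename below is arbitrary) and feed it to ebook-convert; the --test flag limits the fetch to a few articles per feed so the build is quick:

```shell
# Quick test build of the recipe; --test downloads only a couple of articles per feed
ebook-convert smithsonian.recipe smithsonian.epub --test
```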

Last edited by terminalveracity; 06-14-2012 at 05:58 PM.
Old 06-14-2012, 03:26 PM   #4
kovidgoyal
creator of calibre
You should have recursions = 1 in that. recursions = 10 means links are followed up to depth 10; you only want links followed to depth 1.
Old 06-14-2012, 05:59 PM   #5
terminalveracity
Member
Perfect, thanks. (I've updated the previous post.)