Custom recipes (archive, read-only) - Page 164

Starson17 · 08-14-2010, 07:23 PM

Quote:

Originally Posted by soothsayer

Is there any way to do this more simply?

That was one way to scrape the image from a page that has that image. It's more or less guaranteed to have the image you want. If you don't want to scrape it, what do you want to do? Do you just want to build the URL from the current date? Will the current date produce a valid URL in all cases?

When I want to do something like build the URL, I usually scrape the text for the year/month/day off of the pages I'm scraping to build the ebook. Do you already have the year/month/day text you need to construct the URL?

soothsayer · 08-15-2010, 12:44 AM

I ended up borrowing the "def get_cover_url(self):" code from the New York Times top stories basic recipe.

Code:

import time
from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1281810521(BasicNewsRecipe):
    title          = u'NY Daily News'
    __author__ = 'you'

    description           = 'News from NY Daily News'
    language              = 'en'
    publisher             = 'NY Daily News'
    category              = 'news, politics, sports, ny'
    oldest_article        = 7
    max_articles_per_feed = 100
    no_stylesheets        = True


    extra_css = '.art_header      {text-align:    left;}\n    \
                 .byline        {font-family:   monospace;  \
                                 text-align:    left;       \
                                 margin-top:    0px;        \
                                 margin-bottom: 0px;}\n     \
                 .datestamp_update      {font-size:     small;      \
                                 margin-top:    0px;        \
                                 margin-bottom: 0px;}\n     \
                 .art_img_lrg_txt      {text-align:    left;       \
                                 font-style:    italic;}\n  \
                 .art_img_lrg         {text-align:    center;}\n  \
                 .art_img_lrg_credit        {text-align:    right;      \
                                 font-size:     small;      \
                                 margin-top:    0px;        \
                                 margin-bottom: 0px;}\n     \
                 .art_story   {text-align:    left;}\n'


    def get_cover_url(self):
        # Build today's front-page image URL from the local date.
        st = time.localtime()
        year = str(st.tm_year)
        month = "%.2d" % st.tm_mon
        day = "%.2d" % st.tm_mday
        cover = 'http://assets.nydailynews.com/img/' + year + '/' + month + '/' + day + '/gal_frontpage_' + month + day + '.jpg'
        br = self.get_browser()
        try:
            # Probe the URL; fall back to no cover if it cannot be fetched.
            br.open(cover)
        except Exception:
            self.log("\nCover unavailable")
            cover = None
        return cover


    encoding              = 'utf-8'

    keep_only_tags    = [
                       dict(name='div', attrs={'id':['art_story']})
                        ]
    remove_tags = [
                       dict(name='div', attrs={'class':['code_module']})
                  ]

    feeds = [(u'Top Stories', u'http://www.nydailynews.com/index_rss.xml'),
             (u'News', u'http://www.nydailynews.com/news/index_rss.xml'),
             (u'NY Crime', u'http://www.nydailynews.com/news/ny_crime/index_rss.xml'),
             (u'NY Local', u'http://www.nydailynews.com/ny_local/index_rss.xml'),
             (u'Politics', u'http://www.nydailynews.com/news/politics/index_rss.xml'),
             (u'Music', u'http://www.nydailynews.com/entertainment/music/index_rss.xml'),
             (u'Arts', u'http://www.nydailynews.com/entertainment/arts/index_rss.xml'),
             (u'Food and Dining', u'http://www.nydailynews.com/lifestyle/food/index_rss.xml'),
             (u'Lifestyle', u'http://www.nydailynews.com/lifestyle/index_rss.xml'),
             (u'Health/Well Being', u'http://www.nydailynews.com/lifestyle/health/index_rss.xml'),
             (u'Sports', u'http://www.nydailynews.com/sports/index_rss.xml'),
             ]

more feeds at http://www.nydailynews.com/services/...ols/index.html
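
For anyone who wants to check the date-based cover URL outside calibre, here is a minimal sketch that builds the same URL with time.strftime. The helper name build_cover_url is made up for illustration, and the URL pattern is simply copied from the recipe above; whether the site still serves images at that path is an assumption.

```python
import time

# URL pattern copied from the recipe above; that the site still serves
# images at this path is an assumption.
COVER_PATTERN = 'http://assets.nydailynews.com/img/%Y/%m/%d/gal_frontpage_%m%d.jpg'


def build_cover_url(when=None):
    """Return the front-page image URL for a struct_time (default: today)."""
    if when is None:
        when = time.localtime()
    return time.strftime(COVER_PATTERN, when)


# Example: the URL for the date of the first post on this page.
print(build_cover_url(time.strptime('2010-08-14', '%Y-%m-%d')))
```

Inside a recipe, get_cover_url would still need to try opening the URL, as the code above does, because the image may not exist for every date.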

cisaak · 08-15-2010, 07:58 AM

Quote:

Originally Posted by Starson17

That's part of what I meant when I said you didn't give enough information. You often need to remove a few items from inside what was kept. Without looking at the site, I can't advise on the best way to get only the first one or remove the second.

Is there a class or id label inside the two h1 tags that differs between them?

Or, you could just give me a link to an article and I'll check it out. Alternatively, there are more powerful/complicated ways to keep only the first h1 tag.

Thanks again for the help.

Inside the first h1 tag there is:
<a title="(text of different headline)" href="/">(text of headline I want)</a>

Nothing inside the second h1 tag.

This applies to any article in the online version of the St Louis Post-Dispatch.

Starson17 · 08-15-2010, 08:23 AM

Quote:

Originally Posted by cisaak

Thanks again for the help.

Inside the first h1 tag there is:
<a title="(text of different headline)" href="/">(text of headline I want)</a>

Nothing inside the second h1 tag.

This applies to any article in the online version of the St Louis Post-Dispatch.

Use Firefox and Firebug to find a tag containing the <h1> tag you don't want, then just use remove_tags to remove it.

It looks to me like you've got it backwards. I think you want to keep the second tag, the one without the <a> tag. The second one is the title for your article.

Try this:

Code:

remove_tags= [dict(name='div', attrs={'id':'blox-header'})]

cisaak · 08-15-2010, 11:20 AM

Quote:

Originally Posted by Starson17

Use Firefox and Firebug to find a tag containing the <h1> tag you don't want, then just use remove_tags to remove it.

It looks to me like you've got it backwards. I think you want to keep the second tag, the one without the <a> tag. The second one is the title for your article.

Try this:

Code:

remove_tags= [dict(name='div', attrs={'id':'blox-header'})]

Did not work. The first instance of h1 is in the following division:

<div class="grid_4" id="blox-logo">

I've tried:

remove_tags= [dict(name='div', attrs={'class':'grid_4'})]

and

remove_tags= [dict(name='div', attrs={'id':'blox-logo'})]

but neither worked. Any suggestions?

Starson17 · 08-15-2010, 12:50 PM

Quote:

Originally Posted by cisaak

Did not work. The first instance of h1 is in the following division:
<div class="grid_4" id="blox-logo">
I've tried:
remove_tags= [dict(name='div', attrs={'class':'grid_4'})]
and
remove_tags= [dict(name='div', attrs={'id':'blox-logo'})]
but neither worked. Any suggestions?

Post your recipe. Use CODE and SPOILER tags. I'll test it.

miangue · 08-17-2010, 07:12 PM

Let's see if someone can help me. I made this recipe and it works the way I want. The only problem is that the title comes out in the same font size as the article text, and I would like it to be bigger and bold. How can I do that?

Thanks for the help. Here is the recipe:

Quote:

class AdvancedUserRecipe1282021339(BasicNewsRecipe):
    title = u'Semana.com'
    oldest_article = 7
    max_articles_per_feed = 100
    use_embedded_content = False
    encoding = 'utf-8'
    no_stylesheets = True

    keep_only_tags = [
        dict(name='div', attrs={'class':['titular_articulo', 'texto_autor_articulo', 'hora']}),
        dict(attrs={'class':['texto_articulo']})
    ]

    remove_tags = [
        dict(name='div', attrs={'class':['cont_control']})
    ]

    feeds = [(u'Noticias', u'http://www.semana.com/rss/Semana_OnLine.xml')]
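
One possible way to get a larger, bold title is the recipe's extra_css attribute. This is only a sketch: it assumes the div with class titular_articulo (already kept by keep_only_tags) wraps the headline, the sizes are arbitrary, and it has not been tested against the site.

```python
# Meant to be pasted into the recipe class body as the extra_css attribute.
# Assumption: 'titular_articulo' is the element that holds the headline.
extra_css = '''
    .titular_articulo {font-size: x-large; font-weight: bold;}
    .texto_autor_articulo, .hora {font-size: small;}
'''
```

Because the recipe sets no_stylesheets = True, the site's own styles are dropped, so rules like these are not competing with them.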

TonytheBookworm · 08-18-2010, 12:07 AM

I'm trying to learn how to make my own recipes. Trying to follow the tutorial but I'm a little lost. I downloaded a python editor and then entered the following code:

Code:

class AdvancedUserRecipe1282103072(BasicNewsRecipe):
    title          = u'AJC'
    oldest_article = 1
    max_articles_per_feed = 100
    no_stylesheets = True
    
    feeds          = [(u'Breaking News', u'http://www.ajc.com/genericList-rss.do?source=61499'), (u'News Q & A', u'http://www.ajc.com/genericList-rss.do?source=77197'), (u'Metro and Georgia', u'http://www.ajc.com/section-rss.do?source=news'), (u'Cobb County', u'http://www.ajc.com/section-rss.do?source=cobb'), (u'Opinion', u'http://www.ajc.com/section-rss.do?source=opinion')]

I get, of course, a list of links to the articles, which is fine. But I want to get the actual articles as well. So I read a little more about using the print_version() function. The question I have is how I can get the URL to the print version, since it is dynamic.

I thought maybe adding :

Code:

 
def get_article_url(self, article):
    url = article.get('guid', None)
    # Skip podcasts and surveys; guard against a missing guid.
    if url and ('podcasts' in url or 'surveys' in url):
        url = None
    return url

Then I want to convert the returned URL from above. Let's say it spits back
http://www.ajc.com/news/atlanta/memo...rss_news_61499

I assume I would want to use some form of regular expression to trim everything after the ? and replace it with printArticle=y

but I'm confused because this is all new to me.

Code:

def print_version(self, url):
    return url.replace(url+'?printArticle=y')

is that even close to being right?

Any help would be appreciated...thank you so much..

JvdW · 08-18-2010, 03:20 AM

Hi All,

I'm hoping Kwetal is still following this thread, since one of his recipes has gone haywire.
It's the nrcnext recipe and it's failing with the following error:

Spoiler:

Code:

Fetch news from nrcnext
Resolved conversion options
calibre version: 0.7.14
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0,
 'book_producer': None,
 'change_justification': 'original',
 'chapter': None,
 'chapter_mark': 'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': None,
 'disable_font_rescaling': False,
 'dont_download_recipe': False,
 'dont_split_on_page_breaks': True,
 'extra_css': None,
 'extract_to': None,
 'flow_size': 260,
 'font_size_mapping': None,
 'footer_regex': '(?i)(?<=<hr>)((\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?\\d+<br>\\s*.*?\\s*)|(\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?.*?<br>\\s*\\d+))(?=<br>)',
 'header_regex': '(?i)(?<=<hr>)((\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?\\d+<br>\\s*.*?\\s*)|(\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?.*?<br>\\s*\\d+))(?=<br>)',
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x03C8E270>,
 'insert_blank_line': False,
 'insert_metadata': False,
 'isbn': None,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0,
 'linearize_tables': False,
 'lrf': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'max_toc_links': 50,
 'no_chapters_in_toc': False,
 'no_default_epub_cover': False,
 'no_inline_navbars': False,
 'no_svg_cover': False,
 'output_profile': <calibre.customize.profiles.SonyReaderOutput object at 0x03C8E610>,
 'page_breaks_before': None,
 'password': None,
 'prefer_metadata_cover': False,
 'preprocess_html': False,
 'preserve_cover_aspect_ratio': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': None,
 'remove_first_image': False,
 'remove_footer': False,
 'remove_header': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'series': None,
 'series_index': None,
 'tags': None,
 'test': False,
 'timestamp': None,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'use_auto_toc': False,
 'username': None,
 'verbose': 2}
InputFormatPlugin: Recipe Input running

Synthesizing mastheadImage
Python function terminated unexpectedly
  list index out of range (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 99, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 24, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 815, in run
  File "site-packages\calibre\customize\conversion.py", line 207, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 104, in convert
  File "site-packages\calibre\web\feeds\news.py", line 707, in download
  File "site-packages\calibre\web\feeds\news.py", line 870, in build_index
IndexError: list index out of range

At first I thought it had to do with the fact that one of the RSS feeds has changed, but editing the recipe didn't help.
Maybe a lot more has changed than only that, but even debugging the recipe with -vv didn't show more info than the above.

I'm using Calibre 0.7.14

Regards,

Joop

yawlidi · 08-18-2010, 03:53 AM

Does anybody have a recipe for Tor.com?
http://www.tor.com/

cisaak · 08-18-2010, 11:50 AM

Quote:

Originally Posted by Starson17

Post your recipe. Use CODE and SPOILER tags. I'll test it.

I can paste my recipe but am unfamiliar with CODE and SPOILER tags. Can you explain?

Three remaining goals:
1. Headline outputs twice. Want to remove one.
2. Change masthead from Kindle generic. Used the following without success:
def get_masthead_title(self)
return 'mystring'
3. Add new page command before every h1. Tried this but got error message:
h1 {page_break_before:always}
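
For goals 2 and 3, a minimal sketch of how those two pieces might look inside a recipe. The method definition needs a colon after the signature and an indented body, and the CSS property is spelled page-break-before (hyphens, not underscores) and goes into extra_css. The class name and title are placeholders, and whether a given output format honors the page break is not something this sketch verifies.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be imported outside calibre.
    class BasicNewsRecipe(object):
        pass


class PostDispatchSketch(BasicNewsRecipe):  # hypothetical class name
    title = u'St. Louis Post-Dispatch'

    # CSS property names use hyphens, not underscores.
    extra_css = 'h1 {page-break-before: always;}'

    def get_masthead_title(self):
        # Note the colon after the signature and the indented body.
        return 'mystring'
```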

sde · 08-18-2010, 01:21 PM

Does anybody have a recipe for PubMed (http://www.ncbi.nlm.nih.gov/pubmed) to be used in Calibre so that I can get the topics cleanly? I have created an RSS feed for lung cancer:
http://eutils.ncbi.nlm.nih.gov/entre...pUadKjxg6iRImT

I would like to get the title, journal and authors on different lines in the "Section Menu". The abstract pages below have duplicated titles. Otherwise it is fine.

Thanks in advance.

SD

Starson17 · 08-18-2010, 02:21 PM

Quote:

Originally Posted by cisaak

I can paste my recipe but am unfamiliar with CODE and SPOILER tags. Can you explain?

The CODE tag is the hash mark/pound symbol on the toolbar when you're replying. The SPOILER tag is the eye with an X in it on the same bar. Just paste your code, highlight it, then hit the code button, followed by the spoiler button. The code tag preserves essential formatting. The spoiler tag compresses it so others don't have to see it all, even if it's long.

Starson17 · 08-18-2010, 02:52 PM

Quote:

Originally Posted by TonytheBookworm

I'm trying to learn how to make my own recipes. Trying to follow the tutorial but I'm a little lost. I downloaded a python editor and then entered the following code:

Spoiler:

The code looks OK to me.

Quote:

I get of course a list of links to the articles which is fine. But I want to get the actual articles as well.

I don't understand this part. I checked, and your feeds pull the articles fine. They also pull lots of other junk, but that's normal until you either remove that junk in the recipe, or use the recipe to pull the print version, which is designed to have less junk.

Quote:

So I read a little more about using the print_version() function. The question I have is how I can get the URL to the print version, since it is dynamic.

I'm not sure why you say it's "dynamic" - it looks normally static to me.

Quote:

Code:

def print_version(self, url):
    return url.replace(url+'?printArticle=y')

is that even close to being right?

Not bad, but, as you said, you need to remove the material after the "?" before adding your string.

Here is code that I tested on a few of your links. It should work.

Code:

    def print_version(self, url):
        return url.partition('?')[0] +'?printArticle=y'
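
To see what the partition call does, here is the same expression as a plain function (no self) that can be run outside calibre. The two URLs are made-up examples, not real article links. str.partition('?') splits at the first "?" and returns the part before it as element 0; if there is no "?", element 0 is the whole string, so both cases end up with a single query string.

```python
def print_version(url):
    # Keep everything before the first '?', then append the print flag.
    return url.partition('?')[0] + '?printArticle=y'


# Hypothetical article URLs, for illustration only.
with_query = 'http://www.ajc.com/news/example-story.html?cxntlid=rss_feed'
without_query = 'http://www.ajc.com/news/example-story.html'

print(print_version(with_query))
print(print_version(without_query))
# Both lines print http://www.ajc.com/news/example-story.html?printArticle=y
```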

TonytheBookworm · 08-18-2010, 05:47 PM

Quote:

Originally Posted by Starson17

The code looks OK to me.

I don't understand this part. I checked, and your feeds pull the articles fine. They also pull lots of other junk, but that's normal until you either remove that junk in the recipe, or use the recipe to pull the print version, which is designed to have less junk.

I'm not sure why you say it's "dynamic" - it looks normally static to me.

Not bad, but, as you said, you need to remove the material after the "?" before adding your string.

Here is code that I tested on a few of your links. It should work.

Code:

    def print_version(self, url):
        return url.partition('?')[0] +'?printArticle=y'

Thanks, I will give it a shot. As for the dynamic part, I figured the URL itself would constantly change; then, after thinking about it and seeing your post, I realized, well duh, I parse for the URL.

Anyway, I'm learning, so let's just laugh together at the mess-ups. Thanks again.
