Old 08-14-2010, 07:23 PM   #2446
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by soothsayer View Post
Is there any way to do this more simply?
That was one way to scrape the image from a page that has that image. It's more or less guaranteed to have the image you want. If you don't want to scrape it, what do you want to do? Do you just want to build the URL from the current date? Will the current date produce a valid URL in all cases?

When I want to do something like build the URL, I usually scrape the text for the year/month/day off of the pages I'm scraping to build the ebook. Do you already have the year/month/day text you need to construct the URL?
Old 08-15-2010, 12:44 AM   #2447
soothsayer
Member
soothsayer began at the beginning.
 
Posts: 13
Karma: 34
Join Date: Jul 2010
Device: hanlin, astak the 2010 version plz.
I ended up borrowing the "def get_cover_url(self):" code from the New York Times Top Stories basic recipe.



Code:
import time
from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1281810521(BasicNewsRecipe):
    title          = u'NY Daily News'
    __author__ = 'you'

    description           = 'News from NY Daily News'
    language              = 'en'
    publisher             = 'NY Daily News'
    category              = 'news, politics, sports, ny'
    oldest_article        = 7
    max_articles_per_feed = 100
    no_stylesheets        = True


    extra_css = '''
        .art_header         {text-align: left;}
        .byline             {font-family: monospace; text-align: left;
                             margin-top: 0px; margin-bottom: 0px;}
        .datestamp_update   {font-size: small;
                             margin-top: 0px; margin-bottom: 0px;}
        .art_img_lrg_txt    {text-align: left; font-style: italic;}
        .art_img_lrg        {text-align: center;}
        .art_img_lrg_credit {text-align: right; font-size: small;
                             margin-top: 0px; margin-bottom: 0px;}
        .art_story          {text-align: left;}
    '''


    def get_cover_url(self):
        # Build today's front-page image URL from the local date.
        st = time.localtime()
        year = str(st.tm_year)
        month = '%.2d' % st.tm_mon
        day = '%.2d' % st.tm_mday
        cover = ('http://assets.nydailynews.com/img/' + year + '/' + month + '/' + day
                 + '/gal_frontpage_' + month + day + '.jpg')
        br = self.get_browser()
        try:
            # Check that the image exists before handing the URL to calibre.
            br.open(cover)
        except Exception:
            self.log("\nCover unavailable")
            cover = None
        return cover


    encoding              = 'utf-8'

    keep_only_tags    = [
                       dict(name='div', attrs={'id':['art_story']})
                        ]
    remove_tags = [
                       dict(name='div', attrs={'class':['code_module']})
                  ]

    feeds = [
        (u'Top Stories', u'http://www.nydailynews.com/index_rss.xml'),
        (u'News', u'http://www.nydailynews.com/news/index_rss.xml'),
        (u'NY Crime', u'http://www.nydailynews.com/news/ny_crime/index_rss.xml'),
        (u'NY Local', u'http://www.nydailynews.com/ny_local/index_rss.xml'),
        (u'Politics', u'http://www.nydailynews.com/news/politics/index_rss.xml'),
        (u'Music', u'http://www.nydailynews.com/entertainment/music/index_rss.xml'),
        (u'Arts', u'http://www.nydailynews.com/entertainment/arts/index_rss.xml'),
        (u'Food and Dining', u'http://www.nydailynews.com/lifestyle/food/index_rss.xml'),
        (u'Lifestyle', u'http://www.nydailynews.com/lifestyle/index_rss.xml'),
        (u'Health/Well Being', u'http://www.nydailynews.com/lifestyle/health/index_rss.xml'),
        (u'Sports', u'http://www.nydailynews.com/sports/index_rss.xml'),
    ]
more feeds at http://www.nydailynews.com/services/...ols/index.html
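
The date handling in get_cover_url can also be pulled out into a small helper that is easy to check outside calibre. This is only a sketch: build_cover_url and COVER_TEMPLATE are hypothetical names, and the URL pattern is simply the one used in the recipe above. It may not produce a valid URL for every date, so the check with the browser is still needed.

```python
from datetime import date

# URL pattern copied from the recipe above; whether the site keeps
# serving images under this scheme is an assumption.
COVER_TEMPLATE = ('http://assets.nydailynews.com/img/'
                  '%(year)04d/%(month)02d/%(day)02d/'
                  'gal_frontpage_%(month)02d%(day)02d.jpg')


def build_cover_url(day=None):
    """Return the front-page image URL for the given date (default: today)."""
    if day is None:
        day = date.today()
    return COVER_TEMPLATE % {'year': day.year, 'month': day.month, 'day': day.day}


print(build_cover_url(date(2010, 8, 5)))
```

In the recipe, get_cover_url would call build_cover_url() and then try to open the result, returning None if that fails.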
Old 08-15-2010, 07:58 AM   #2448
cisaak
Member
cisaak began at the beginning.
 
Posts: 17
Karma: 10
Join Date: Aug 2010
Device: Kindle DX
Quote:
Originally Posted by Starson17 View Post
That's part of what I meant when I said you didn't give enough information. You often need to remove a few items from inside what was kept. Without looking at the site, I can't advise on the best way to get only the first one or remove the second.

Is there a class or id label inside the two h1 tags that differs between them?

Or, you could just give me a link to an article and I'll check it out. Alternatively, there are more powerful/complicated ways to keep only the first h1 tag.
Thanks again for the help.

Inside the first h1 tag there is:
<a title="(text of different headline)" href="/">(text of headline I want)</a>

Nothing inside the second h1 tag.

This applies to any article in the online version of the St Louis Post-Dispatch.
Old 08-15-2010, 08:23 AM   #2449
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cisaak View Post
Thanks again for the help.

Inside the first h1 tag there is:
<a title="(text of different headline)" href="/">(text of headline I want)</a>

Nothing inside the second h1 tag.

This applies to any article in the online version of the St Louis Post-Dispatch.
Use Firefox and Firebug to find a tag containing the <h1> tag you don't want, then just use remove_tags to remove it.

It looks to me like you've got it backwards. I think you want to keep the second tag, the one without the <a> tag. The second one is the title for your article.

Try this:
Code:
remove_tags= [dict(name='div', attrs={'id':'blox-header'})]
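
For anyone unsure what that line is meant to do, here is a rough standalone sketch of the effect: every div with the given id is dropped, together with everything inside it. calibre applies remove_tags to the downloaded page itself; this version uses xml.etree.ElementTree only so that it runs without calibre. The page markup, the blox-story id and the remove_by_id helper are made up for illustration, and only the blox-header and blox-logo ids come from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified page; not the real site markup.
PAGE = """
<body>
  <div id="blox-header">
    <div class="grid_4" id="blox-logo">
      <h1><a title="site headline" href="/">Site name</a></h1>
    </div>
  </div>
  <div id="blox-story">
    <h1>Article headline</h1>
    <p>Story text.</p>
  </div>
</body>
"""


def remove_by_id(root, tag, wanted_id):
    """Delete every <tag id=wanted_id> element, together with its children."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag and child.get('id') == wanted_id:
                parent.remove(child)


root = ET.fromstring(PAGE.strip())
remove_by_id(root, 'div', 'blox-header')

# Collect the text of the h1 tags that survived the removal.
headlines = [''.join(h1.itertext()) for h1 in root.iter('h1')]
print(headlines)
```

If the unwanted h1 sits in a different container on the real page, the id in the rule has to match that container instead.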
Old 08-15-2010, 11:20 AM   #2450
cisaak
Member
cisaak began at the beginning.
 
Posts: 17
Karma: 10
Join Date: Aug 2010
Device: Kindle DX
Quote:
Originally Posted by Starson17 View Post
Use Firefox and Firebug to find a tag containing the <h1> tag you don't want, then just use remove_tags to remove it.

It looks to me like you've got it backwards. I think you want to keep the second tag, the one without the <a> tag. The second one is the title for your article.

Try this:
Code:
remove_tags= [dict(name='div', attrs={'id':'blox-header'})]
Did not work. The first instance of h1 is in the following division:

<div class="grid_4" id="blox-logo">

I've tried:

remove_tags= [dict(name='div', attrs={'class':'grid_4'})]

and

remove_tags= [dict(name='div', attrs={'id':'blox-logo'})]

but neither worked. Any suggestions?
Old 08-15-2010, 12:50 PM   #2451
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cisaak View Post
Did not work. The first instance of h1 is in the following division:
<div class="grid_4" id="blox-logo">
I've tried:
remove_tags= [dict(name='div', attrs={'class':'grid_4'})]
and
remove_tags= [dict(name='div', attrs={'id':'blox-logo'})]
but neither worked. Any suggestions?
Post your recipe. Use CODE and SPOILER tags. I'll test it.
Old 08-17-2010, 07:12 PM   #2452
miangue
Junior Member
miangue began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Aug 2010
Location: Colombia
Device: Sony PRS-300
I hope someone can help me. I made this recipe and it produces the output I want. The only problem is that the title comes out in the same font size as the article text, and I would like it to be bigger and bold. How can I do that?

Thanks for the help. Here is the recipe:

Quote:
class AdvancedUserRecipe1282021339(BasicNewsRecipe):
    title = u'Semana.com'
    oldest_article = 7
    max_articles_per_feed = 100
    use_embedded_content = False
    encoding = 'utf-8'
    no_stylesheets = True

    keep_only_tags = [
        dict(name='div', attrs={'class': ['titular_articulo', 'texto_autor_articulo', 'hora']}),
        dict(attrs={'class': ['texto_articulo']}),
    ]

    remove_tags = [
        dict(name='div', attrs={'class': ['cont_control']}),
    ]

    feeds = [(u'Noticias', u'http://www.semana.com/rss/Semana_OnLine.xml')]

Old 08-18-2010, 12:07 AM   #2453
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
I'm trying to learn how to make my own recipes. Trying to follow the tutorial but I'm a little lost. I downloaded a python editor and then entered the following code:
Code:
class AdvancedUserRecipe1282103072(BasicNewsRecipe):
    title          = u'AJC'
    oldest_article = 1
    max_articles_per_feed = 100
    no_stylesheets = True
    
    feeds          = [
        (u'Breaking News', u'http://www.ajc.com/genericList-rss.do?source=61499'),
        (u'News Q & A', u'http://www.ajc.com/genericList-rss.do?source=77197'),
        (u'Metro and Georgia', u'http://www.ajc.com/section-rss.do?source=news'),
        (u'Cobb County', u'http://www.ajc.com/section-rss.do?source=cobb'),
        (u'Opinion', u'http://www.ajc.com/section-rss.do?source=opinion'),
    ]
I get, of course, a list of links to the articles, which is fine. But I want to get the actual articles as well. So I read a little more about using the print_version() function. The question I have is: how can I get the URL of the print version, since it is dynamic?

I thought maybe adding :
Code:
 
def get_article_url(self, article):
    url = article.get('guid', None)
    # Skip items without a guid, plus podcasts and surveys.
    if url is None or 'podcasts' in url or 'surveys' in url:
        return None
    return url
Then I want to convert the returned URL from above. Let's say it spits back
http://www.ajc.com/news/atlanta/memo...rss_news_61499

I assume I would want to use some form of regular expression to trim everything after the ? and replace it with printArticle=y,

but I'm confused because this is all new to me.

Code:
def print_version(self, url):
    return url.replace(url+'?printArticle=y')
is that even close to being right?

Any help would be appreciated...thank you so much..
Old 08-18-2010, 03:20 AM   #2454
JvdW
Zealot
JvdW doesn't litterJvdW doesn't litter
 
Posts: 115
Karma: 150
Join Date: Jul 2008
Location: Netherlands Veenendaal
Device: Palm T5, Sony PRS-505, Nook Color
Hi All,

I'm hoping Kwetal is still following this thread, since one of their recipes has gone haywire.
It's the nrcnext recipe, and it's failing with the following error:
Spoiler:

Code:
Fetch news from nrcnext
Resolved conversion options
calibre version: 0.7.14
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0,
 'book_producer': None,
 'change_justification': 'original',
 'chapter': None,
 'chapter_mark': 'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': None,
 'disable_font_rescaling': False,
 'dont_download_recipe': False,
 'dont_split_on_page_breaks': True,
 'extra_css': None,
 'extract_to': None,
 'flow_size': 260,
 'font_size_mapping': None,
 'footer_regex': '(?i)(?<=<hr>)((\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?\\d+<br>\\s*.*?\\s*)|(\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?.*?<br>\\s*\\d+))(?=<br>)',
 'header_regex': '(?i)(?<=<hr>)((\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?\\d+<br>\\s*.*?\\s*)|(\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?.*?<br>\\s*\\d+))(?=<br>)',
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x03C8E270>,
 'insert_blank_line': False,
 'insert_metadata': False,
 'isbn': None,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0,
 'linearize_tables': False,
 'lrf': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'max_toc_links': 50,
 'no_chapters_in_toc': False,
 'no_default_epub_cover': False,
 'no_inline_navbars': False,
 'no_svg_cover': False,
 'output_profile': <calibre.customize.profiles.SonyReaderOutput object at 0x03C8E610>,
 'page_breaks_before': None,
 'password': None,
 'prefer_metadata_cover': False,
 'preprocess_html': False,
 'preserve_cover_aspect_ratio': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': None,
 'remove_first_image': False,
 'remove_footer': False,
 'remove_header': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'series': None,
 'series_index': None,
 'tags': None,
 'test': False,
 'timestamp': None,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'use_auto_toc': False,
 'username': None,
 'verbose': 2}
InputFormatPlugin: Recipe Input running

Synthesizing mastheadImage
Python function terminated unexpectedly
  list index out of range (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 99, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 24, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 815, in run
  File "site-packages\calibre\customize\conversion.py", line 207, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 104, in convert
  File "site-packages\calibre\web\feeds\news.py", line 707, in download
  File "site-packages\calibre\web\feeds\news.py", line 870, in build_index
IndexError: list index out of range

At first I thought it had to do with the fact that one of the RSS feeds has changed, but editing the recipe didn't help.
Maybe a lot more has changed than just that, but even debugging the recipe with -vv didn't show more info than the above.

I'm using Calibre 0.7.14.

Regards,

Joop
Old 08-18-2010, 03:53 AM   #2455
yawlidi
Junior Member
yawlidi began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Aug 2010
Device: Kindle DX
Does anybody have a recipe for Tor.com?
http://www.tor.com/
Old 08-18-2010, 11:50 AM   #2456
cisaak
Member
cisaak began at the beginning.
 
Posts: 17
Karma: 10
Join Date: Aug 2010
Device: Kindle DX
Quote:
Originally Posted by Starson17 View Post
Post your recipe. Use CODE and SPOILER tags. I'll test it.
I can paste my recipe but am unfamiliar with CODE and SPOILER tags. Can you explain?

Three remaining goals:
1. Headline outputs twice. Want to remove one.
2. Change masthead from Kindle generic. Used the following without success:
def get_masthead_title(self)
return 'mystring'
3. Add new page command before every h1. Tried this but got error message:
h1 {page_break_before:always}
Old 08-18-2010, 01:21 PM   #2457
sde
Junior Member
sde began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Aug 2010
Device: none
Does anybody have a recipe for PubMed (http://www.ncbi.nlm.nih.gov/pubmed) to be used in Calibre, so that I can get the topics cleanly? I have created an RSS feed for lung cancer:
http://eutils.ncbi.nlm.nih.gov/entre...pUadKjxg6iRImT

I would like to get the title, journal and authors on separate lines in the "Section Menu". The abstract pages below have duplicated titles. Otherwise it is fine.

Thanks in advance.

SD

Last edited by sde; 08-18-2010 at 01:29 PM.
Old 08-18-2010, 02:21 PM   #2458
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by cisaak View Post
I can paste my recipe but am unfamiliar with CODE and SPOILER tags. Can you explain?
The CODE tag is the hash mark/pound symbol on the toolbar when you're replying. The SPOILER tag is the eye with an X in it on the same bar. Just paste your code, highlight it, then hit the code button, followed by the spoiler button. The code tag preserves essential formatting. The spoiler tag compresses it so others don't have to see it all, even if it's long.

Last edited by Starson17; 08-18-2010 at 02:57 PM.
Old 08-18-2010, 02:52 PM   #2459
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
I'm trying to learn how to make my own recipes. Trying to follow the tutorial but I'm a little lost. I downloaded a python editor and then entered the following code:
Spoiler:
Code:
class AdvancedUserRecipe1282103072(BasicNewsRecipe):
    title          = u'AJC'
    oldest_article = 1
    max_articles_per_feed = 100
    no_stylesheets = True
    
    feeds          = [
        (u'Breaking News', u'http://www.ajc.com/genericList-rss.do?source=61499'),
        (u'News Q & A', u'http://www.ajc.com/genericList-rss.do?source=77197'),
        (u'Metro and Georgia', u'http://www.ajc.com/section-rss.do?source=news'),
        (u'Cobb County', u'http://www.ajc.com/section-rss.do?source=cobb'),
        (u'Opinion', u'http://www.ajc.com/section-rss.do?source=opinion'),
    ]
The code looks OK to me.

Quote:
I get, of course, a list of links to the articles, which is fine. But I want to get the actual articles as well.
I don't understand this part. I checked, and your feeds pull the articles fine. They also pull lots of other junk, but that's normal until you either remove that junk in the recipe, or use the recipe to pull the print version, which is designed to have less junk.

Quote:
So I read a little more about using the print_version() function. The question I have is: how can I get the URL of the print version, since it is dynamic?
I'm not sure why you say it's "dynamic" - it looks normally static to me.

Quote:
Code:
def print_version(self, url):
    return url.replace(url+'?printArticle=y')
is that even close to being right?
Not bad, but, as you said, you need to remove the material after the "?" before adding your string.

Here is code that I tested on a few of your links. It should work.
Code:
    def print_version(self, url):
        return url.partition('?')[0] +'?printArticle=y'
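
To see what partition() does here, the same expression can be tried as a plain function outside calibre (in the recipe it stays a method that takes self). The example.com URLs are hypothetical stand-ins, not real AJC links. For comparison, the earlier attempt with url.replace(url+'?printArticle=y') fails with a TypeError, because str.replace() needs both an old and a new string.

```python
def print_version(url):
    # Keep everything before the first '?' and append the print flag.
    # partition() returns the whole string as element 0 when there is no '?'.
    return url.partition('?')[0] + '?printArticle=y'


# Hypothetical article URLs, with and without a query string.
with_query = 'http://www.example.com/news/story-123.html?cxtype=rss_news_61499'
without_query = 'http://www.example.com/news/story-123.html'

print(print_version(with_query))
print(print_version(without_query))
```

Both calls give the same result, so links that arrive without a query string are handled as well.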

Last edited by Starson17; 08-18-2010 at 03:01 PM.
Old 08-18-2010, 05:47 PM   #2460
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17 View Post
The code looks OK to me.



I don't understand this part. I checked, and your feeds pull the articles fine. They also pull lots of other junk, but that's normal until you either remove that junk in the recipe, or use the recipe to pull the print version, which is designed to have less junk.



I'm not sure why you say it's "dynamic" - it looks normally static to me.



Not bad, but, as you said, you need to remove the material after the "?" before adding your string.

Here is code that I tested on a few of your links. It should work.
Code:
    def print_version(self, url):
        return url.partition('?')[0] +'?printArticle=y'
Thanks, I will give it a shot. As for the dynamic part, I figured the URL itself would constantly change; then, after thinking about it and seeing your post, I realized, well duh, I parse for the URL. Anyway, I'm learning, so let's just laugh together at the mess-ups. Thanks again.