Custom recipes (archive, read-only) - Page 184

Flexicat · 09-17-2010, 06:09 PM

Hello. Can someone give me some assistance in creating a recipe for a site that does not have an RSS feed?

The base url is "http://archiveofourown.org/tags/Sherlock%20(TV)/works" but the actual story titles seem to be located within HTML code that looks like this on the page;

Code:

  <!--title, author, fandom-->
    <div class="header module">
      <h4 title="title">
  	    <a href="/works/117685">Disorder</a>
   		  by
        <!-- do not cache -->
      </h4>

As a result, I cannot figure out how to extract the article ID number for use. I am guessing that I will have to parse the HTML code of the page but have never done that type of extraction before. I am not familiar with Python or Beautiful Soup.

Thanks.

krunk · 09-17-2010, 06:26 PM

I'd like to set calibre up on a cron schedule for automatic downloading without having to start up the application.

There are some stock recipes that I use to pull down news feeds and such. I noticed with the command line utility ebook-convert, you can get a list of built-in recipes.

But how do you use a built in recipe from the list?

I tried:

Code:

ebook-convert "New York Times Top Stories" nyt.mobi

But it appears ebook-convert only takes files as args. Do I need to do something like go to the Customize menu, "Customize builtin recipe", then copy and paste from the advanced into a *.recipe file, and call it that way?

TonytheBookworm · 09-17-2010, 07:07 PM

can someone help me with this regexp please? I suck at regex

Trying to find (Today's Nuze)
i tried the following with no success

Code:

preprocess_regexps = [
                          (re.compile(r'(Today)(.).*?(\\s+)(Nuze), re.DOTALL|re.IGNORECASE),lambda match: '')]

thanks

kovidgoyal · 09-17-2010, 07:13 PM

@krunk: append .recipe to the title

TonytheBookworm · 09-17-2010, 07:22 PM

here is the recipe for Nealz Nuze. The show notes for the National Syndicated (U.S.) talk show host Neal Boortz.

burbank_atl · 09-17-2010, 09:47 PM

Quote:

Originally Posted by Starson17

Read here.

I did. It isn't what I'm looking for.

I would like to use the standalone Python environment that would allow me to use the GUI. Unless I missed something, Calibre-debug would only load a CLI.

For me the real solution would be to find docs/info on installing the Calibre specific packages into the site-packages for Python. Then I can play with the recipes as just standalone python procedures.

bhandarisaurabh · 09-17-2010, 10:04 PM

hi can anyone help me with the recipe for industry week
http://www.industryweek.com/Archive.aspx

krunk · 09-17-2010, 10:29 PM

Quote:

Originally Posted by kovidgoyal

@krunk: append .recipe to the title

Thank you kovid!

TonytheBookworm · 09-17-2010, 11:43 PM

Quote:

Originally Posted by bhandarisaurabh

hi can anyone help me with the recipe for industry week
http://www.industryweek.com/Archive.aspx

This should work for the CURRENT ARTICLE MONTH/YEAR
It has a form that you select the different year but I'm not sure what the actual true urls are that it uses on that. So I just stuck with the current month year since I figured that is what you would want anyway. If you will look even though September 2010 is selected on the page the article content still says August 18 or whatever. That is the same date that is on the original page.

Anyway the only thing that I don't understand how to do is get the description to drop the text that is inside the <a>. Once that is done I will post an update.

Updated Code to do descr correctly

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class IW(BasicNewsRecipe):
    title      = 'Industry Week'
    __author__ = 'Tonythebookworm'
    description = ''
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'Manufactoring'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 40
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 200 # only gets the first 200 articles
    INDEX = 'http://www.industryweek.com'
    
    
    
    remove_tags = [dict(name='div', attrs={'class':['crumbNav']}),
                   dict(name='i')]
    
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Current Month", u"http://www.industryweek.com/Archive.aspx"),
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        
        for item in soup.findAll('a', attrs={'class':'article'}):
         
         link = item['href']
         soup = self.index_to_soup(url)    
         if link:
         
          url         = self.INDEX + link
          title       = self.tag_to_string(item)
          descr    = item.parent
          item.extract()
          descr       = self.tag_to_string(descr)
          #print 'the url is: ', url
          #print 'the title is: ', title
          #print 'the descr is: ', descr
          current_articles.append({'title': title, 'url': url, 'description': descr, 'date':''}) # append all this
        return current_articles
      

   
    def print_version(self, url):
        split1 = url.split("=")
        print_url = 'http://www.industryweek.com/PrintArticle.aspx?ArticleID=' + split1[1]
        
        return print_url

TonytheBookworm · 09-18-2010, 12:42 AM

If you have something like this:

Spoiler:

how would you go about getting the text inside the p?

I tried doing something like this but ended up getting the text inside the a tag as well.

Spoiler:

Starson17 · 09-18-2010, 09:43 AM

Quote:

Originally Posted by TonytheBookworm

If you have something like this:

Spoiler:

how would you go about getting the text inside the p?

Without testing, my thoughts would be:
1) grab the parent of item (<p>) and extract() item, leaving a p with self.tag_to_string of what you want, or
2) perhaps, just grab item.next.next.next

I always need to test to be sure, but one of those should work.
edit: I see your post above, I tested it and they both work.
This worked best:

Code:

          descr       = item.parent
          item.extract()
          descr       = self.tag_to_string(descr)

SilentSeven · 09-18-2010, 11:57 AM

Wondering if someone might be able to review the current Seattle Times recipe?

Current version picks up a lot of articles more than once.

It would also be fantastic if the output could have a table of contents type organization section built around the top darkblue rollover bar.

Wish I could this myself...

TonytheBookworm · 09-18-2010, 02:44 PM

Quote:

Wish I could this myself..

Give it a shot, it is not that hard to pick up. Once you get the basics down pat then it becomes pretty simple to do. And they are many good people here that will help you along the way. So take and read this thread and you will get an idea of how to do it. Heck just read the communications between starson17 and myself starting with the AJC recipe (my first one) then you will see how it is done to the most part. I think i have asked and starson17 has answered pretty much 90% or better of the issues one would face while trying to do a recipe.

Good luck and let us know if we can help. But seriously give it a shot and you will be amazed of how quick you pick it up.

TonytheBookworm · 09-18-2010, 02:52 PM

Quote:

Originally Posted by Starson17

Without testing, my thoughts would be:
1) grab the parent of item (<p>) and extract() item, leaving a p with self.tag_to_string of what you want, or
2) perhaps, just grab item.next.next.next

I always need to test to be sure, but one of those should work.
edit: I see your post above, I tested it and they both work.
This worked best:

Code:

          descr       = item.parent
          item.extract()
          descr       = self.tag_to_string(descr)

worked great thank you.

marbs · 09-18-2010, 09:40 PM

i am having trouble with my recipe:

Spoiler:

I ran ebook-convert and I think this is the relevant output:

Spoiler:

if not, pleas tell me where to look.
and thank you starson for the help so far. i think this message was posted in orderly fation

09-17-2010, 06:09 PM	#2746
Flexicat Junior Member Posts: 8 Karma: 10 Join Date: Aug 2010 Device: Kobo	Hello. Can someone give me some assistance in creating a recipe for a site that does not have an RSS feed? The base url is "http://archiveofourown.org/tags/Sherlock%20(TV)/works" but the actual story titles seem to be located within HTML code that looks like this on the page; Code: <!--title, author, fandom--> <div class="header module"> <h4 title="title"> <a href="/works/117685">Disorder</a> by <!-- do not cache --> </h4> As a result, I cannot figure out how to extract the article ID number for use. I am guessing that I will have to parse the HTML code of the page but have never done that type of extraction before. I am not familiar with Python or Beautiful Soup. Thanks.

09-17-2010, 06:26 PM	#2747
krunk Member Posts: 19 Karma: 10 Join Date: Feb 2010 Location: Los Angeles, CA Device: Kindle 3	I'd like to set calibre up on a cron schedule for automatic downloading without having to start up the application. There are some stock recipes that I use to pull down news feeds and such. I noticed with the command line utility ebook-convert, you can get a list of built-in recipes. But how do you use a built in recipe from the list? I tried: Code: ebook-convert "New York Times Top Stories" nyt.mobi But it appears ebook-convert only takes files as args. Do I need to do something like go to the Customize menu, "Customize builtin recipe", then copy and paste from the advanced into a *.recipe file, and call it that way?

09-17-2010, 07:07 PM	#2748
TonytheBookworm Addict Posts: 264 Karma: 62 Join Date: May 2010 Device: kindle 2, kindle 3, Kindle fire	can someone help me with this regexp please? I suck at regex Trying to find (Today's Nuze) i tried the following with no success Code: preprocess_regexps = [ (re.compile(r'(Today)(.).?(\\s+)(Nuze), re.DOTALL\|re.IGNORECASE),lambda match: '')] thanks Last edited by TonytheBookworm; 09-17-2010 at 07:14 PM. Reason: figured it out*

09-18-2010, 09:40 PM	#2760
marbs Zealot Posts: 122 Karma: 10 Join Date: Jul 2010 Device: nook	i am having trouble with my recipe: Spoiler: Code: class AdvancedUserRecipe1283848012(BasicNewsRecipe): description = 'TheMarker' cover_url = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg' title = u'The Marker1' language = 'he' simultaneous_downloads = 1 delay = 6 remove_javascript = True timefmt = '[%a, %d %b, %Y]' oldest_article = 1 remove_tags = [dict(name='tr', attrs={'bgcolor':['#738A94']}) ] max_articles_per_feed = 1000 extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }' feeds = [(u'Head Lines', u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'), (u'TA Market', u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'), (u'Real Estate', u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'), (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'), (u'Law', u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'), (u'Media', u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'), (u'Consumer', u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'), (u'Career', u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'), (u'Car', u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'), (u'High Tech', u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'), (u'Investor Guide', u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml')] def print_version(self, url): baseURL=url.replace('tmc/article.jhtml?ElementId=', 'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F') s= baseURL + '.xml' return s I ran ebook-convert and I think this is the relevant output: Spoiler: Parsing file 'feed_0/index.html' as HTML Forcing feed_0/index.html into XHTML namespace Parsing feed_1/article_0/index.html ... Forcing feed_1/article_0/index.html into XHTML namespace Parsing feed_1/article_1/index.html ... Forcing feed_1/article_1/index.html into XHTML namespace Parsing index.html ... Forcing index.html into XHTML namespace Parsing feed_1/index.html ... Initial parse failed: Traceback (most recent call last): File "site-packages\calibre\ebooks\oeb\base.py", line 816, in first_pass File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270) File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812) File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70608) File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67148) File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824) File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745) File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088) XMLSyntaxError: Opening and ending tag mismatch: hr line 29 and div, line 30, column 7 Parsing file 'feed_1/index.html' as HTML Forcing feed_1/index.html into XHTML namespace Referenced file u'/tmc/i/newsmap/ajax_indicator.gif' not found Referenced file u'/tmc/i/newsmap/pixel_off.gif' not found Referenced file u'feed_0/article_0/stylesheets/i/msn/hp4_channels_top_bg.gif' not found Referenced file u'/tmc/i/newsmap/right_off.gif' not found Referenced file u'/tmc/i/newsmap/tp_left.gif' not found Referenced file u'/tmc/i/tags/close_off.gif' not found Referenced file u'/tmc/i/marketing/hakrishim/bgr_text_main.gif' not found Referenced file u'/tmc/i/newsmap/tp_pixel.gif' not found Referenced file u'feed_0/article_0/stylesheets/i/msn/hp4_channels_bottom_bg.gif' not found Referenced file u'/tmc/i/tags/pixel_off.gif' not found Referenced file u'/tmc/i/newsmap/left_on.gif' not found Referenced file u'/tmc/i/c/greyDot.gif' not found Referenced file u'/tmc/i/tags/bg2.gif' not found Referenced file u'/tmc/i/article/indicator_medium.gif' not found Referenced file u'feed_0/article_1/stylesheets/i/msn/hp4_channels_bottom_bg.gif' not found Referenced file u'/tmc/i/dollar/back.jpg' not found Referenced file u'/tmc/i/tags/right_off.gif' not found Referenced file u'/tmc/i/newsmap/left_off.gif' not found Referenced file u'/tmc/i/marketing/hakrishim/bgr_text_Krishim.gif' not found Referenced file u'/tmc/i/newsmap/tp_right.gif' not found Referenced file u'/tmc/i/newsmap/pixel_on.gif' not found Referenced file u'/tmc/i/tags/pixel_on.gif' not found Referenced file u'feed_0/article_1/stylesheets/i/msn/hp4_top_search_bg.gif' not found Referenced file u'/tmc/i/tags/close_on.gif' not found Referenced file u'/tmc/i/tags/right_on.gif' not found Referenced file u'/tmc/i/tags/left_on.gif' not found Referenced file u'/tmc/i/newsmap/right_on.gif' not found Referenced file u'/tmc/i/marketing/forecast/text_box.gif' not found Referenced file u'feed_0/article_0/stylesheets/i/msn/hp4_top_search_bg.gif' not found Referenced file 'feed_2/index.html' not found Referenced file u'/tmc/i/tags/left_off.gif' not found Referenced file u'feed_0/article_1/stylesheets/i/msn/hp4_channels_top_bg.gif' not found Referenced file u'/tmc/i/tags/bg_footer1.gif' not found Reading TOC from NCX... 34% Running transforms on ebook... Merging user specified metadata... Detecting structure... if not, pleas tell me where to look. and thank you starson for the help so far. i think this message was posted in orderly fation

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Custom column read ?	pchrist7	Calibre	2	10-04-2010 03:52 AM
Archive for custom screensavers	sleeplessdave	Amazon Kindle	1	07-07-2010 01:33 PM
How to back up preferences and custom recipes?	greenapple	Calibre	3	03-29-2010 06:08 AM
Donations for Custom Recipes	ddavtian	Calibre	5	01-23-2010 05:54 PM
Help understanding custom recipes	andersent	Calibre	0	12-17-2009 03:37 PM

09-17-2010, 07:13 PM	#2749
kovidgoyal creator of calibre Posts: 45,621 Karma: 28549046 Join Date: Oct 2006 Location: Mumbai, India Device: Various	@krunk: append .recipe to the title

09-17-2010, 10:04 PM	#2752
bhandarisaurabh Enthusiast Posts: 49 Karma: 10 Join Date: Aug 2009 Device: none	hi can anyone help me with the recipe for industry week http://www.industryweek.com/Archive.aspx

09-18-2010, 11:57 AM	#2757
SilentSeven Enthusiast Posts: 27 Karma: 10 Join Date: Sep 2010 Device: Nexus7	Wondering if someone might be able to review the current Seattle Times recipe? Current version picks up a lot of articles more than once. It would also be fantastic if the output could have a table of contents type organization section built around the top darkblue rollover bar. Wish I could this myself...

Advert

Advert