Old 09-17-2010, 05:09 PM   #2746
Flexicat
Junior Member
Flexicat began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Aug 2010
Device: Kobo
Hello. Can someone give me some assistance in creating a recipe for a site that does not have an RSS feed?

The base URL is "http://archiveofourown.org/tags/Sherlock%20(TV)/works", but the actual story titles seem to be located within HTML that looks like this on the page:

Code:
<!--title, author, fandom-->
<div class="header module">
  <h4 title="title">
    <a href="/works/117685">Disorder</a>
    by
    <!-- do not cache -->
  </h4>
As a result, I cannot figure out how to extract the article ID number for use. I am guessing that I will have to parse the HTML code of the page but have never done that type of extraction before. I am not familiar with Python or Beautiful Soup.
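From skimming the recipes posted earlier in this thread, I think what I need is something along these lines, though this is completely untested on my end and the class and feed names are just placeholders:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AO3Sherlock(BasicNewsRecipe):
    title = 'AO3 Sherlock (TV)'
    INDEX = 'http://archiveofourown.org'

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX + '/tags/Sherlock%20(TV)/works')
        articles = []
        # each story title sits in <h4 title="title"><a href="/works/NNNNNN">...</a></h4>
        for h4 in soup.findAll('h4', attrs={'title': 'title'}):
            a = h4.find('a', href=True)
            if a is not None:
                articles.append({'title': self.tag_to_string(a),
                                 'url': self.INDEX + a['href'],
                                 'description': '', 'date': ''})
        return [('Sherlock (TV) works', articles)]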

Thanks.
Flexicat is offline  
Old 09-17-2010, 05:26 PM   #2747
krunk
Member
krunk began at the beginning.
 
Posts: 19
Karma: 10
Join Date: Feb 2010
Location: Los Angeles, CA
Device: Kindle 3
I'd like to set calibre up on a cron schedule for automatic downloading without having to start up the application.

There are some stock recipes that I use to pull down news feeds and such. I noticed with the command line utility ebook-convert, you can get a list of built-in recipes.

But how do you use a built in recipe from the list?

I tried:

Code:
ebook-convert "New York Times Top Stories" nyt.mobi
But it appears ebook-convert only takes files as args. Do I need to do something like go to the Customize menu, open "Customize builtin recipe", copy and paste the source from the advanced mode into a *.recipe file, and call it that way?
krunk is offline  
Old 09-17-2010, 06:07 PM   #2748
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire

Can someone help me with this regexp please? I suck at regex.
Trying to find (and strip) "Today's Nuze".
I tried the following with no success:
Code:
preprocess_regexps = [
    (re.compile(r'(Today)(.).*?(\s+)(Nuze)', re.DOTALL|re.IGNORECASE), lambda match: '')]
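For what it's worth, here is a quick standalone check of the pattern above, assuming the phrase appears in the page text literally as "Today's Nuze":

Code:
import re

pat = re.compile(r'(Today)(.).*?(\s+)(Nuze)', re.DOTALL | re.IGNORECASE)
print(pat.sub('', "intro Today's Nuze body text"))   # the "Today's Nuze" phrase is stripped out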
thanks

Last edited by TonytheBookworm; 09-17-2010 at 06:14 PM. Reason: figured it out
TonytheBookworm is offline  
Old 09-17-2010, 06:13 PM   #2749
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
 
Posts: 45,382
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@krunk: append .recipe to the title
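Applied to krunk's example, that would presumably look like the following (the cron line is only a sketch and assumes ebook-convert is on cron's PATH):

Code:
ebook-convert "New York Times Top Stories.recipe" nyt.mobi

# and for unattended downloads, a crontab entry along these lines (sketch only):
# 0 6 * * * ebook-convert "New York Times Top Stories.recipe" /path/to/nyt.mobi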
kovidgoyal is offline  
Old 09-17-2010, 06:22 PM   #2750
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Neal Boortz Nuze

Here is the recipe for Nealz Nuze, the show notes for the nationally syndicated (U.S.) talk show host Neal Boortz.
Attached Files
File Type: rar boortz.rar (1.3 KB, 316 views)

Last edited by TonytheBookworm; 09-17-2010 at 06:36 PM. Reason: fixed recipe
TonytheBookworm is offline  
Old 09-17-2010, 08:47 PM   #2751
burbank_atl
Junior Member
burbank_atl began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Sep 2010
Device: Nook
Python Env

Quote:
Originally Posted by Starson17 View Post
I did. It isn't what I'm looking for.

I would like a standalone Python environment that would allow me to use the GUI. Unless I missed something, calibre-debug only loads a CLI.

For me the real solution would be to find docs/info on installing the calibre-specific packages into Python's site-packages. Then I could play with the recipes as standalone Python procedures.
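If I read calibre-debug --help correctly, the closest thing available is running a script through it, which puts calibre's packages on the path without touching site-packages. An untested sketch:

Code:
# save as try_recipe.py and run with:  calibre-debug -e try_recipe.py
# (sketch only; the imports are the same ones used by the recipes in this thread)
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

print(BasicNewsRecipe, BeautifulSoup)   # shows calibre's modules are importable here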
burbank_atl is offline  
Old 09-17-2010, 09:04 PM   #2752
bhandarisaurabh
Enthusiast
bhandarisaurabh began at the beginning.
 
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none
Hi, can anyone help me with a recipe for Industry Week?
http://www.industryweek.com/Archive.aspx
bhandarisaurabh is offline  
Old 09-17-2010, 09:29 PM   #2753
krunk
Member
krunk began at the beginning.
 
Posts: 19
Karma: 10
Join Date: Feb 2010
Location: Los Angeles, CA
Device: Kindle 3
Quote:
Originally Posted by kovidgoyal View Post
@krunk: append .recipe to the title
Thank you kovid!
krunk is offline  
Old 09-17-2010, 10:43 PM   #2754
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by bhandarisaurabh View Post
Hi, can anyone help me with a recipe for Industry Week?
http://www.industryweek.com/Archive.aspx
This should work for the current article month/year.
The page has a form for selecting other years, but I'm not sure what URLs it actually uses for that, so I just stuck with the current month/year since I figured that is what you would want anyway. Note that even though September 2010 is selected on the page, the article content still says August 18 or so; that is the same date shown on the original page.

Anyway, the only thing I didn't understand at first was how to get the description to drop the text inside the <a>; that is handled in the updated code below.


Updated code to build the description correctly
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class IW(BasicNewsRecipe):
    title       = 'Industry Week'
    __author__  = 'Tonythebookworm'
    description = ''
    language    = 'en'
    publisher   = 'Tonythebookworm'
    category    = 'Manufacturing'
    use_embedded_content  = False
    no_stylesheets        = True
    oldest_article        = 40
    remove_javascript     = True
    remove_empty_feeds    = True
    max_articles_per_feed = 200  # only gets the first 200 articles

    INDEX = 'http://www.industryweek.com'

    remove_tags = [dict(name='div', attrs={'class':['crumbNav']}),
                   dict(name='i')]

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Current Month", u"http://www.industryweek.com/Archive.aspx"),
                          ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def make_links(self, url):
        current_articles = []
        soup = self.index_to_soup(url)
        for item in soup.findAll('a', attrs={'class':'article'}):
            link = item['href']
            if link:
                url   = self.INDEX + link
                title = self.tag_to_string(item)
                # take the parent <p>, pull the link out of it, and keep the rest as the description
                descr = item.parent
                item.extract()
                descr = self.tag_to_string(descr)
                current_articles.append({'title': title, 'url': url, 'description': descr, 'date': ''})
        return current_articles

    def print_version(self, url):
        # rewrite ReadArticle.aspx?ArticleID=NNNNN to the print-friendly page
        article_id = url.split("=")[1]
        return 'http://www.industryweek.com/PrintArticle.aspx?ArticleID=' + article_id

Last edited by TonytheBookworm; 09-18-2010 at 01:54 PM. Reason: modified code (considered complete)
TonytheBookworm is offline  
Old 09-17-2010, 11:42 PM   #2755
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
If you have something like this:
Spoiler:

Code:
<p>
<a class="article" href="/ReadArticle.aspx?ArticleID=22529">Identifying Your Future Leaders</a>
THIS IS SOME PRETTY TEXT I WOULD LIKE TO CAPTURE.
</p>


how would you go about getting the text inside the p?

I tried doing something like this but ended up getting the text inside the a tag as well.

Spoiler:

Code:
for item in soup.findAll('a', attrs={'class':'article'}):
    link = item['href']
    if link:
        url   = self.INDEX + link
        title = self.tag_to_string(item)
        descr = self.tag_to_string(item.parent)  # question about this
TonytheBookworm is offline  
Old 09-18-2010, 08:43 AM   #2756
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
If you have something like this:
Spoiler:

Code:
<p>
<a class="article" href="/ReadArticle.aspx?ArticleID=22529">Identifying Your Future Leaders</a>
THIS IS SOME PRETTY TEXT I WOULD LIKE TO CAPTURE.
</p>

how would you go about getting the text inside the p?
Without testing, my thoughts would be:
1) grab the parent of item (<p>) and extract() item, leaving a p with self.tag_to_string of what you want, or
2) perhaps, just grab item.next.next.next

I always need to test to be sure, but one of those should work.
edit: I see your post above, I tested it and they both work.
This worked best:
Code:
          descr       = item.parent
          item.extract()
          descr       = self.tag_to_string(descr)
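To make option 1 concrete on the exact snippet from the question, outside of a recipe (untested here; run it with calibre-debug -e so calibre's BeautifulSoup is importable):

Code:
from calibre.ebooks.BeautifulSoup import BeautifulSoup

html = '''<p>
<a class="article" href="/ReadArticle.aspx?ArticleID=22529">Identifying Your Future Leaders</a>
THIS IS SOME PRETTY TEXT I WOULD LIKE TO CAPTURE.
</p>'''

soup = BeautifulSoup(html)
item = soup.find('a', attrs={'class': 'article'})
p = item.parent            # the enclosing <p>
item.extract()             # pull the <a> out of the tree
print(''.join(p.findAll(text=True)).strip())   # only the text left in the <p>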

Last edited by Starson17; 09-18-2010 at 10:07 AM.
Starson17 is offline  
Old 09-18-2010, 10:57 AM   #2757
SilentSeven
Enthusiast
SilentSeven began at the beginning.
 
Posts: 27
Karma: 10
Join Date: Sep 2010
Device: Nexus7
Wondering if someone might be able to review the current Seattle Times recipe?

The current version picks up a lot of articles more than once.

It would also be fantastic if the output could have a table-of-contents-style organization built around the top dark blue rollover bar.

Wish I could do this myself...
SilentSeven is offline  
Old 09-18-2010, 01:44 PM   #2758
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:

Wish I could do this myself...
Give it a shot; it is not that hard to pick up. Once you get the basics down, it becomes pretty simple, and there are many good people here who will help you along the way. Read through this thread and you will get an idea of how it is done. Heck, just read the exchanges between Starson17 and myself, starting with the AJC recipe (my first one), and you will see how most of it works. I think I have asked, and Starson17 has answered, 90% or better of the issues one would face while writing a recipe.

Good luck, and let us know if we can help. But seriously, give it a shot and you will be amazed at how quickly you pick it up.
TonytheBookworm is offline  
Old 09-18-2010, 01:52 PM   #2759
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17 View Post
Without testing, my thoughts would be:
1) grab the parent of item (<p>) and extract() item, leaving a p with self.tag_to_string of what you want, or
2) perhaps, just grab item.next.next.next

I always need to test to be sure, but one of those should work.
edit: I see your post above, I tested it and they both work.
This worked best:
Code:
          descr       = item.parent
          item.extract()
          descr       = self.tag_to_string(descr)
worked great thank you.
TonytheBookworm is offline  
Old 09-18-2010, 08:40 PM   #2760
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
I am having trouble with my recipe:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    title          = u'The Marker1'
    description    = 'TheMarker'
    cover_url      = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg'
    language       = 'he'
    simultaneous_downloads = 1
    delay                  = 6
    remove_javascript      = True
    timefmt        = '[%a, %d %b, %Y]'
    oldest_article = 1
    max_articles_per_feed = 1000
    remove_tags = [dict(name='tr', attrs={'bgcolor':['#738A94']})]
    extra_css = 'body{direction: rtl;} .article_description{direction: rtl;} a.article{direction: rtl;} .calibre_feed_description{direction: rtl;}'

    feeds = [
        (u'Head Lines',           u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'),
        (u'TA Market',            u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'),
        (u'Real Estate',          u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'),
        (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'),
        (u'Law',                  u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'),
        (u'Media',                u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'),
        (u'Consumer',             u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'),
        (u'Career',               u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'),
        (u'Car',                  u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'),
        (u'High Tech',            u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'),
        (u'Investor Guide',       u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml'),
    ]

    def print_version(self, url):
        baseURL = url.replace('tmc/article.jhtml?ElementId=',
                              'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
        return baseURL + '.xml'
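In case it helps with the debugging, this is what the print_version rewrite produces for a made-up ElementId (pure string manipulation, no network involved):

Code:
url = 'http://www.themarker.com/tmc/article.jhtml?ElementId=1188392'
base = url.replace('tmc/article.jhtml?ElementId=',
                   'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
print(base + '.xml')
# http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F1188392.xml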


I ran ebook-convert and I think this is the relevant output:

Spoiler:

Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing feed_1/article_0/index.html ...
Forcing feed_1/article_0/index.html into XHTML namespace
Parsing feed_1/article_1/index.html ...
Forcing feed_1/article_1/index.html into XHTML namespace
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_1/index.html ...
Initial parse failed:
Traceback (most recent call last):
File "site-packages\calibre\ebooks\oeb\base.py", line 816, in first_pass
File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70608)
File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67148)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
XMLSyntaxError: Opening and ending tag mismatch: hr line 29 and div, line 30, column 7

Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Referenced file u'/tmc/i/newsmap/ajax_indicator.gif' not found
Referenced file u'/tmc/i/newsmap/pixel_off.gif' not found
Referenced file u'feed_0/article_0/stylesheets/i/msn/hp4_channels_top_bg.gif' not found
Referenced file u'/tmc/i/newsmap/right_off.gif' not found
Referenced file u'/tmc/i/newsmap/tp_left.gif' not found
Referenced file u'/tmc/i/tags/close_off.gif' not found
Referenced file u'/tmc/i/marketing/hakrishim/bgr_text_main.gif' not found
Referenced file u'/tmc/i/newsmap/tp_pixel.gif' not found
Referenced file u'feed_0/article_0/stylesheets/i/msn/hp4_channels_bottom_bg.gif' not found
Referenced file u'/tmc/i/tags/pixel_off.gif' not found
Referenced file u'/tmc/i/newsmap/left_on.gif' not found
Referenced file u'/tmc/i/c/greyDot.gif' not found
Referenced file u'/tmc/i/tags/bg2.gif' not found
Referenced file u'/tmc/i/article/indicator_medium.gif' not found
Referenced file u'feed_0/article_1/stylesheets/i/msn/hp4_channels_bottom_bg.gif' not found
Referenced file u'/tmc/i/dollar/back.jpg' not found
Referenced file u'/tmc/i/tags/right_off.gif' not found
Referenced file u'/tmc/i/newsmap/left_off.gif' not found
Referenced file u'/tmc/i/marketing/hakrishim/bgr_text_Krishim.gif' not found
Referenced file u'/tmc/i/newsmap/tp_right.gif' not found
Referenced file u'/tmc/i/newsmap/pixel_on.gif' not found
Referenced file u'/tmc/i/tags/pixel_on.gif' not found
Referenced file u'feed_0/article_1/stylesheets/i/msn/hp4_top_search_bg.gif' not found
Referenced file u'/tmc/i/tags/close_on.gif' not found
Referenced file u'/tmc/i/tags/right_on.gif' not found
Referenced file u'/tmc/i/tags/left_on.gif' not found
Referenced file u'/tmc/i/newsmap/right_on.gif' not found
Referenced file u'/tmc/i/marketing/forecast/text_box.gif' not found
Referenced file u'feed_0/article_0/stylesheets/i/msn/hp4_top_search_bg.gif' not found
Referenced file 'feed_2/index.html' not found
Referenced file u'/tmc/i/tags/left_off.gif' not found
Referenced file u'feed_0/article_1/stylesheets/i/msn/hp4_channels_top_bg.gif' not found
Referenced file u'/tmc/i/tags/bg_footer1.gif' not found
Reading TOC from NCX...
34% Running transforms on ebook...
Merging user specified metadata...
Detecting structure...


If not, please tell me where to look.
And thank you, Starson, for the help so far. I think this message was posted in an orderly fashion.
marbs is offline  