MobileRead Forums > E-Book Software > Calibre > Recipes
Old 06-21-2010, 10:17 AM   #2176
robandcurtis
Junior Member
 
Posts: 5
Karma: 12
Join Date: Jun 2010
Device: Kobo
Quote:
Originally Posted by rty View Post
Here it is. Recipe for London Free Press (Canada).
Hey that was fast. Works like a charm.
Old 06-21-2010, 12:04 PM   #2177
rty
Zealot
 
Posts: 108
Karma: 6066
Join Date: Apr 2010
Location: Singapore
Device: iPad Air, Kindle DXG, Kindle Paperwhite
Recipe for People's Daily (in Chinese)
Attached Files
File Type: zip PeopleDaily.zip (929 Bytes, 166 views)

Last edited by rty; 06-22-2010 at 01:15 PM.
Old 06-21-2010, 12:11 PM   #2178
rty
Zealot
 
Posts: 108
Karma: 6066
Join Date: Apr 2010
Location: Singapore
Device: iPad Air, Kindle DXG, Kindle Paperwhite
Quote:
Originally Posted by Starson17 View Post
you still haven't used append_page. Add preprocess_html the way that it's used in AG.

Help, Starson, please. Another multipage issue: I've encountered another website with multipage articles where the next page is linked via a button image, as follows:

Code:
<a href="/GB/1027/11928295.html">
<img src="/img/next_b.gif" border="0">
</a>
Please look at the code below (click the Show button), which I modified from AG to combine the pages.

Here I was trying to find the image with src='/img/next_b.gif' and then grab the href for the URL, but it doesn't seem to work. What did I do wrong?

Spoiler:
Code:
    def append_page(self, soup, appendtag, position):
        pager = soup.find('img',attrs={'src':'/img/next_b.gif'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'left_content'})
           #for it in texttag.findAll(style=True):
           #   del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup): 
        mtag = '<meta http-equiv="content-type" content="text/html;charset=GB2312" />\n<meta http-equiv="content-language" content="utf-8" />'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['form']
        self.append_page(soup, soup.body, 3)
        #pager = soup.find('a',attrs={'class':'ab12'})
        #if pager:
        #   pager.extract()        
        return soup
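For reference, the likely culprit in the snippet above: soup.find('img', ...) returns the <img> tag itself, and pager.a searches for an <a> *inside* that image, which cannot exist; the href lives on the enclosing anchor, so something like pager.parent['href'] is what's needed. The idea can be sketched outside calibre with the standard library (hypothetical class name, not calibre API):

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Track the enclosing <a>; when the target <img> appears, the
    next-page URL is the anchor's href, not an attribute of the image."""
    def __init__(self, img_src):
        super().__init__()
        self.img_src = img_src
        self._current_href = None  # href of the <a> we are currently inside
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a':
            self._current_href = attrs.get('href')
        elif tag == 'img' and attrs.get('src') == self.img_src:
            self.next_url = self._current_href  # taken from the parent <a>

    def handle_endtag(self, tag):
        if tag == 'a':
            self._current_href = None

html = '<a href="/GB/1027/11928295.html"><img src="/img/next_b.gif" border="0"></a>'
finder = NextLinkFinder('/img/next_b.gif')
finder.feed(html)
# finder.next_url is now '/GB/1027/11928295.html'
```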

Last edited by rty; 06-21-2010 at 12:21 PM.
Old 06-21-2010, 01:04 PM   #2179
rford
Junior Member
 
Posts: 4
Karma: 10
Join Date: Jun 2010
Device: kobo
Rotating Images.

I have a custom recipe, similar to the xkcd recipe, to download all my favorite comic strips.

The one thing I found annoying was that the images in the epub were too wide and were getting cut off, so I rotated them. Now long 3- and 4-panel strips are landscape.

Here is the code snippet that I used to rotate the images. Hopefully others will find it useful.
Code:
import calibre.utils.PythonMagickWand as pw
from ctypes import byref  # byref is used by MagickGetException below
Code:
    def postprocess_html(self, soup, first):
        #process all the images. assumes that the new html has the correct path
        for tag in soup.findAll(lambda tag: tag.name.lower()=='img' and tag.has_key('src')):
            iurl = tag['src']
            print 'resizing image ' + iurl
            with pw.ImageMagick():
                img = pw.NewMagickWand()
                p = pw.NewPixelWand()
                if img < 0:
                    raise RuntimeError('Out of memory')
                if not pw.MagickReadImage(img, iurl):
                    severity = pw.ExceptionType(0)
                    msg = pw.MagickGetException(img, byref(severity))
                    raise IOError('Failed to read image from: %s: %s'
                        %(iurl, msg))
                
                width = pw.MagickGetImageWidth(img)
                height = pw.MagickGetImageHeight(img)

                if( width > height ) :
                    print 'Rotate image'
                    pw.MagickRotateImage(img, p, 90)

                if not pw.MagickWriteImage(img, iurl):
                    raise RuntimeError('Failed to save image to %s'%iurl)
                pw.DestroyMagickWand(img)


        return soup
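The width-versus-height decision above can be illustrated without ImageMagick. A minimal sketch (hypothetical helper, not part of calibre) that treats an image as a row-major grid and rotates it 90 degrees clockwise only when it is wider than tall:

```python
def rotate_if_wide(pixels):
    """Rotate a row-major 'image' 90 degrees clockwise when width > height,
    mirroring the recipe's landscape check; otherwise leave it alone."""
    height, width = len(pixels), len(pixels[0])
    if width <= height:
        return pixels
    # Clockwise rotation: new row c is old column c, read bottom-to-top
    return [[pixels[height - 1 - r][c] for r in range(height)]
            for c in range(width)]

# A 2x3 (wide) grid becomes 3x2 (tall)
print(rotate_if_wide([[1, 2, 3],
                      [4, 5, 6]]))  # [[4, 1], [5, 2], [6, 3]]
```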
Old 06-21-2010, 02:28 PM   #2180
gambarini
Connoisseur
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
http://www.ilsole24ore.com/rss/primapagina.xml

Any ideas about this feed?
The correct link is not under the "guid" tag, nor under "link" or "links".
Old 06-21-2010, 03:30 PM   #2181
Starson17
Wizard
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by rford View Post
here is the code snippet that I used to rotate the images. Hopefully others will find it useful.
Thanks! I don't want to rotate images, but I have cases where I'd like to compare image height to width. This will be useful.
Old 06-21-2010, 09:50 PM   #2182
bhandarisaurabh
Enthusiast
 
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none
Quote:
Originally Posted by rty View Post
Look at the RSS page provided by Forbes India: http://business.in.com/rss/

As I mentioned, the recipe picks up articles from the feed called "Complete Business.in.com" http://business.in.com/rssfeed/rss_all.xml

Anything that is not included by Forbes India in this particular feed, there's nothing I can do about it. Maybe you can write to Forbes India to ask them to include all the articles of the latest issue in the RSS feed page and see if they care.
Okay, thanks for the help.
Old 06-22-2010, 10:19 AM   #2183
mlstein
Enthusiast
 
Posts: 49
Karma: 2062
Join Date: May 2010
Device: iPad (one)
A second request for subscriber content for the London Review of Books, http://www.lrb.co.uk. Anyone?
Old 06-22-2010, 02:22 PM   #2184
gambarini
Connoisseur
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
With this feed I have tried two ways, and each one has its pros and cons...

With get_article I can obtain the correct link, but I can't find the title of the article.
With parse_index (index_to_soup) I can find the correct "title", but I don't get the link (in the soup there is a malformed "link" tag).
An example of the index_to_soup output:
Spoiler:

Code:
<item>
<title><![CDATA[Berlusconi: "Siamo il Paese 
più ricco d'Europa"]]></title>
<description><![CDATA[ROMA<BR>Il Premier Silvio Berlusconi continua a confidare su un forte consenso popolare alla sua persona e al suo governo, a dispetto «di tutto il fango che ci buttano addosso». E inivita il centrodestra a «non farsi del male in casa», apprpoffitando semmai di una opposizione che descrive pressochè inesistente. «Nonostante tutto il fango che tentano di buttarci addosso - dice nel suo collegamento  ...(continua)]]></description>
<author><![CDATA[ ]]></author>
<category><![CDATA[POLITICA]]></category>
<pubdate><![CDATA[Sun, 20 Jun 2010 13:34:37 +0200]]></pubdate>
<link />http://www.lastampa.it/redazione/cmsSezioni/politica/201006articoli/56066girata.asp
<enclosure url="http://www.lastampa.it/redazione/cmssezioni/politica/201006images/berlusconi01g.jpg" type="image/jpeg">
<image>
<url>http://www.lastampa.it/redazione/cmssezioni/politica/201006images/berlusconi01g.jpg</url>
<title></title>
<link />
<width></width>
<height></height>
</image>

So, is it possible to use both solutions together?
Or is it possible to extract the link next to the malformed <link /> tag?
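One possible angle (an assumption, sketched with the Python standard library rather than calibre's feed parser): when an XML parser reads `<link />http://...`, the URL ends up not as the element's text but as its *tail*, the text that follows the self-closed tag, so it can still be recovered:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for one malformed <item> from the feed
item_xml = """<item>
<title>Example</title>
<link />http://www.lastampa.it/redazione/cmsSezioni/politica/201006articoli/56066girata.asp
</item>"""

item = ET.fromstring(item_xml)
link = item.find('link')
# With a self-closed <link />, the URL is the tag's tail text, not its content
url = (link.tail or '').strip()
# url now holds the full article URL
```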


p.s.

The bug is probably related to the feed:
Spoiler:

Code:
Parsing index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages\calibre\ebooks\oeb\base.py", line 813, in first_pass
  File "lxml.etree.pyx", line 2538, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48266)
  File "parser.pxi", line 1536, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71653)
  File "parser.pxi", line 1408, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70449)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67144)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
XMLSyntaxError: Opening and ending tag mismatch: img line 29 and p, line 29, column 27

Last edited by gambarini; 06-22-2010 at 02:36 PM.
Old 06-22-2010, 02:46 PM   #2185
rty
Zealot
 
Posts: 108
Karma: 6066
Join Date: Apr 2010
Location: Singapore
Device: iPad Air, Kindle DXG, Kindle Paperwhite
Recipe for China Press USA (in Chinese)

Tested OK on B&N Nook.
Attached Files
File Type: zip ChinaPress.zip (1.2 KB, 192 views)
Old 06-22-2010, 03:12 PM   #2186
gambarini
Connoisseur
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by gambarini View Post
With this feed I have tried two ways, and each one has its pros and cons... [remainder of post #2184, quoted in full above, trimmed]
OK, this is my solution: I don't use the feed; instead I try to obtain the links directly from the HTML section of the site.
Here is the code (beta version):

Spoiler:

Code:
def parse_index(self):
    feeds = []
    for title, url in [
            ('Politica', 'http://www.lastampa.it/_web/CMSTP/tmplSezioni/POLITICA/politicaHP.asp')
            ]:
        soup = self.index_to_soup(url)
        # narrow the soup to the section block first
        soup = soup.find(attrs={'class': 'sezione'})

        articles = []
        for article in soup.findAllNext(attrs={'class': 'titolo'}):
            title_url = self.tag_to_string(article)
            link = article.get('href', False)
            date = ''
            description = ''

            if title_url:
                articles.append({'title': title_url, 'url': link,
                                 'description': description, 'date': date})

        if articles:
            feeds.append((title, articles))

    # return after processing all feeds, not inside the loop
    return feeds
Old 06-23-2010, 12:12 PM   #2187
rty
Zealot
 
Posts: 108
Karma: 6066
Join Date: Apr 2010
Location: Singapore
Device: iPad Air, Kindle DXG, Kindle Paperwhite
Recipe for ifzm China Southern Weekly (in Chinese)

Tested OK on B&N Nook
Attached Files
File Type: zip ifzm - China Southern Weekly.zip (977 Bytes, 210 views)
Old 06-23-2010, 12:40 PM   #2188
kiklop74
Guru
 
 
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Quote:
Originally Posted by mlstein View Post
A second request for subscriber content for the London Review of books, http://www.lrb.co.uk. Anyone?
Will be included in the next release of calibre
Old 06-23-2010, 08:09 PM   #2189
nook.life
Member
 
Posts: 12
Karma: 10
Join Date: May 2010
Device: Nook
Associated Press Broken

Anyone else notice that the AP recipe has been broken for some time now for the Nook?

It only shows the table of contents with the article summaries, but when you go to a specific article, all you get is a banner ad, a newspaper header, and ad images, with no article text. In other articles you get crazy code like

"#lightbox{position:absolute; top:40px; left:0 width:100%; z-index: 100; text-align:center; line height:0;} #lightbox{position:absolute; top:40px; left:0 width:100%; z-index: 100; text-align:center; line height:0;} #lightbox a img {border:none;} #outerImageContainer{ position: relative; background-color: #fff; width: 250px; height 250px; margin:0 auto;} #imageContainer{padding:10px;}" ...etc etc

All this random code is what makes up those articles.

In other articles, you get actual text, but it is cut off, showing only half a page. Changing the font size makes no difference; the text still cuts off mid-sentence.

Anyone know what's going on? All the other recipes that I use every day are normal...
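That #lightbox{...} garbage looks like raw stylesheet text leaking into the article body; a recipe would normally suppress it with no_stylesheets = True or by stripping <style>/<script> tags. The stripping idea, sketched with the standard library (hypothetical class name, not the AP recipe's actual code):

```python
from html.parser import HTMLParser

class StyleStripper(HTMLParser):
    """Collect article text while skipping <style>/<script> blocks,
    whose raw contents would otherwise leak into the output."""
    def __init__(self):
        super().__init__()
        self.out = []
        self._skip = 0  # depth inside style/script tags

    def handle_starttag(self, tag, attrs):
        if tag in ('style', 'script'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('style', 'script') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.out.append(data)

p = StyleStripper()
p.feed('<p>Story text.</p><style>#lightbox{position:absolute;}</style><p>More.</p>')
text = ''.join(p.out)
# text keeps the story and drops the CSS
```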
Old 06-23-2010, 08:41 PM   #2190
nook.life
Member
 
Posts: 12
Karma: 10
Join Date: May 2010
Device: Nook
Quote:
Originally Posted by Starson17 View Post
Try this:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class Explosm(BasicNewsRecipe):
    title               = 'Explosm'
    __author__          = 'Starson17'
    description         = 'Explosm'
    language            = 'en'
    use_embedded_content= False
    no_stylesheets      = True
    linearize_tables      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds    = True
    max_articles_per_feed = 10

    feeds = [
             (u'Explosm Feed', u'http://feeds.feedburner.com/Explosm')
             ]

    def get_article_url(self, article):
        return article.get('link', None)

    keep_only_tags     = [dict(name='div', attrs={'id':'maincontent'})]

    def preprocess_html(self, soup):
        table_tags = soup.findAll('table')
        table_tags[1].extract() 
        NavTag = soup.find(text='&laquo; First') 
        NavTag.parent.parent.extract()
        return soup

    extra_css = '''
                    h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
                    h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
                    p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
                    body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
		'''
Quote:
Originally Posted by Starson17 View Post
I took a look at it. I told you I took a look at it. I asked you a question. You didn't respond, so I stopped. I like to know there's really someone out there.
Wow, don't I feel stupid. I searched the replies for a response before posting the message, but somehow missed it. Even now I had to do a Google search on the forum to find it. Thank you so, so much for looking into this recipe and taking the time to help me out. I really appreciate it. In answer to your question: yes, I looked through those first, but it was not offered.

I tried the recipe out and it almost works. Unfortunately, the cartoon gets cut in half; please see the attached pic. Perhaps blending in rford's code above for rotating cartoons would work. I replaced his code with yours starting at def postprocess_html, but the recipe did not work at all (clearly it could not have been that easy, although I figured I'd try).

Thanks again for your help, and sorry once again for not fully searching the forum before posting the request again. THANK YOUUUUU

http://picturepush.com/public/3679162

Last edited by nook.life; 06-24-2010 at 01:21 PM.