Calibre recipes - Page 2

ddavtian · 06-12-2008, 09:12 PM

This is based on the published WSJ profile.
I PM'ed you my login name and password; feel free to use them for testing/reading.

Code:

##    Copyright (C) 2008 Kovid Goyal kovid@kovidgoyal.net
##    This program is free software; you can redistribute it and/or modify
##    it under the terms of the GNU General Public License as published by
##    the Free Software Foundation; either version 2 of the License, or
##    (at your option) any later version.
##
##    This program is distributed in the hope that it will be useful,
##    but WITHOUT ANY WARRANTY; without even the implied warranty of
##    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##    GNU General Public License for more details.
##
##    You should have received a copy of the GNU General Public License along
##    with this program; if not, write to the Free Software Foundation, Inc.,
##    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

import time
import re
## from libprs500.ebooks.lrf.web.profiles import DefaultProfile
## from libprs500.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.lrf.web.profiles import DefaultProfile
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class WallStreetJournalPaper(BasicNewsRecipe):
    import time
    import re
    from calibre.web.feeds.news import BasicNewsRecipe
    from calibre.ebooks.lrf.web.profiles import DefaultProfile
    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    title = 'Wall Street Print Edition'
    __author__ = 'Kovid Goyal'
    simultaneous_downloads = 1
    max_articles_per_feed = 200
    INDEX = 'http://online.wsj.com/page/2_0133.html'
    timefmt  = ' [%a, %b %d, %Y]'
    no_stylesheets = False
    html2lrf_options = [('--ignore-tables')]
    issue_date = time.ctime()
    print issue_date

    ## Don't grab articles more than 7 days old
    oldest_article = 7

    def get_browser(self):
        br = DefaultProfile.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://online.wsj.com/login')
            br.select_form(name='login_form')
            br['user']   = self.username
            br['password'] = self.password
            br.submit()
        return br

    preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
        [
        ## Remove anything before the body of the article.
        (r'<body.*?<!-- article start', lambda match: '<body><!-- article start'),

        ## Remove any insets from the body of the article.
        (r'<div id="inset".*?</div>.?</div>.?<p', lambda match : '<p'),

        ## Remove anything after the end of the article.
        (r'<!-- article end.*?</body>', lambda match : '</body>'),
        ]
    ]

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        issue_date = time.ctime()

        for item in soup.findAll('a', attrs={'class':'bold80'}):
            a = item.find('a')
            if a and a.has_key('href'):
                url = item['href']
                url = 'http://online.wsj.com'+url.replace('/article', '/article_print')
                title = self.tag_to_string(item)
                description = ''
                articles.append({
                    'title':title,
                    'date':date,
                    'url':url,
                    'description':description
                    })

        return {'Todays Paper' : articles }
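
The three preprocess_regexps entries in the recipe above can be tried outside calibre. The following is only a sketch: the sample HTML and the preprocess() helper are invented for illustration, and it is written for Python 3, whereas the recipe itself targets the Python 2 that calibre shipped at the time.

```python
import re

# Same three patterns as in the recipe; only the sample HTML below is made up.
preprocess_regexps = [
    (re.compile(pattern, re.IGNORECASE | re.DOTALL), repl)
    for pattern, repl in [
        # Remove anything before the body of the article.
        (r'<body.*?<!-- article start', lambda match: '<body><!-- article start'),
        # Remove any insets from the body of the article.
        (r'<div id="inset".*?</div>.?</div>.?<p', lambda match: '<p'),
        # Remove anything after the end of the article.
        (r'<!-- article end.*?</body>', lambda match: '</body>'),
    ]
]


def preprocess(html):
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, repl in preprocess_regexps:
        html = pattern.sub(repl, html)
    return html


sample = (
    '<html><body><div>nav</div><!-- article start -->'
    '<div id="inset"><div>chart</div>\n</div>\n<p>Story text.</p>'
    '<!-- article end --><div>footer</div></body></html>'
)
cleaned = preprocess(sample)
print(cleaned)
```

If the real page does not contain the "article start" and "article end" comments, none of the patterns match and the HTML passes through unchanged.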

kovidgoyal · 06-12-2008, 09:23 PM

Code:

return [('Todays newspaper', articles)]

Incidentally, how is the WSJ doing post-Murdoch?

ddavtian · 06-12-2008, 11:26 PM

I started reading it this year (being able to read on Sony was a big factor for me), so I cannot compare before-after.

ddavtian · 07-04-2008, 09:52 PM

Quote:

Originally Posted by kovidgoyal

post your recipe

Hi Kovid. Did you have a chance to look at this posted recipe? I understand if you do not have time to look at individual recipes.

Thanks for the great software,
David

kovidgoyal · 07-05-2008, 01:39 PM

Your return statement should be:

Code:

return [('Today\'s Paper', articles)]
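
To make the shape of that return value concrete, here is a small sketch of the list-of-tuples structure, using the four article keys from the posted recipe. The helper names make_article and make_index and the example URL are hypothetical; only the overall shape comes from the posts above.

```python
def make_article(title, url, date='', description=''):
    """Build one article entry with the four keys the recipe uses."""
    return {'title': title, 'date': date, 'url': url, 'description': description}


def make_index(feed_title, articles):
    """Wrap the articles as a list of (feed title, article list) tuples."""
    return [(feed_title, articles)]


# The URL is a placeholder, not a real article.
articles = [make_article('Example headline',
                         'http://online.wsj.com/article_print/EXAMPLE.html')]
index = make_index("Today's Paper", articles)

# A consumer can unpack each entry as a (title, articles) pair.
for feed_title, feed_articles in index:
    print(feed_title, len(feed_articles))
```

The difference from the original recipe is only the outer container: a list of tuples instead of a dict keyed by the feed title.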

ddavtian · 07-06-2008, 12:33 AM

Quote:

Originally Posted by kovidgoyal

Your return statement should be:

Code:

return [('Today\'s Paper', articles)]

You had said this 3 weeks ago and I didn't get it then :-(

I tried it and got a new error:
Traceback (most recent call last):
File "convert_from.py", line 61, in <module>
File "convert_from.py", line 42, in main
File "calibre\web\feeds\main.pyo", line 128, in run_recipe
File "calibre\web\feeds\news.pyo", line 825, in __init__
File "calibre\ebooks\lrf\web\profiles\__init__.pyo" , line 174, in __init__
File "calibre\ebooks\lrf\web\profiles\__init__.pyo" , line 204, in build_index
AttributeError: 'list' object has no attribute 'keys'

I put in a few print statements to track the flow; it never gets into this loop:
for item in soup.findAll('a', attrs={'class':'bold80'}):

I checked the web page and nothing has changed there. Articles are identified correctly. Here is a link from the page source:
<a class="bold80" href="/article/SB121521047990229423.html?mod=todays_us_page_one">

Kovid, your help is very much appreciated.
Thanks in advance.
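
In the link quoted above, the href sits on the bold80 anchor itself, so it can be read from the matched tag directly rather than from a nested one. A minimal sketch of that idea follows; it uses the standard library's html.parser instead of calibre's bundled BeautifulSoup so that it runs on its own under Python 3, and the names Bold80Links and print_url are hypothetical.

```python
from html.parser import HTMLParser


class Bold80Links(HTMLParser):
    """Collect href values from <a class="bold80"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The href is an attribute of the matched anchor, not of a child tag.
        if tag == 'a' and attrs.get('class') == 'bold80' and attrs.get('href'):
            self.hrefs.append(attrs['href'])


def print_url(href):
    """Turn a relative article link into an absolute print-view URL."""
    return 'http://online.wsj.com' + href.replace('/article', '/article_print', 1)


parser = Bold80Links()
parser.feed('<a class="bold80" '
            'href="/article/SB121521047990229423.html?mod=todays_us_page_one">'
            'Headline</a>')
urls = [print_url(h) for h in parser.hrefs]
print(urls)
```

The URL rewrite is the same string replacement as in the recipe, limited to the first occurrence of '/article'.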

kovidgoyal · 07-06-2008, 01:21 AM

Use the command feeds2lrf, not web2lrf.

ddavtian · 07-06-2008, 02:15 AM

Error is from feeds2lrf (I have 0.4.76 calibre):

C:\Temp\News>feeds2lrf --debug wsjNew.py --username=xxx --password=xxx
Fetching feeds...
Sat Jul 05 22:12:09 2008
Traceback (most recent call last):
File "convert_from.py", line 61, in <module>
File "convert_from.py", line 42, in main
File "calibre\web\feeds\main.pyo", line 128, in run_recipe
File "calibre\web\feeds\news.pyo", line 825, in __init__
File "calibre\ebooks\lrf\web\profiles\__init__.pyo" , line 174, in __init__
File "calibre\ebooks\lrf\web\profiles\__init__.pyo" , line 204, in build_index
AttributeError: 'list' object has no attribute 'keys'
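
The last frames of this traceback are in the older profiles module, and the message is what Python raises when .keys() is called on a list. Assuming that code path expects a dict (a guess from the traceback, not something confirmed from calibre's source), the mismatch can be reproduced in isolation. The helper feed_titles is hypothetical and only shows how a consumer could accept either shape.

```python
def feed_titles(index):
    """Return feed titles from a dict or from a list of (title, articles) tuples."""
    if isinstance(index, dict):
        return list(index.keys())
    return [title for title, _articles in index]


as_list = [("Today's Paper", [])]

# What a dict-only consumer would attempt on the list-shaped value.
try:
    as_list.keys()
    message = ''
except AttributeError as err:
    message = str(err)

print(message)
print(feed_titles(as_list), feed_titles({"Today's Paper": []}))
```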

kovidgoyal · 07-06-2008, 12:36 PM

Delete the line

Code:

from calibre.ebooks.lrf.web.profiles import DefaultProfile

ddavtian · 07-06-2008, 08:19 PM

The same error:

Sun Jul 06 16:14:26 2008
Traceback (most recent call last):
File "convert_from.py", line 61, in <module>
File "convert_from.py", line 42, in main
File "calibre\web\feeds\main.pyo", line 128, in run_recipe
File "calibre\web\feeds\news.pyo", line 825, in __init__
File "calibre\ebooks\lrf\web\profiles\__init__.pyo" , line 174, in __init__
File "calibre\ebooks\lrf\web\profiles\__init__.pyo" , line 204, in build_index
AttributeError: 'list' object has no attribute 'keys'

kovidgoyal · 07-07-2008, 03:08 PM

The attached recipe works for me with the command line

Code:

feeds2lrf test.py

Recipe:

Code:

##    Copyright (C) 2008 Kovid Goyal kovid@kovidgoyal.net
##    This program is free software; you can redistribute it and/or modify
##    it under the terms of the GNU General Public License as published by
##    the Free Software Foundation; either version 2 of the License, or
##    (at your option) any later version.
##
##    This program is distributed in the hope that it will be useful,
##    but WITHOUT ANY WARRANTY; without even the implied warranty of
##    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
##    GNU General Public License for more details.
##
##    You should have received a copy of the GNU General Public License along
##    with this program; if not, write to the Free Software Foundation, Inc.,
##    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
 

import time
import re
## from libprs500.ebooks.lrf.web.profiles import DefaultProfile
## from libprs500.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class WallStreetJournalPaper(BasicNewsRecipe): 
    import time
    import re
    from calibre.web.feeds.news import BasicNewsRecipe
    from calibre.ebooks.lrf.web.profiles import DefaultProfile
    from calibre.ebooks.BeautifulSoup import BeautifulSoup
    
    title = 'Wall Street Print Edition' 
    __author__ = 'Kovid Goyal'
    simultaneous_downloads = 1    
    max_articles_per_feed = 200
    INDEX = 'http://online.wsj.com/page/2_0133.html'
    timefmt  = ' [%a, %b %d, %Y]' 
    no_stylesheets = False
    html2lrf_options = [('--ignore-tables')]
    issue_date = time.ctime()
    print issue_date




    ## Don't grab articles more than 7 days old 
    oldest_article = 7

    def get_browser(self): 
        br = DefaultProfile.get_browser() 
        if self.username is not None and self.password is not None: 
            br.open('http://online.wsj.com/login') 
            br.select_form(name='login_form') 
            br['user']   = self.username 
            br['password'] = self.password 
            br.submit() 
        return br 
   
    preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in  
        [ 
        ## Remove anything before the body of the article. 
        (r'<body.*?<!-- article start', lambda match: '<body><!-- article start'), 
 
        ## Remove any insets from the body of the article. 
        (r'<div id="inset".*?</div>.?</div>.?<p', lambda match : '<p'), 
 
        ## Remove anything after the end of the article. 
        (r'<!-- article end.*?</body>', lambda match : '</body>'), 
        ] 
    ] 
 
 
     
    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        issue_date = time.ctime()
        
        for item in soup.findAll('a', attrs={'class':'bold80'}):
            a = item.find('a')
            if a and a.has_key('href'):
                url = item['href']
                url = 'http://online.wsj.com'+url.replace('/article', '/article_print')
                title = self.tag_to_string(item)
                description = ''
                articles.append({
                    'title':title,
                    'date':date,
                    'url':url,
                    'description':description
                    })
               
    
        return [('Todays Paper', articles)]

ddavtian · 07-08-2008, 03:17 AM

Thank you Kovid!

Your recipe ran fine from the command line. The output was an empty file; I think it's related to my login to the page. They block access if a few logins are made from different computers. I'll try again tomorrow.

ddavtian · 07-09-2008, 11:21 AM

No luck with WSJ so far.

When I use the posted recipe, I get an empty file. It does find articles (a = item.find('a')), but doesn't pass this condition: "if a and a.has_key('href'):".

When I remove this condition, it gets articles (I print titles and see all of them from the web page), but fails at the end:

Traceback (most recent call last):
File "convert_from.py", line 61, in <module>
File "convert_from.py", line 42, in main
File "calibre\web\feeds\main.pyo", line 134, in run_recipe
File "calibre\web\feeds\news.pyo", line 472, in download
File "calibre\web\feeds\news.pyo", line 578, in build_index
File "c:\docume~1\davidd~1\locals~1\temp\calibre_0.4.76 _j-dnk5_recipes\recipe0.py", line 89, in parse_index
print title
File "encodings\cp437.pyo", line 12, in encode
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2026' in position 5: character maps to <undefined>
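
This last error comes from the debugging print, not from the article parsing: the Windows console encoding (cp437) has no ellipsis character (U+2026). One way to keep such debug output from crashing is to replace unencodable characters before printing. The sketch below is written for Python 3 with a made-up title; console_safe is a hypothetical helper, and in the Python 2 of this thread the rough equivalent would be title.encode('cp437', 'replace').

```python
def console_safe(text, encoding='cp437'):
    """Replace characters the console encoding cannot represent with '?'."""
    return text.encode(encoding, errors='replace').decode(encoding)


# Made-up title with an ellipsis at position 5, as in the traceback.
title = 'Stock\u2026'
safe = console_safe(title)
print(safe)
```

The article titles themselves are left untouched; only the text sent to the console is altered.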

kovidgoyal · 07-09-2008, 12:06 PM

Can you send me your WSJ username and password again? I need them to debug further.

ddavtian · 07-09-2008, 01:23 PM

Quote:

Originally Posted by kovidgoyal

Can you send me your WSJ username and password again? I need them to debug further.

Sent.

I logged out of the page, so you should be able to log in. If I try the calibre recipe a few times in a row, they lock the account. It then takes 5-6 hours to get access again, which makes testing changes painful.

Thanks in advance.
