|
|
#16 |
|
Addict
Posts: 274
Karma: 332
Join Date: Nov 2003
Location: San Francisco, USA
Device: Sage, Elipsa, Oasis, Galaxy Tab 8U, S22U
|
This is based on the published WSJ profile.
I have PMed you my login name and password; feel free to use them for testing or reading.
PHP Code:
|
|
|
|
|
|
#17 |
|
creator of calibre
Posts: 45,631
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
return [('Todays newspaper', articles)]
|
|
|
|
|
|
|
|
#18 |
|
Addict
|
I started reading it this year (being able to read it on the Sony was a big factor for me), so I cannot compare before and after.
|
|
|
|
|
|
#19 |
|
Addict
|
|
|
|
|
|
|
#20 |
|
creator of calibre
|
Your return statement should be:
Code:
return [('Today\'s Paper', articles)]
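To make the expected shape concrete, here is a minimal standalone sketch (written for current Python 3, not the 2008-era Python 2 that calibre 0.4 used). It checks that a value looks like a list of (section title, articles) pairs. The helper name `check_index` and the set of required keys are assumptions for illustration; the keys are simply the ones the recipe in this thread puts in each article dict, not a documented calibre requirement.

```python
# Hypothetical helper: checks the shape that the return statement above implies.
# REQUIRED_KEYS mirrors the dict keys used by the recipe in this thread.
REQUIRED_KEYS = {'title', 'url', 'date', 'description'}

def check_index(index):
    """Return True if index is a list of (section_title, [article dicts]) pairs."""
    if not isinstance(index, list):
        raise TypeError('index must be a list, got %s' % type(index).__name__)
    for entry in index:
        if not (isinstance(entry, tuple) and len(entry) == 2):
            raise ValueError('each entry must be a (section_title, articles) tuple')
        section_title, articles = entry
        for article in articles:
            missing = REQUIRED_KEYS - set(article)
            if missing:
                raise ValueError('article is missing keys: %s' % sorted(missing))
    return True

articles = [{
    'title': 'Example headline',   # made-up sample data
    'url': 'http://online.wsj.com/article_print/example.html',
    'date': 'Sat Jul 05 22:12:09 2008',
    'description': '',
}]
print(check_index([("Today's Paper", articles)]))
```

Running a check like this on the value built in parse_index, before handing it to calibre, can help separate shape problems from download problems.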
|
|
|
|
|
|
|
|
#21 | |
|
Addict
|
Quote:
I tried it and got a new error:
Traceback (most recent call last):
  File "convert_from.py", line 61, in <module>
  File "convert_from.py", line 42, in main
  File "calibre\web\feeds\main.pyo", line 128, in run_recipe
  File "calibre\web\feeds\news.pyo", line 825, in __init__
  File "calibre\ebooks\lrf\web\profiles\__init__.pyo", line 174, in __init__
  File "calibre\ebooks\lrf\web\profiles\__init__.pyo", line 204, in build_index
AttributeError: 'list' object has no attribute 'keys'
I put in a few print statements to track the flow; it never gets into this loop:
for item in soup.findAll('a', attrs={'class':'bold80'}):
I checked the web page, and nothing has changed there. Articles are identified correctly. Here is a link from the source code:
<a class="bold80" href="/article/SB121521047990229423.html?mod=todays_us_page_one">
Kovid, your help is very much appreciated. Thanks in advance.
|
|
|
|
|
|
#22 |
|
creator of calibre
|
Use the command feeds2lrf, not web2lrf.
|
|
|
|
|
|
#23 |
|
Addict
|
Error is from feeds2lrf (I have 0.4.76 calibre):
C:\Temp\News>feeds2lrf --debug wsjNew.py --username=xxx --password=xxx
Fetching feeds...
Sat Jul 05 22:12:09 2008
Traceback (most recent call last):
  File "convert_from.py", line 61, in <module>
  File "convert_from.py", line 42, in main
  File "calibre\web\feeds\main.pyo", line 128, in run_recipe
  File "calibre\web\feeds\news.pyo", line 825, in __init__
  File "calibre\ebooks\lrf\web\profiles\__init__.pyo", line 174, in __init__
  File "calibre\ebooks\lrf\web\profiles\__init__.pyo", line 204, in build_index
AttributeError: 'list' object has no attribute 'keys'
|
|
|
|
|
#24 |
|
creator of calibre
|
Delete the line
Code:
from calibre.ebooks.lrf.web.profiles import DefaultProfile
|
|
|
|
|
|
#25 |
|
Addict
|
The same error:
Sun Jul 06 16:14:26 2008
Traceback (most recent call last):
  File "convert_from.py", line 61, in <module>
  File "convert_from.py", line 42, in main
  File "calibre\web\feeds\main.pyo", line 128, in run_recipe
  File "calibre\web\feeds\news.pyo", line 825, in __init__
  File "calibre\ebooks\lrf\web\profiles\__init__.pyo", line 174, in __init__
  File "calibre\ebooks\lrf\web\profiles\__init__.pyo", line 204, in build_index
AttributeError: 'list' object has no attribute 'keys'
|
|
|
|
|
#26 |
|
creator of calibre
|
The attached recipe works for me with the command line
Code:
feeds2lrf test.py
Code:
## Copyright (C) 2008 Kovid Goyal kovid@kovidgoyal.net
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License along
## with this program; if not, write to the Free Software Foundation, Inc.,
## 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
import time
import re
## from libprs500.ebooks.lrf.web.profiles import DefaultProfile
## from libprs500.ebooks.BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class WallStreetJournalPaper(BasicNewsRecipe):
    import time
    import re
    from calibre.web.feeds.news import BasicNewsRecipe
    from calibre.ebooks.lrf.web.profiles import DefaultProfile
    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    title = 'Wall Street Print Edition'
    __author__ = 'Kovid Goyal'
    simultaneous_downloads = 1
    max_articles_per_feed = 200
    INDEX = 'http://online.wsj.com/page/2_0133.html'
    timefmt = ' [%a, %b %d, %Y]'
    no_stylesheets = False
    html2lrf_options = [('--ignore-tables')]
    issue_date = time.ctime()
    print issue_date
    ## Don't grab articles more than 7 days old
    oldest_article = 7

    def get_browser(self):
        br = DefaultProfile.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://online.wsj.com/login')
            br.select_form(name='login_form')
            br['user'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
        [
        ## Remove anything before the body of the article.
        (r'<body.*?<!-- article start', lambda match: '<body><!-- article start'),
        ## Remove any insets from the body of the article.
        (r'<div id="inset".*?</div>.?</div>.?<p', lambda match : '<p'),
        ## Remove anything after the end of the article.
        (r'<!-- article end.*?</body>', lambda match : '</body>'),
        ]
    ]

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        issue_date = time.ctime()
        for item in soup.findAll('a', attrs={'class':'bold80'}):
            a = item.find('a')
            if a and a.has_key('href'):
                url = item['href']
                url = 'http://online.wsj.com'+url.replace('/article', '/article_print')
                title = self.tag_to_string(item)
                description = ''
                articles.append({
                    'title':title,
                    'date':date,
                    'url':url,
                    'description':description
                    })
        return [('Todays Paper', articles)]
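The three preprocess_regexps rules can be tried outside calibre. The sketch below is a standalone illustration in current Python 3: it reuses the patterns from the recipe and applies them in order with `re.sub`. The sample HTML and the `clean` helper are made up for this example, and how calibre itself applies the rules internally is not shown here.

```python
import re

# Patterns copied from the recipe; each pair is (regex, replacement function).
RULES = [
    # Remove anything before the body of the article.
    (r'<body.*?<!-- article start', lambda match: '<body><!-- article start'),
    # Remove any insets from the body of the article.
    (r'<div id="inset".*?</div>.?</div>.?<p', lambda match: '<p'),
    # Remove anything after the end of the article.
    (r'<!-- article end.*?</body>', lambda match: '</body>'),
]
COMPILED = [(re.compile(p, re.IGNORECASE | re.DOTALL), f) for p, f in RULES]

def clean(html):
    """Apply each rule in order (hypothetical helper, not a calibre API)."""
    for pattern, replacement in COMPILED:
        html = pattern.sub(replacement, html)
    return html

# Made-up page with navigation, an inset and a footer around the article.
SAMPLE = (
    '<html><body><div>nav</div><!-- article start -->'
    '<p>Story text.</p>'
    '<div id="inset"><div>ad</div></div><p>More.</p>'
    '<!-- article end --><div>footer</div></body></html>'
)

print(clean(SAMPLE))
# prints: <html><body><!-- article start --><p>Story text.</p><p>More.</p></body></html>
```

Testing the patterns on a saved copy of one article this way avoids logging in to the site for every regex tweak.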
|
|
|
|
|
|
#27 |
|
Addict
|
Thank you Kovid!
Your recipe ran fine from the command line. The output was an empty file; I think it's related to my login to the site. They block access if a few logins are made from different computers. I'll try again tomorrow.
|
|
|
|
|
#28 |
|
Addict
|
No luck with WSJ so far.
When I use the posted recipe, I get an empty file. It does find articles (a = item.find('a')), but they don't pass this condition: "if a and a.has_key('href'):". When I remove this condition, it gets the articles (I print the titles and see all of them from the web page), but it fails at the end:
Traceback (most recent call last):
  File "convert_from.py", line 61, in <module>
  File "convert_from.py", line 42, in main
  File "calibre\web\feeds\main.pyo", line 134, in run_recipe
  File "calibre\web\feeds\news.pyo", line 472, in download
  File "calibre\web\feeds\news.pyo", line 578, in build_index
  File "c:\docume~1\davidd~1\locals~1\temp\calibre_0.4.76_j-dnk5_recipes\recipe0.py", line 89, in parse_index
    print title
  File "encodings\cp437.pyo", line 12, in encode
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2026' in position 5: character maps to <undefined>
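One possible reading of the failing condition: each item returned by findAll('a', ...) is already the link itself, so item.find('a') would be searching inside the link for a second, nested link and would likely come back empty. The sketch below illustrates that idea with the standard library's ElementTree as a stand-in for calibre's bundled BeautifulSoup (an assumption made so the example runs anywhere, in current Python 3). The wrapper div, the headline text and the `collect_articles` helper are invented; the link attributes are the ones quoted earlier in the thread.

```python
import xml.etree.ElementTree as ET

# Link attributes taken from the thread; wrapper <div> and text are made up.
SAMPLE = (
    '<div>'
    '<a class="bold80" '
    'href="/article/SB121521047990229423.html?mod=todays_us_page_one">'
    'Example headline</a>'
    '</div>'
)

def collect_articles(markup):
    """Read href from the matched link itself rather than from a nested one."""
    articles = []
    for item in ET.fromstring(markup).iter('a'):
        if item.get('class') != 'bold80':
            continue
        href = item.get('href')      # the matched element carries the href
        if not href:
            continue
        url = 'http://online.wsj.com' + href.replace('/article', '/article_print')
        articles.append({'title': (item.text or '').strip(), 'url': url})
    return articles

link = next(ET.fromstring(SAMPLE).iter('a'))
print(link.find('a'))                # None: there is no link nested in the link
print(collect_articles(SAMPLE))
```

In the recipe, the equivalent change would be to test the item directly for an href instead of calling item.find('a') first; whether that matches the live page structure would need checking against the real markup.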
|
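The UnicodeEncodeError in the traceback comes from printing a title that contains U+2026 (an ellipsis) to a console using the cp437 code page, which has no such character. A minimal sketch of one workaround, in current Python 3 rather than the Python 2 of the thread: replace unencodable characters before printing. The helper name `console_safe` and the sample title are assumptions for illustration.

```python
def console_safe(text, encoding='cp437'):
    """Replace characters the console encoding cannot represent with '?'."""
    return text.encode(encoding, errors='replace').decode(encoding)

# Made-up title with the ellipsis at position 5, as in the traceback.
title = 'Title\u2026'
print(console_safe(title))
# prints: Title?
```

This only affects debug output; the article titles stored in the index can stay as full Unicode.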
|
|
|
|
#29 |
|
creator of calibre
|
Can you send me your WSJ username and password again? I need them to debug further.
|
|
|
|
|
|
#30 | |
|
Addict
|
Quote:
I logged out from the page, so you should be able to log in. If I try the calibre recipe a few times in a row, they lock the account. Then it takes 5-6 hours to get access again, which makes it painful to test changes. Thanks in advance.
|
|
|
|