12-21-2009, 02:40 AM | #1 |
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
|
WSJ from Today's Paper (not RSS feeds)
Well I finally downloaded and tried Calibre today. Very nice, and very impressive. (I was hooked by the interesting comment by kovidgoyal in the thread about Sony Daily Reader support for the WSJ, and how you can already do that with Calibre.)
So my primary goal was to create a custom feed that builds an ebook file from the content of the "Today's Paper - US" web page for online subscribers, and it's not working for me yet. Has anyone successfully done this? I'm basing my attempt on this recipe. I get an error on the DefaultProfile import (from calibre.ebooks.lrf.web.profiles import DefaultProfile). If I comment it out, the recipe later complains that DefaultProfile isn't defined. Sorry, I've shut down the PC I used for that, but I can re-run it if the specific error is needed. My guess is that the program has been updated (probably so it can use the .recipe files, which seem to be a recent development) and DefaultProfile has been replaced. Even worse, when I add this custom recipe, it doesn't show the login user/password options. (Do I have to hard-code them as literals?) Anyway, if someone can help out this novice, I'd sure appreciate it! |
12-21-2009, 10:12 AM | #2 |
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That recipe is outdated; the recipe system has changed. Try reading the custom news section of the calibre User Manual to learn how to create custom recipes.
|
12-21-2009, 03:06 PM | #3 | |
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
|
Quote:
(I'll also provide a few tips back on this thread if I get it working, in case someone else is trying to do the same thing.) |
|
12-24-2009, 08:54 PM | #4 |
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
|
Well, I took the existing WSJ feed and basically converted the feeds into a parse_index. The result (from a command-line test) is an error parsing the recipe, which wasn't quite what I hoped for.
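For anyone following along: as far as I understand it, parse_index is expected to hand back a list of (section title, list of article dicts) pairs, which is what the recipe in this post tries to build. A minimal sketch in plain Python 3, with no calibre imports; the helper name build_index and the sample links are made up for illustration and are not part of calibre:

```python
from urllib.parse import urljoin

BASE = 'http://online.wsj.com'  # site root, taken from the recipe's INDEX URL


def build_index(links, section='Todays US WSJ Articles'):
    """Turn (title, href) pairs into the [(section, [article, ...])] shape.

    `links` stands in for the <a> tags scraped from the index page.
    This helper is hypothetical and not part of calibre.
    """
    skip = ('INTERACTIVE MAP', 'SIDEBAR')
    articles = []
    for title, href in links:
        if title in skip:
            continue
        articles.append({
            'title': title.replace('&amp;', '&'),
            'url': urljoin(BASE, href),  # absolute URLs pass through unchanged
            'date': '',
            'description': '',
            'content': '',
        })
    return [(section, articles)]


# Made-up sample data, only to show the resulting structure.
sample = [
    ('Markets &amp; Finance', '/article/SB1.html'),
    ('SIDEBAR', '/article/SB2.html'),
    ('Page One', 'http://online.wsj.com/article/SB3.html'),
]
index = build_index(sample)
for section, articles in index:
    print(section, [a['url'] for a in articles])
```

Using urljoin instead of string concatenation is just a design choice here: it handles both relative and absolute hrefs without a startswith check.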
Question 1: Can I test a recipe from the command line when using a login? Do I have to add it to the recipe directory and/or import it with the Calibre GUI?
Question 2: Python doesn't seem to count blank lines in the source code when reporting errors... what's a nice open source editor that will show matching line numbers (or is there something else going on besides blank lines?)
Here's the new recipe...
Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
__docformat__ = 'restructuredtext en'

from calibre.web.feeds.news import BasicNewsRecipe

# http://online.wsj.com/page/us_in_todays_paper.html

class WallStreetJournal(BasicNewsRecipe):

    title = 'The Wall Street Journal'
    __author__ = 'Kovid Goyal and Sujata Raman'
    description = 'News and current affairs.'
    INDEX = 'http://online.wsj.com/page/us_in_todays_paper.html'
    needs_subscription = True
    language = 'en'
    max_articles_per_feed = 200
    timefmt = ' [%a, %b %d, %Y]'
    no_stylesheets = True

    extra_css = '''h1{color:#093D72 ; font-size:large ; font-family:Georgia,"Century Schoolbook","Times New Roman",Times,serif; }
                   h2{color:#474537; font-family:Georgia,"Century Schoolbook","Times New Roman",Times,serif; font-size:small; font-style:italic;}
                   .subhead{color:gray; font-family:Georgia,"Century Schoolbook","Times New Roman",Times,serif; font-size:small; font-style:italic;}
                   .insettipUnit {color:#666666; font-family:Arial,Sans-serif;font-size:xx-small }
                   .targetCaption{ font-size:x-small; color:#333333; font-family:Arial,Helvetica,sans-serif}
                   .article{font-family :Arial,Helvetica,sans-serif; font-size:x-small}
                   .tagline {color:#333333; font-size:xx-small}
                   .dateStamp {color:#666666; font-family:Arial,Helvetica,sans-serif}
                   h3{color:blue ;font-family:Arial,Helvetica,sans-serif; font-size:xx-small}
                   .byline{color:blue;font-family:Arial,Helvetica,sans-serif; font-size:xx-small}
                   h6{color:#333333; font-family:Georgia,"Century Schoolbook","Times New Roman",Times,serif; font-size:small;font-style:italic; }
                   .paperLocation{color:#666666; font-size:xx-small}'''

    remove_tags_before = dict(name='h1')
    remove_tags = [
        dict(id=["articleTabs_tab_article", "articleTabs_tab_comments", "articleTabs_tab_interactive","articleTabs_tab_video","articleTabs_tab_map","articleTabs_tab_slideshow"]),
        {'class':['footer_columns','network','insetCol3wide','interactive','video','slideshow','map','insettip','insetClose','more_in', "insetContent", 'articleTools_bottom', 'aTools', "tooltip", "adSummary", "nav-inline"]},
        dict(rel='shortcut icon'),
        ]
    remove_tags_after = [dict(id="article_story_body"), {'class':"article story"},]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://commerce.wsj.com/auth/login')
            br.select_form(nr=0)
            br['user'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def postprocess_html(self, soup, first):
        for tag in soup.findAll(name=['table', 'tr', 'td']):
            tag.name = 'div'
        for tag in soup.findAll('div', dict(id=["articleThumbnail_1", "articleThumbnail_2", "articleThumbnail_3", "articleThumbnail_4", "articleThumbnail_5", "articleThumbnail_6", "articleThumbnail_7"])):
            tag.extract()
        return soup

    def get_article_url(self, article):
        try:
            return article.feedburner_origlink.split('?')[0]
        except AttributeError:
            return article.link.split('?')[0]

    def cleanup(self):
        self.browser.open('http://online.wsj.com/logout?url=http://online.wsj.com')

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        for item in soup.findAll(lambda tag: tag.name == 'div' and not tag.attrs):
            a = item.find('a')
            if a and a.has_key('href'):
                url = a['href']
                if not url.startswith('http://'):
                    url = 'http://online.wsj.com'+url
                title = self.tag_to_string(a)
                if title in ('INTERACTIVE MAP', 'SIDEBAR'):
                    continue
                title = title.replace('&amp;', '&')
                date = ''
                description = ''
                articles.append({
                    'title':title, 'date':date, 'url':url,
                    'description':description, 'content':''
                })
        return [('Todays US WSJ Articles', articles)]
Code:
>ebook-convert WSJ3.recipe <theTargetDirectory> --test -vv
Resolved conversion options
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0,
 'book_producer': None,
 'chapter': None,
 'chapter_mark': 'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': None,
 'disable_font_rescaling': False,
 'dont_justify': False,
 'extra_css': None,
 'font_size_mapping': None,
 'footer_regex': '(?i)(?<=<hr>)((\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?\\d+ <br>\\s*.*?\\s*)|(\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?.*?<br>\\s*\\d+))(?=<br>)',
 'header_regex': '(?i)(?<=<hr>)((\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?\\d+ <br>\\s*.*?\\s*)|(\\s*<a name=\\d+></a>((<img.+?>)*<br>\\s*)?.*?<br>\\s*\\d+))(?=<br>)',
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x03E8F110>,
 'insert_blank_line': False,
 'insert_metadata': False,
 'isbn': None,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0,
 'linearize_tables': False,
 'lrf': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'max_toc_links': 50,
 'no_chapters_in_toc': False,
 'no_inline_navbars': False,
 'output_profile': <calibre.customize.profiles.OutputProfile object at 0x03E8F250>,
 'page_breaks_before': None,
 'password': None,
 'prefer_metadata_cover': False,
 'preprocess_html': False,
 'pretty_print': True,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': None,
 'remove_first_image': False,
 'remove_footer': False,
 'remove_header': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'series': None,
 'series_index': None,
 'tags': None,
 'test': True,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'use_auto_toc': False,
 'username': None,
 'verbose': 2}
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
Failed to compile downloaded recipe. Falling back to builtin one
Traceback (most recent call last):
  File "site-packages\calibre\web\feeds\input.py", line 58, in convert
  File "site-packages\calibre\web\feeds\recipes\__init__.py", line 31, in compile_recipe
TypeError: 'NoneType' object is unsubscriptable
Python function terminated unexpectedly
  'NoneType' object is unsubscriptable (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\ebooks\conversion\cli.py", line 249, in main
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 736, in run
  File "site-packages\calibre\customize\conversion.py", line 208, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 71, in convert
  File "site-packages\calibre\web\feeds\recipes\__init__.py", line 31, in compile_recipe
TypeError: 'NoneType' object is unsubscriptable
Thanks! |
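On Question 2: as far as I know, Python's reported line numbers do include blank lines, so a mismatch is more likely down to how the editor numbers lines or to something else entirely. A quick check in plain Python; the file name demo.recipe and the deliberately broken source are made up for the test:

```python
# Four-line source: a statement, two blank lines, then a deliberate syntax error.
source = "x = 1\n\n\ny = = 2\n"

try:
    compile(source, "demo.recipe", "exec")
    error_line = None
except SyntaxError as err:
    # lineno is 1-based and counts every line, blank or not.
    error_line = err.lineno

print("syntax error reported on line", error_line)
```

The error lands on line 4, i.e. the two blank lines are counted. Note also that the traceback above only points into calibre's own files, so it would not name a line of the recipe in any case.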
12-24-2009, 09:18 PM | #5 |
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I just added a "from today's newspaper" recipe for the WSJ to calibre. If you're using the latest calibre you should be able to use it already.
You can pass the username and password on the command line with the --username and --password options. |
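For scripting that, here is a rough sketch that only builds the argument list, reusing the --test and -vv flags from the earlier test run. The helper name ebook_convert_args, the output file name and the credentials are placeholders I made up:

```python
import shlex


def ebook_convert_args(recipe, output, username, password, test=True):
    """Build the argument list for a recipe run that needs a login."""
    args = ['ebook-convert', recipe, output]
    if test:
        args += ['--test', '-vv']  # same flags as the test run earlier in the thread
    args += ['--username', username, '--password', password]
    return args


# Placeholder file names and credentials, for illustration only.
args = ebook_convert_args('WSJ3.recipe', 'wsj.epub', 'me@example.com', 's3cret')
print(' '.join(shlex.quote(a) for a in args))
# To actually run it (calibre must be installed and on the PATH):
#   import subprocess; subprocess.run(args, check=True)
```

Passing a list to subprocess rather than one shell string avoids quoting problems if the password contains spaces or shell characters.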
12-26-2009, 02:38 AM | #6 |
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
|
Thanks! This obviously took more work than I had expected. I plan to use it a lot!!!!
Sent a small donation as well having seen how useful Calibre is going to be for me. Don't normally like to mention that sort of thing publicly, but I'm hoping maybe it will cause a few others benefiting from Calibre to consider doing likewise. You've done an amazing job with it, and I am sure it represents many long hours and days and weeks (and so on) of work!
But for this recipe, I do think one change is needed...
*) Some articles are missing. So I plan to increase the limit of 10 articles per feed. Hopefully it's that simple. Didn't do it yet - I don't want to get shut out of the site for too many downloads.
So now I'm a bit excited about this, and may dabble with a couple more things to enhance the WSJ recipe for my own personal preferences...
*) Bigger fonts. Should be easily fixed in extra_css
*) WSJ top level section menu is strange. This is just an oddity of the WSJ site, not the recipe which reflects the site layout. But all in all, I think it's probably easier to navigate if that top menu layer with the 5 sections (PAGE ONE, PAGE ONE, SECTION B, MONEY AND INVESTING, PERSONAL JOURNAL) is ignored, giving instead just a list of all the articles at the top level. I think I should be able to make that change. Will give it a go at some point when I have a little time to dabble again.
Thanks again - this is great!!! |
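The flattening idea could look something like this. It is a rough sketch in plain Python, assuming the recipe's parse_index returns the usual list of (section title, article list) pairs; the function name flatten_sections, the merged section title and the sample data are made up, and dropping duplicate URLs is my own addition to deal with the repeated PAGE ONE section:

```python
def flatten_sections(index, title='All Articles'):
    """Collapse [(section, [articles])] into one section, skipping repeated URLs."""
    seen = set()
    merged = []
    for _section, articles in index:
        for article in articles:
            if article['url'] in seen:
                continue  # same article already listed under an earlier section
            seen.add(article['url'])
            merged.append(article)
    return [(title, merged)]


# Made-up sample data mimicking the doubled PAGE ONE section.
sample = [
    ('PAGE ONE', [{'title': 'A', 'url': 'http://example.com/a'}]),
    ('PAGE ONE', [{'title': 'A', 'url': 'http://example.com/a'},
                  {'title': 'B', 'url': 'http://example.com/b'}]),
    ('SECTION B', [{'title': 'C', 'url': 'http://example.com/c'}]),
]
flat = flatten_sections(sample)
print(flat[0][0], [a['title'] for a in flat[0][1]])
```

In a recipe, the same function could presumably be applied to whatever parse_index builds just before it returns.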
12-26-2009, 10:05 AM | #7 |
creator of calibre
Posts: 43,853
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Cool, thanks for the donation and if you make any changes that you think will be of general interest, let me know.
|