Old 02-17-2011, 08:41 PM   #1
Jmot
Junior Member
Jmot began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: Kindle
Help Please: remove_tags doesn't work in WSJ Chinese

Hello,

I modified a recipe for WSJ Chinese and used remove_tags and remove_tags_after to strip the unwanted navigation bars and links. Unfortunately, it didn't work. Could someone please take a look and offer some suggestions? Thanks a lot.


from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1277443666(BasicNewsRecipe):
title = u'x WSJ 華爾街日報'
oldest_article = 32
max_articles_per_feed = 3

feeds = [
(u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
(u'\u7279\u5BEB', u'http://chinese.wsj.com/big5/rss02.xml'),
#(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
#(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
#(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
#(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
#(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
#(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
#(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
]

remove_tags = [dict(name='div', attrs={'class':['homepage']})]
remove_tags_after = dict(id='bodypart')
remove_javascript = True
Old 02-18-2011, 11:07 AM   #2
sorin
Connoisseur
sorin began at the beginning.
 
Posts: 73
Karma: 44
Join Date: Sep 2010
Device: kindle 3
I'm not an expert, but I think you have to set keep_only_tags. It should match the tag that contains your article.
Check the other recipe examples in the folder where Calibre is installed: Calibre2\resources\recipes.
I use sciencedaily.recipe as a template.
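
For what it's worth, here is a minimal sketch of how keep_only_tags and remove_tags might sit together in a recipe. The class name WSJChineseSketch and the fallback BasicNewsRecipe stub are made up for illustration (the stub only lets the file be syntax-checked outside calibre). The feed URL, the bodypart id and the homepage class are taken from the first post and have not been verified against the live site.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Hypothetical stand-in, NOT part of calibre: it only allows this file
    # to be loaded and syntax-checked when calibre is not importable.
    class BasicNewsRecipe(object):
        pass


class WSJChineseSketch(BasicNewsRecipe):
    # Every attribute is indented so that it belongs to the class body.
    title = u'x WSJ 華爾街日報'
    oldest_article = 32
    max_articles_per_feed = 2

    feeds = [
        (u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
    ]

    # Keep only the element that is assumed to wrap the article text.
    keep_only_tags = [dict(name='div', attrs={'id': 'bodypart'})]
    # Drop unwanted blocks (class name taken from the first post).
    remove_tags = [dict(name='div', attrs={'class': ['homepage']})]
    remove_javascript = True
```

As far as I understand, calibre applies keep_only_tags first and remove_tags afterwards, so remove_tags only has to cover what is left inside the kept element.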

Last edited by sorin; 02-18-2011 at 12:10 PM.
Old 02-18-2011, 11:17 PM   #3
Jmot
Junior Member
Jmot began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: Kindle
Well, I did use keep_only_tags as follows, and I confirmed that there is a <div id="bodypart"> in the HTML. Unfortunately, it still does not work, so I'm wondering if I'm missing something. Any suggestions? Thanks.

===
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1277443666(BasicNewsRecipe):
title = u'x WSJ 華爾街日報'
oldest_article = 32
max_articles_per_feed = 2

feeds = [
(u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
(u'Report', u'http://chinese.wsj.com/big5/rss02.xml'),
#(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
#(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
#(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
#(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
#(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
#(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
#(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
#(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
]

remove_javascript = True

keep_only_tags = [
dict(name='div', attrs={'id':'bodypart'})
]

# remove_tags = [dict(name='div', attrs={'class':['homepage']})]
#remove_tags_after = dict(id='bodypart')
#remove_javascript = True
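
The check mentioned at the top of this post (that the page contains a div with id "bodypart") can also be scripted rather than done by eye. The following is only a sketch using Python's standard html.parser; IdFinder, has_element and the sample HTML string are hypothetical names and data invented for illustration, and this is not how calibre itself matches tags.

```python
from html.parser import HTMLParser


class IdFinder(HTMLParser):
    """Records every (tag name, id) pair seen while parsing."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if 'id' in attr_map:
            self.found.append((tag, attr_map['id']))


def has_element(html, name, element_id):
    """Return True if the HTML has a <name> tag with the given id."""
    finder = IdFinder()
    finder.feed(html)
    return (name, element_id) in finder.found


# Made-up sample page; replace it with the HTML saved from a real article.
sample = ('<html><body><div class="homepage">nav</div>'
          '<div id="bodypart"><p>text</p></div></body></html>')
print(has_element(sample, 'div', 'bodypart'))
```

It is probably worth running this against the HTML that was actually downloaded rather than against what the browser displays, since the two can differ.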
Old 02-19-2011, 03:47 AM   #4
sorin
Connoisseur
sorin began at the beginning.
 
Posts: 73
Karma: 44
Join Date: Sep 2010
Device: kindle 3
You have to put some spaces before the variables (like title) in your class, so that the class body is indented. Check this recipe:

from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1277443666(BasicNewsRecipe):
    title = u'x WSJ 華爾街日報'
    oldest_article = 32
    max_articles_per_feed = 2

    feeds = [
        (u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
        (u'Report', u'http://chinese.wsj.com/big5/rss02.xml'),
        #(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
        #(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
        #(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
        #(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
        #(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
        #(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
        #(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
        #(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
    ]

    remove_javascript = True

    keep_only_tags = [
        dict(name='div', attrs={'id':'bodypart'})
    ]

    # remove_tags = [dict(name='div', attrs={'class':['homepage']})]
    #remove_tags_after = dict(id='bodypart')
    #remove_javascript = True
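
The point about spaces can be checked without calibre at all. This is a small sketch, with a made-up Demo class and a hypothetical compiles helper, showing that Python rejects a class whose body is not indented:

```python
# Two versions of the same tiny class: one flat, one indented.
flat = (
    "class Demo(object):\n"
    "title = u'x'\n"
)
indented = (
    "class Demo(object):\n"
    "    title = u'x'\n"
)


def compiles(source):
    """Return True if the source text is syntactically valid Python."""
    try:
        compile(source, '<recipe>', 'exec')
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


print(compiles(flat), compiles(indented))  # prints: False True
```

If the real recipe file loads in calibre without a syntax error, the indentation is probably fine already and the forum has simply stripped the leading spaces from the post.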


---------------
Read this thread; it has a command line that is very useful for testing recipes.

Last edited by sorin; 02-19-2011 at 04:16 AM.
Old 02-20-2011, 11:55 PM   #5
Jmot
Junior Member
Jmot began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: Kindle
Hi Sorin,

I did put spaces before each line, and the result is the same. Any other suggestions? Thanks.
Old 02-21-2011, 05:10 AM   #6
sorin
Connoisseur
sorin began at the beginning.
 
Posts: 73
Karma: 44
Join Date: Sep 2010
Device: kindle 3
Copy your recipe into the folder where Calibre is installed and run this at the command prompt:
C:\Program Files\Calibre2>ebook-convert YourRecipe.recipe D:\temp --test -vv
Check the console for errors, and check index.html in D:\temp.
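
If you would rather drive that test run from a script, the same command can be assembled in Python. This is only a sketch: build_test_command and run_recipe_test are hypothetical helper names, the paths are the placeholders from the command above, and it assumes ebook-convert is on the PATH.

```python
import shutil
import subprocess


def build_test_command(recipe_path, output_dir):
    """Assemble the ebook-convert call used above as an argument list."""
    # --test and -vv are the same options as in the command prompt line.
    return ['ebook-convert', recipe_path, output_dir, '--test', '-vv']


def run_recipe_test(recipe_path, output_dir):
    """Run the test conversion and return its exit code (None if skipped)."""
    if shutil.which('ebook-convert') is None:
        print('ebook-convert not found on PATH')
        return None
    return subprocess.run(build_test_command(recipe_path, output_dir)).returncode


# Only prints the command; call run_recipe_test(...) to actually execute it.
print(build_test_command('YourRecipe.recipe', r'D:\temp'))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as C:\Program Files\Calibre2.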
Tags
recipe, remove_tags, remove_tags_after
