11-03-2010, 04:47 AM | #1 |
Member
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
|
Help with print_url and/or split_url
Since I'm a newbie, I try to learn from the examples I find here. I created a recipe, but I get an "unexpected indent" error in the part with print_version.
The task is (or should be) simple: replace the article URL with the print-version URL.
When I try to add/update the recipe, I get the error mentioned above on line 56 (that's the line: return print_url). Can someone please take a look and help me out? Code:
__license__ = 'GPL v3'
__copyright__ = '2010, BlonG'
'''
www.rtvslo.si
'''
from calibre.web.feeds.news import BasicNewsRecipe

class MMCRTV(BasicNewsRecipe):
    title = u'MMC RTV'
    __author__ = u'BlonG' # 10

    description = u"Prvi interaktivni multimedijski portal, MMC RTV Slovenija"
    oldest_article = 3
    max_articles_per_feed = 20
    encoding = 'cp1250'
    language = 'sl'
    no_stylesheets = True
    use_embedded_content = False

    cover_url = 'http://img.rtvslo.si/_static/images/rtvportal_logo.png' # 20

    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif;
        font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif;
        font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''
    html2lrf_options = ['--base-font-size', '10'] # 30

    # keep_only_tags = [
    #     dict(name='div', attrs={'id':'contents'}),
    #     dict(name='div', attrs={'class':'entry-content'}),
    # ]

    # remove_tags = [
    #     dict(name='div', attrs={'class':'fb_article_top'}),
    #     dict(name='div', attrs={'class':'related'}),
    #     dict(name='div', attrs={'class':'fb_article_foot'}), # 40
    #     dict(name='div', attrs={'class':'spreading'}),
    #     dict(name='dl', attrs={'class':'ad'}),

    #     dict(name='p', attrs={'class':'report'}),
    #     dict(name='div', attrs={'class':'hfeed comments'}),
    #     dict(name='dl', attrs={'id':'entryPanel'}),
    #     dict(name='dl', attrs={'class':'infopush ip_wide'}),
    #     dict(name='div', attrs={'class':'sidebar'}),
    #     dict(name='dl', attrs={'class':'bottom'}),
    #     dict(name='div', attrs={'id':'footer'}), # 50
    # ]

    def print_version(self, url):
        split_url = url.split("/")
        print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[1]
        return print_url

    feeds = [
         (u'Vse novice', u'http://www.rtvslo.si/feeds/00.xml')
        ,(u'Okolje', u'http://www.rtvslo.si/feeds/12.xml')
        ,(u'Znanost in tehnologija', u'http://www.rtvslo.si/feeds/09.xml')
        ,(u'Zabava', u'http://www.rtvslo.si/feeds/06.xml')
    ]
|
11-03-2010, 03:59 PM | #2 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
You have mixed tabs and spaces. Delete all the tabs preceding the lines after def print_version(self, url): and replace with spaces. I use an editor (UltraEdit) that does this by default.
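For anyone whose editor does not convert tabs automatically, here is a minimal sketch of how the offending lines could be found and fixed with a few lines of Python. The helper names `lines_with_tab_indent` and `tabs_to_spaces` and the sample text are made up for illustration; they are not part of calibre.

```python
def lines_with_tab_indent(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            flagged.append(number)
    return flagged


def tabs_to_spaces(source, width=4):
    """Expand tabs in the leading whitespace only; the rest of each line is untouched."""
    fixed = []
    for line in source.splitlines():
        body = line.lstrip()
        indent = line[:len(line) - len(body)]
        fixed.append(indent.expandtabs(width) + body)
    return "\n".join(fixed) + "\n"


# Hypothetical sample: line 3 is indented with tabs, the others with spaces.
sample = (
    "class Demo(object):\n"
    "    def print_version(self, url):\n"
    "\t\tsplit_url = url.split('/')\n"
    "        return split_url\n"
)

print(lines_with_tab_indent(sample))                   # line numbers to look at
print(lines_with_tab_indent(tabs_to_spaces(sample)))   # nothing left after the fix
```

The tab width of 4 is an assumption; use whatever width matches the indentation of the rest of the recipe.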
|
11-04-2010, 03:05 AM | #3 |
Member
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
|
Thank you very much! (I'm a step closer to the final version of the recipe.)
I have now run into the next two "challenges", but let's take them step by step. In the recipe I use this code: Code:
def print_version(self, url):
    split_url = url.split("/")
    print 'URL1= ', split_url[1]
    print 'URL2= ', split_url[2]
    print 'URL3= ', split_url[3]
    print 'URL4= ', split_url[4]
    print 'URL5= ', split_url[5]
    print 'URL6= ', split_url[6]
    print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[6]
    print 'THIS URL WILL PRINT: ',print_url
    return print_url
So, the links to articles are like either this:
I looked through the forum, but couldn't find a similar problem or solution. I understand the logic, but don't know the syntax for it, so please help me. I hope this is possible; I just don't know how.
if split_url[6] is empty (meaning that the link has only 5 segments; or, another idea, maybe somehow check whether split_url[6] is a number?)
    print_url = 'http://…id=' + split_url[5] (create the link to the print version of the article from the 5th segment)
or else
    print_url = 'http://…id=' + split_url[6] (create the link to the print version of the article from the 6th segment)
return print_url
|
11-04-2010, 04:18 AM | #4 |
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
|
try using print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[-1]
that will give you the last split every time... |
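One caveat with `split_url[-1]`: if a link ends with a slash, the last piece is an empty string. A small sketch of a variant that walks back from the end until it finds an all-digit segment, which also covers the "check if it is a number" idea from the question. The two article URLs are invented for illustration (the thread does not show real ones), and `article_id` is a hypothetical helper. In a recipe, `print_version` would be a method taking `self` as well.

```python
PRINT_BASE = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id='


def article_id(url):
    """Return the last all-digit path segment of url, or None if there is none."""
    for segment in reversed(url.split('/')):
        if segment.isdigit():
            return segment
    return None


def print_version(url):
    """Build the print-version URL; fall back to the original URL if no ID is found."""
    found = article_id(url)
    if found is None:
        return url
    return PRINT_BASE + found


# Hypothetical article URLs with a different number of segments.
short_url = 'http://www.rtvslo.si/okolje/novice/example-title/243073'
long_url = 'http://www.rtvslo.si/zabava/glasba/novice/example-title/243073/'

print(print_version(short_url))
print(print_version(long_url))
```

Both calls should produce the same print URL, since the ID is found regardless of how many segments precede it.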
11-04-2010, 04:43 AM | #5 |
Member
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
|
Marbs - thank you! I didn't know that it can be that simple...
I busted my head with this: Code:
def print_version(self, url):
    split_url = url.split("/")
    if len(split_url[5]) == 6: # Check if segment 5 contains six digit ID of article
        print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[5]
        print 'URL 5 Print: ',print_url # show URL to printed version of article in log file
    elif len(split_url[6]) == 6: # If segment 5 has no ID then check if segment 6 contains six digit ID of article
        print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[6]
        print 'URL 6 Print: ',print_url # show URL to printed version of article in log file
    else: # If segment 5 and segment 6 contain no ID then...
        print_url = url # fall back to the original URL so print_url is always defined
        print 'URL error: ', print_url # show error message
    return print_url
Last edited by BlonG; 11-04-2010 at 04:53 AM. |
11-04-2010, 04:55 AM | #6 |
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
|
I'm happy I can help; it's not like I know that much.
|
11-04-2010, 07:30 AM | #7 |
Member
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
|
Order of tags inside HTML
@marbs: we're all learning... I'm just starting out with building recipes, and I'm happy when more experienced users help me.
OK, my final challenge for this recipe (hopefully). The print version of an article is formatted like this: http://www.rtvslo.si/index.php?c_mod...rint&id=243073
In Kindle (or the Calibre viewer) the order is:
I don't mind much about the category and the print button, but I'd like to see/read the left column first and the right column after that. So far I have figured out the HTML structure: Code:
<div id="newsbody"> |
11-04-2010, 08:00 AM | #8 |
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
|
I ran into this once: keep_only_tags keeps tags in the order you write them. What I would do is "keep only" the tags that you want, hoping that keep_only_tags is stronger than remove_tags; then I would remove the newsblocks inside newsbody. Something like this:
Code:
keep_only_tags = [
    dict(name='div', attrs={'class':'title'}),
    dict(name='div', attrs={'id':'newsbody'}),
    dict(name='div', attrs={'id':'newsblocks'}),
]
remove_tags=[
    dict(name='div', attrs={'id':'newsblocks'}),
]
Last edited by marbs; 11-04-2010 at 08:11 AM. |
11-04-2010, 10:14 AM | #9 |
Member
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
|
Hm... I was thinking of something similar.
Like this:
1. Take the original article and remove the 'newsblock' tag (basically remove the right column) > the result: article[1]
2. Take the original article and remove everything but the 'newsblock' tag (keep only the right column) > the result: article[2]
Combine in this order: article[1] + article[2]. Now... how to do that? |
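The two steps above amount to moving the right column to the end of the article. Here is a toy sketch of that idea using the standard library's `xml.etree.ElementTree`, not the Beautiful Soup parser that calibre recipes actually work with. It assumes well-formed markup and that the `newsblocks` div is a direct child of `newsbody`; the sample page is invented, and `move_right_column_last` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def move_right_column_last(newsbody):
    """Move the child div with id 'newsblocks' to the end of newsbody, in place."""
    children = list(newsbody)
    for index, child in enumerate(children):
        if child.get('id') != 'newsblocks':
            continue
        # Text that follows the div (its "tail") is left-column text; keep it in place.
        tail, child.tail = child.tail, None
        newsbody.remove(child)
        if tail:
            if index == 0:
                newsbody.text = (newsbody.text or '') + tail
            else:
                previous = children[index - 1]
                previous.tail = (previous.tail or '') + tail
        newsbody.append(child)
        break
    return newsbody


# Hypothetical, simplified page; the real markup is not shown in full in the thread.
page = ET.fromstring(
    '<div id="newsbody">'
    '<div id="newsblocks">right column</div>'
    'left column text'
    '<p>more left column</p>'
    '</div>'
)
move_right_column_last(page)
print(ET.tostring(page, encoding='unicode'))
```

The detour through `tail` matters here because the left-column text is not wrapped in its own tag, so it would otherwise travel along with the div being moved.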
11-05-2010, 02:31 AM | #11 |
Member
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
|
|
11-05-2010, 05:45 AM | #12 |
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
|
you can edit the html if you know how to do that...
|
11-06-2010, 08:25 AM | #13 |
Member
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
|
I have read the link you provided (the Beautiful Soup documentation), but couldn't come up with a solution. Maybe the answer is in this part. However, I don't understand this programming. (What tags are, or how to use tags instead of text. In the HTML structure I posted above I see a "tag" only for the right column; the left column is not in a "tag".)
The more I read, the more I have the feeling that the solution is to create two temp articles (one without the right column and one with only the right column) and then combine the two temp articles one after the other. Still, I don't know how to put this into commands in Calibre. @marbs: I know a little bit about editing HTML, but I'm not sure what you have in mind. Anyway, any help with this is much appreciated! |
11-06-2010, 02:14 PM | #14 |
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
|
I haven't had much luck with preprocess_html,
but try copying this as is into your code:
Code:
def preprocess_html(self, soup):
    newsbody= soup.find('div',attrs={'id':'newsbody'})
    newsblocks=nesbody.find('div',attrs=['id':'newsblocks'])
    newsbody.insert(-1, newsblocks)
    return soup
i thought of it again, you may want to try this instead: Code:
def preprocess_html(self, soup):
    newsblocks=soup.find('div',attrs=['id':'newsblocks'])
    soup.find('div',attrs={'id':'newsbody'}).insert(-1, newsblocks)
    return soup
Last edited by marbs; 11-06-2010 at 04:08 PM. |
11-08-2010, 02:57 AM | #15 |
Member
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
|
I'm trying it, but I always get an "invalid syntax" error on line 59. The last command, "return soup", is on line 58.
I tried adding a space, a tab, a paragraph break... without success. I guess the error is some missing "space". |
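A possible lead, offered as a sketch and not a confirmed diagnosis: both snippets posted above contain `attrs=['id':'newsblocks']`, with square brackets where a dictionary needs curly braces, and Python rejects that as invalid syntax. Whether this is the error reported on line 59 cannot be told from the thread. The first snippet also spells `newsbody` as `nesbody`, which would fail later, at run time. The code below uses the standard `ast` module to check a piece of source for syntax errors. `first_syntax_error` is a hypothetical helper, and the `CORRECTED` text is one assumed way to write the method using Beautiful Soup's `extract()` and `append()`.

```python
import ast


def first_syntax_error(source):
    """Return (line_number, message) for the first syntax error, or None if it parses."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return (err.lineno, err.msg)
    return None


# The second snippet as posted in the thread.
AS_POSTED = (
    "def preprocess_html(self, soup):\n"
    "    newsblocks=soup.find('div',attrs=['id':'newsblocks'])\n"
    "    soup.find('div',attrs={'id':'newsbody'}).insert(-1, newsblocks)\n"
    "    return soup\n"
)

# An assumed correction: braces for the dict, and the right column moved to the end.
CORRECTED = (
    "def preprocess_html(self, soup):\n"
    "    newsbody = soup.find('div', attrs={'id': 'newsbody'})\n"
    "    if newsbody is not None:\n"
    "        newsblocks = newsbody.find('div', attrs={'id': 'newsblocks'})\n"
    "        if newsblocks is not None:\n"
    "            newsblocks.extract()  # detach the right column\n"
    "            newsbody.append(newsblocks)  # re-attach it after the left column\n"
    "    return soup\n"
)

print(first_syntax_error(AS_POSTED))   # reports the line with the square brackets
print(first_syntax_error(CORRECTED))   # None: the text parses
```

The check only covers syntax; it does not prove that the corrected method produces the wanted column order on the real pages. It should also be run with the same Python version calibre uses, because statements such as `print 'x'` parse only under Python 2.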