|  11-03-2010, 04:47 AM | #1 | 
| Member  Posts: 15 Karma: 10 Join Date: Oct 2010 Location: Slovenia Device: Kindle 3G | 
				
				Help with print_url and/or split_url
			 
			
			Since I'm a newbie, I try to learn from the examples I find here. I created a recipe, but I get an "unexpected indent" error in the print_version part. The task is (or should be) simple: replace the article URL with the print-version URL. 
 When I try to add/update the recipe, I get the error mentioned above on line 56 (that's the line: return print_url). Can someone please take a look and help me out? Code:
__license__ = 'GPL v3'
__copyright__ = '2010, BlonG'
'''
www.rtvslo.si
'''
from calibre.web.feeds.news import BasicNewsRecipe
class MMCRTV(BasicNewsRecipe):
  title = u'MMC RTV'
  __author__ = u'BlonG'
# 10
  description = u"Prvi interaktivni multimedijski portal, MMC RTV Slovenija"
  oldest_article = 3
  max_articles_per_feed = 20
  encoding = 'cp1250'
  language = 'sl'
  no_stylesheets = True
  use_embedded_content = False
  cover_url = 'http://img.rtvslo.si/_static/images/rtvportal_logo.png'
# 20
  extra_css = '''
	h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
	h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
	p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
	body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
	'''
  html2lrf_options = ['--base-font-size', '10']
# 30
# keep_only_tags = [
# 	dict(name='div', attrs={'id':'contents'}),
#	dict(name='div', attrs={'class':'entry-content'}),
#	]
#  remove_tags = [
#	dict(name='div', attrs={'class':'fb_article_top'}),
#	dict(name='div', attrs={'class':'related'}),
#	dict(name='div', attrs={'class':'fb_article_foot'}),
# 40
#	dict(name='div', attrs={'class':'spreading'}),
#	dict(name='dl', attrs={'class':'ad'}),
# 	dict(name='p', attrs={'class':'report'}),
#	dict(name='div', attrs={'class':'hfeed comments'}),
#	dict(name='dl', attrs={'id':'entryPanel'}),
#	dict(name='dl', attrs={'class':'infopush ip_wide'}),
#	dict(name='div', attrs={'class':'sidebar'}),
#	dict(name='dl', attrs={'class':'bottom'}),
#	dict(name='div', attrs={'id':'footer'}),
# 50
#	]
    def print_version(self, url):
	split_url = url.split("/")
	print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' +  split_url[1]
	return print_url
    feeds = [
	(u'Vse novice', u'http://www.rtvslo.si/feeds/00.xml')
	,(u'Okolje', u'http://www.rtvslo.si/feeds/12.xml')
	,(u'Znanost in tehnologija', u'http://www.rtvslo.si/feeds/09.xml')
	,(u'Zabava', u'http://www.rtvslo.si/feeds/06.xml')
	] | 
|   |   | 
|  11-03-2010, 03:59 PM | #2 | 
| Wizard            Posts: 4,004 Karma: 177841 Join Date: Dec 2009 Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T | 
			
			You have mixed tabs and spaces.  Delete all the tabs preceding the lines after def print_version(self, url): and replace with spaces.  I use an editor (UltraEdit) that does this by default.
		 | 
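A small way to see the tabs-versus-spaces problem outside Calibre: the sketch below compiles a cut-down method twice, once with a tab-indented body under a space-indented def, and once after converting the tabs with str.expandtabs. The Demo class and the compiles helper are made up for illustration and are not part of the recipe; the check assumes Python 3, which rejects this kind of mixing outright.

```python
# Hypothetical cut-down recipe: the "def" line is indented with four spaces,
# but the body lines are indented with a tab.
mixed = (
    "class Demo(object):\n"
    "    def print_version(self, url):\n"
    "\tsplit_url = url.split('/')\n"
    "\treturn split_url[-1]\n"
)


def compiles(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, '<recipe>', 'exec')
        return True
    except SyntaxError:  # IndentationError and TabError are subclasses of SyntaxError
        return False


# expandtabs(8) turns each leading tab into eight spaces, so the body ends up
# indented deeper than the four-space "def" line, using spaces only.
spaces_only = mixed.expandtabs(8)

print(compiles(mixed))        # tab-indented body under a space-indented def
print(compiles(spaces_only))  # same code with spaces only
```

On Python 3 the first call reports False and the second True. Most editors can do the same conversion (and show whitespace characters), which is usually the quicker fix for a whole recipe.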
|   |   | 
|  11-04-2010, 03:05 AM | #3 | 
| Member  Posts: 15 Karma: 10 Join Date: Oct 2010 Location: Slovenia Device: Kindle 3G | 
			
			Thank you very much! (I'm a step closer to the final version of the recipe.) I have now run into the next two "challenges", but let's take them step by step. In the recipe I use this code: Code:
def print_version(self, url):
    split_url = url.split("/")
    print 'URL1= ', split_url[1]
    print 'URL2= ', split_url[2]
    print 'URL3= ', split_url[3]
    print 'URL4= ', split_url[4]
    print 'URL5= ', split_url[5]
    print 'URL6= ', split_url[6]
    print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[6]
    print 'THIS URL WILL PRINT: ',print_url
    return print_url
So, the links to articles look like either this: 
 I looked through the forum, but couldn't find a similar problem or solution. I understand the logic, but I don't know the syntax for it, so please help me:
if split_url[6] is empty (meaning that the link has only 5 segments; or, another idea, maybe somehow check whether split_url[6] is a number?)
    print_url = 'http://…id=' + split_url[5] (create the link to the print version of the article from the 5th segment)
or else
    print_url = 'http://…id=' + split_url[6] (create the link to the print version of the article from the 6th segment)
return print_url
I hope this is possible, I just don't know how. | 
|   |   | 
|  11-04-2010, 04:18 AM | #4 | 
| Zealot  Posts: 122 Karma: 10 Join Date: Jul 2010 Device: nook | 
			
			Try using print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[-1]. That will give you the last segment of the split every time, however many segments the link has. | 
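To make the negative index concrete, here is a minimal sketch run outside Calibre. The two article links are hypothetical (only the print URL prefix comes from this thread), and the rstrip('/') call is an added precaution that nobody in the thread asked for. In a real recipe the function would be a method taking self as its first argument.

```python
PRINT_BASE = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id='


def print_version(url):
    """Build the print URL from the last '/'-separated segment of `url`."""
    # rstrip('/') is an added precaution: a trailing slash would otherwise
    # make the last segment an empty string.
    split_url = url.rstrip('/').split('/')
    return PRINT_BASE + split_url[-1]


# Hypothetical article links: the ID sits in segment 5 of the first link
# and in segment 6 of the second.
short_link = 'http://www.rtvslo.si/slovenija/example-title/243073'
long_link = 'http://www.rtvslo.si/sport/nogomet/example-title/243074'

print(print_version(short_link))
print(print_version(long_link))
```

Because [-1] counts from the end of the list, the same line works for both link shapes without checking how many segments there are.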
|   |   | 
|  11-04-2010, 04:43 AM | #5 | 
| Member  Posts: 15 Karma: 10 Join Date: Oct 2010 Location: Slovenia Device: Kindle 3G | 
			
			Marbs - thank you! I didn't know it could be that simple... I had been racking my brain over this: Code:
    def print_version(self, url):
        split_url = url.split("/")
        if len(split_url[5]) == 6: # Check if segment 5 contains the six-digit article ID
            print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[5]
            print 'URL 5 Print: ', print_url # show the print-version URL in the log file
        elif len(split_url) > 6 and len(split_url[6]) == 6: # If segment 5 has no ID, check whether segment 6 exists and contains it
            print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[6]
            print 'URL 6 Print: ', print_url # show the print-version URL in the log file
        else: # If neither segment 5 nor segment 6 contains an ID...
            print_url = url # fall back to the original URL so the return below always has a value
            print 'URL error: ', print_url # show error message
        return print_url
Last edited by BlonG; 11-04-2010 at 04:53 AM. | 
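BlonG's earlier idea of checking whether a segment is a number can also be written directly with str.isdigit. The sketch below is only an illustration under a few assumptions: the article ID is the last all-digit segment, the example links are hypothetical, and returning the original URL when no ID is found is a design choice of this sketch, not something the thread settled on.

```python
PRINT_BASE = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id='


def print_version(url):
    """Return the print URL for the last all-digit segment of `url`."""
    # Walk the segments from the end and use the first one made only of digits.
    for segment in reversed(url.split('/')):
        if segment.isdigit():
            return PRINT_BASE + segment
    # No numeric segment found: fall back to the original article URL.
    return url


# Hypothetical links: ID in segment 5, ID in segment 6, and no ID at all.
print(print_version('http://www.rtvslo.si/slovenija/example-title/243073'))
print(print_version('http://www.rtvslo.si/sport/nogomet/example-title/243074'))
print(print_version('http://www.rtvslo.si/sport/'))
```

Unlike a length check, this does not depend on the ID having exactly six digits or on it sitting at a fixed position.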
|   |   | 
|  11-04-2010, 04:55 AM | #6 | 
| Zealot  Posts: 122 Karma: 10 Join Date: Jul 2010 Device: nook | 
			
			I'm happy I can help; it's not like I know that much.
		 | 
|   |   | 
|  11-04-2010, 07:30 AM | #7 | 
| Member  Posts: 15 Karma: 10 Join Date: Oct 2010 Location: Slovenia Device: Kindle 3G | 
				
				Order of tags inside HTML
			 
			
			@marbs: we're all learning... I'm just starting to build recipes, and I'm happy when more experienced users help me. OK, my final challenge for this recipe (hopefully). The print version of the article is formatted like this: http://www.rtvslo.si/index.php?c_mod...rint&id=243073 
 In Kindle (or the Calibre viewer) the order is: 
 I don't much mind about the category and the print button, but I'd like to read the left column first and the right column after that. So far I have figured out the HTML structure: Code: <div id="newsbody"> | 
|   |   | 
|  11-04-2010, 08:00 AM | #8 | 
| Zealot  Posts: 122 Karma: 10 Join Date: Jul 2010 Device: nook | 
			
			I ran into this once: keep_only_tags keeps the tags in the order you write them. What I would do is "keep only" the tags that you want, hoping that keep_only_tags is stronger than remove_tags, and then remove the news blocks inside the news body. Something like this: Code:
    keep_only_tags = [
        dict(name='div', attrs={'class':'title'}),
        dict(name='div', attrs={'id':'newsbody'}),
        dict(name='div', attrs={'id':'newsblocks'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'id':'newsblocks'}),
    ]
Last edited by marbs; 11-04-2010 at 08:11 AM. | 
|   |   | 
|  11-04-2010, 10:14 AM | #9 | 
| Member  Posts: 15 Karma: 10 Join Date: Oct 2010 Location: Slovenia Device: Kindle 3G | 
			
			Hm... I was thinking of something similar. Like this:
1. Take the original article and remove the 'newsblock' tag (basically remove the right column). Result: article[1].
2. Take the original article and remove everything but the 'newsblock' tag (keep only the right column). Result: article[2].
Combine them in this order: article[1] + article[2]. Now... how do I do that? | 
|   |   | 
|  11-05-2010, 02:31 AM | #11 | 
| Member  Posts: 15 Karma: 10 Join Date: Oct 2010 Location: Slovenia Device: Kindle 3G | |
|   |   | 
|  11-05-2010, 05:45 AM | #12 | 
| Zealot  Posts: 122 Karma: 10 Join Date: Jul 2010 Device: nook | 
			
			you can edit the html if you know how to do that...
		 | 
|   |   | 
|  11-06-2010, 08:25 AM | #13 | 
| Member  Posts: 15 Karma: 10 Join Date: Oct 2010 Location: Slovenia Device: Kindle 3G | 
			
			I have read the link you provided (the Beautiful Soup documentation), but couldn't come up with a solution. Maybe the answer is in this part. However, I don't understand this kind of programming (for example, how to use tags instead of text: in the HTML structure I posted above I see a "tag" only for the right column, while the left column is not in a "tag"). The more I read, the more I feel that the solution is to create two temporary articles (one without the right column and one with only the right column) and then combine them one after the other. Still, I don't know how to put this into commands in Calibre. @marbs: I know a little bit about editing HTML, but I'm not sure what you have in mind. Anyway, any help with this is much appreciated! | 
|   |   | 
|  11-06-2010, 02:14 PM | #14 | 
| Zealot  Posts: 122 Karma: 10 Join Date: Jul 2010 Device: nook | 
				
				i havent had much luck with preprocess_html
			 
			
			but try copying this as is into your code: Code:
    def preprocess_html(self, soup):
        newsbody = soup.find('div', attrs={'id':'newsbody'})
        newsblocks = newsbody.find('div', attrs={'id':'newsblocks'})
        newsbody.insert(-1, newsblocks)
        return soup
I thought about it again; you may want to try this instead: Code:
    def preprocess_html(self, soup):
        newsblocks = soup.find('div', attrs={'id':'newsblocks'})
        soup.find('div', attrs={'id':'newsbody'}).insert(-1, newsblocks)
        return soup
Last edited by marbs; 11-06-2010 at 04:08 PM. | 
|   |   | 
|  11-08-2010, 02:57 AM | #15 | 
| Member  Posts: 15 Karma: 10 Join Date: Oct 2010 Location: Slovenia Device: Kindle 3G | 
			
			I'm trying it, but I always get an "invalid syntax" error on line 59. The last command, "return soup", is on line 58. I tried adding a space, a tab, a paragraph break... without success. I guess the error is a missing "space" somewhere. | 
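The full recipe was not posted, so the cause of this particular error can only be guessed at. One pattern that typically produces "invalid syntax" in a preprocess_html method is a key: value pair written inside square brackets, as in attrs=['id':'newsblocks'], where a dict literal needs curly braces. The sketch below checks a line both ways with Python's built-in compile; the helper name syntax_error_line is made up for this illustration.

```python
# The same lookup written twice: once with square brackets around the
# key: value pair, once with the curly braces a dict literal needs.
with_brackets = "newsblocks = soup.find('div', attrs=['id':'newsblocks'])"
with_braces = "newsblocks = soup.find('div', attrs={'id': 'newsblocks'})"


def syntax_error_line(source):
    """Return the line number of the first syntax error, or None if there is none."""
    try:
        # compile() only parses the text; nothing is executed, so the
        # undefined name `soup` does not matter here.
        compile(source, '<recipe>', 'exec')
        return None
    except SyntaxError as err:
        return err.lineno


print(syntax_error_line(with_brackets))  # a line number: the brackets do not parse
print(syntax_error_line(with_braces))    # None: this line is valid Python
```

The line number in such a message marks where the parser gave up, which can be later than the actual mistake, so it is worth checking the lines just before the reported one as well.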
|   |   | 