Help with print_url and/or split_url

BlonG · 11-03-2010, 04:47 AM

I'm a newbie, so I try to learn from the examples I find here. I created a recipe, but I get an "unexpected indent" error in the print_version part.

The task should be simple: replace the article URL with the print-version URL.

FROM: ****://www.rtvslo.si/svet/republikanci-z-vecino-v-predstavniskem-domu-senat-ostaja-demokratom/243020
TO: ****://www.rtvslo.si/index.php?c_mod=news&op=print&id=243020

When I try to add/update the recipe, I get the error mentioned above at line 56 (the line: return print_url).

Can someone please take a look and help me out?

Code:

__license__ = 'GPL v3'
__copyright__ = '2010, BlonG'
'''
www.rtvslo.si
'''
from calibre.web.feeds.news import BasicNewsRecipe
class MMCRTV(BasicNewsRecipe):
  title = u'MMC RTV'
  __author__ = u'BlonG'
# 10
  description = u"Prvi interaktivni multimedijski portal, MMC RTV Slovenija"
  oldest_article = 3
  max_articles_per_feed = 20
  encoding = 'cp1250'
  language = 'sl'
  no_stylesheets = True
  use_embedded_content = False

  cover_url = 'http://img.rtvslo.si/_static/images/rtvportal_logo.png'
# 20
  extra_css = '''
	h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
	h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
	p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
	body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
	'''

  html2lrf_options = ['--base-font-size', '10']

# 30
# keep_only_tags = [
# 	dict(name='div', attrs={'id':'contents'}),
#	dict(name='div', attrs={'class':'entry-content'}),
#	]

#  remove_tags = [
#	dict(name='div', attrs={'class':'fb_article_top'}),
#	dict(name='div', attrs={'class':'related'}),
#	dict(name='div', attrs={'class':'fb_article_foot'}),
# 40
#	dict(name='div', attrs={'class':'spreading'}),
#	dict(name='dl', attrs={'class':'ad'}),
# 	dict(name='p', attrs={'class':'report'}),
#	dict(name='div', attrs={'class':'hfeed comments'}),
#	dict(name='dl', attrs={'id':'entryPanel'}),
#	dict(name='dl', attrs={'class':'infopush ip_wide'}),
#	dict(name='div', attrs={'class':'sidebar'}),
#	dict(name='dl', attrs={'class':'bottom'}),
#	dict(name='div', attrs={'id':'footer'}),
# 50
#	]

    def print_version(self, url):
	split_url = url.split("/")
	print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' +  split_url[1]
	return print_url

    feeds = [
	(u'Vse novice', u'http://www.rtvslo.si/feeds/00.xml')
	,(u'Okolje', u'http://www.rtvslo.si/feeds/12.xml')
	,(u'Znanost in tehnologija', u'http://www.rtvslo.si/feeds/09.xml')
	,(u'Zabava', u'http://www.rtvslo.si/feeds/06.xml')
	]

Starson17 · 11-03-2010, 03:59 PM

Quote:

Originally Posted by BlonG

When I try to add/update the recipe, I get the error mentioned above at line 56 (the line: return print_url).

Can someone please take a look and help me out?

You have mixed tabs and spaces. Delete all the tabs at the start of the lines after def print_version(self, url): and replace them with spaces. I use an editor (UltraEdit) that does this by default.
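For anyone without such an editor, the check and the fix can be sketched in a few lines of Python. This is only an illustration, not part of calibre: the function names `lines_indented_with_tabs` and `tabs_to_spaces` and the four-space width are assumptions made for this example.

```python
def lines_indented_with_tabs(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(' \t')
        indent = line[:len(line) - len(stripped)]
        if '\t' in indent:
            flagged.append(number)
    return flagged


def tabs_to_spaces(source, width=4):
    """Replace each leading tab with `width` spaces; tabs elsewhere are untouched."""
    fixed = []
    for line in source.splitlines():
        stripped = line.lstrip(' \t')
        indent = line[:len(line) - len(stripped)]
        fixed.append(indent.replace('\t', ' ' * width) + stripped)
    return '\n'.join(fixed)


# Hypothetical sample: the last line is indented with a tab, the rest with spaces.
sample = "class A:\n  x = 1\n    def f(self):\n\treturn 1\n"
tab_lines = lines_indented_with_tabs(sample)
```

Converting tabs only removes the mixing. As posted, the def line is also indented by four spaces while the other class attributes use two; if the real file looks like that, the def needs to be aligned with them as well.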

BlonG · 11-04-2010, 03:05 AM

Thank you very much!

(I'm a step closer to the final version of the recipe.)

I've run into two more "challenges", but let's take them one at a time.

In the recipe I use this code:

Code:

def print_version(self, url):
    split_url = url.split("/")
    print 'URL1= ', split_url[1]
    print 'URL2= ', split_url[2]
    print 'URL3= ', split_url[3]
    print 'URL4= ', split_url[4]
    print 'URL5= ', split_url[5]
    print 'URL6= ', split_url[6]
    print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[6]
    print 'THIS URL WILL PRINT: ',print_url
    return print_url

With this I found that the article ID (a six-digit number), which is needed for the print version, is not always in the same place. Usually it is at split_url[6], but sometimes it is at split_url[5].

So the article links look like one of these:

http://something1/something2/something3/something4/something5/###### (ID is sixth segment)
http://something1/something2/something3/something4/###### (ID is fifth segment)

I looked through the forum but couldn't find a similar problem or solution. I understand the logic but don't know the syntax for it, so please help me:

if split_url[6] is empty (meaning the link has only 5 segments; or, another idea, maybe somehow check whether split_url[6] is a number?)
print_url = 'http://…id=' + split_url[5] (create the link to the print version from the 5th segment)
or else
print_url = 'http://…id=' + split_url[6] (create the link to the print version from the 6th segment)
return print_url

I hope this is possible; I just don't know how.

marbs · 11-04-2010, 04:18 AM

Try using print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[-1]

That will give you the last segment every time...
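A minimal sketch of that idea, written as a plain function so it can be tried outside calibre (in a recipe it would be a method taking self). The trailing-slash handling, the isdigit() check and the fallback to the original URL are assumptions added for this example; they are not something the recipe requires.

```python
PRINT_BASE = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id='


def print_version(url):
    # Strip a trailing slash first so the last segment is never empty.
    article_id = url.rstrip('/').split('/')[-1]
    if not article_id.isdigit():
        # Assumed fallback: keep the original URL when no numeric ID is found.
        return url
    return PRINT_BASE + article_id


# The same call handles both URL shapes, however many segments precede the ID.
short_form = print_version('http://www.rtvslo.si/svet/243020')
long_form = print_version('http://www.rtvslo.si/svet/some-title/243020')
```

Negative indices count from the end of the list, so [-1] is the last segment regardless of how many come before it.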

BlonG · 11-04-2010, 04:43 AM

Marbs, thank you! I didn't know it could be that simple...

I had been racking my brain with this:

Code:

    def print_version(self, url):
        split_url = url.split("/")
        print_url = url  # fall back to the original URL if no ID is found
        if len(split_url) > 5 and len(split_url[5]) == 6:  # check if segment 5 contains the six-digit article ID
            print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[5]
            print('URL 5 Print: ' + print_url)  # show the print-version URL in the log file
        elif len(split_url) > 6 and len(split_url[6]) == 6:  # otherwise check if segment 6 contains the ID
            print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[6]
            print('URL 6 Print: ' + print_url)  # show the print-version URL in the log file
        else:  # neither segment 5 nor segment 6 contains an ID
            print('URL error: ' + url)  # show an error message with the unchanged URL
        return print_url

I used your suggestion instead; the recipe is much "cleaner".

marbs · 11-04-2010, 04:55 AM

I'm happy I can help; it's not like I know that much.

BlonG · 11-04-2010, 07:30 AM

@marbs: we're all learning...

I'm just starting out with building recipes, and I'm happy when more experienced users help me.

OK, my final challenge for this recipe (hopefully).

The print version of the article is formatted like this: http://www.rtvslo.si/index.php?c_mod...rint&id=243073

top left: Category
top right: "Print" button
left column: title, date of publishing, article/text
right column (in a box): additional images, quotes, etc.

Please follow the link above to see what I mean.

In Kindle (or the Calibre viewer) the order is:

Category
"Print" button
additional images, quotes, etc.
title, date of publishing, article/text

I don't mind much about the category and the print button, but I'd like to read the left column first and the right column after that.

So far I have figured out the HTML structure:

Code:

<div id="newsbody">
    <div id="newsblocks" class="fr tac"...>
        everything here is the content of the right column
    </div>
    from here on is the content of the left column
</div>

Is there any way to move the part <div id="newsblocks" class="fr tac"...> </div> to the end of the HTML when creating the article?

marbs · 11-04-2010, 08:00 AM

I ran into this once: keep_only_tags keeps tags in the order you write them. What I would do is "keep only" the tags you want, hoping that keep_only_tags is stronger than remove_tags, and then remove the newsblocks div inside newsbody. Something like this:

Code:

    keep_only_tags = [
        dict(name='div', attrs={'class':'title'}),
        dict(name='div', attrs={'id':'newsbody'}),
        dict(name='div', attrs={'id':'newsblocks'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'id':'newsblocks'}),
    ]

If that does not work, which it probably will not, you will need to use postprocess_html.

BlonG · 11-04-2010, 10:14 AM

Hm... I was thinking of something similar.

Like this:
1. take the original article and remove the 'newsblocks' tag (basically remove the right column) > the result: article[1]
2. take the original article and remove everything but the 'newsblocks' tag (keep only the right column) > the result: article[2]

Combine in this order: article[1] + article[2].

Now...

how do I do that?

marbs · 11-04-2010, 10:57 AM

What did my idea do?
You can edit the soup.
Read this (the Beautiful Soup documentation).

BlonG · 11-05-2010, 02:31 AM

Quote:

Originally Posted by marbs

what did my idea do?

I tested it today. It removes the right column completely.

It seems I'll have to edit the soup, so I will need more time, because I don't understand everything yet.

marbs · 11-05-2010, 05:45 AM

you can edit the html if you know how to do that...

BlonG · 11-06-2010, 08:25 AM

I have read the link you provided (the Beautiful Soup documentation), but couldn't come up with a solution. Maybe the answer is in this part. However, I don't understand this kind of programming. (What or how to use tags instead of text? In the HTML structure I posted above I see a "tag" only for the right column; the left column is not in a "tag".)

The more I read, the more I have the feeling that the solution is to create two temporary articles (one without the right column and one with only the right column) and then combine them one after the other. Still, I don't know how to put this into commands in Calibre.

@marbs: I know a little bit about editing HTML, but I'm not sure what you have in mind.

Anyway, any help with this is much appreciated!

marbs · 11-06-2010, 02:14 PM

I haven't had much luck with preprocess_html, but try copying this as is into your code:

Code:

    def preprocess_html(self, soup):
        newsbody= soup.find('div',attrs={'id':'newsbody'})
        newsblocks=nesbody.find('div',attrs=['id':'newsblocks'])
        newsbody.insert(-1, newsblocks)
        return soup

Edit: thinking about it again, you may want to try this instead:

Code:

    def preprocess_html(self, soup):
        newsblocks=soup.find('div',attrs=['id':'newsblocks'])
        soup.find('div',attrs={'id':'newsbody'}).insert(-1, newsblocks)
        return soup

tell me which one of them worked (if at all)

BlonG · 11-08-2010, 02:57 AM

I'm trying it, but I always get an "invalid syntax" error at line 59. The last command, "return soup", is on line 58.

I tried adding a space, a tab, a paragraph break...

without success. I guess the error is some missing "space".
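One thing worth checking before looking for missing spaces: in both preprocess_html snippets above, attrs=['id':'newsblocks'] wraps a key:value pair in square brackets, which Python rejects as invalid syntax. The other calls use curly braces, attrs={'id':'newsblocks'}. The first snippet also spells newsbody as nesbody on its second line.

The move itself (take the right-column div out and put it back as the last child of its parent) can be sketched with Python's standard-library xml.etree.ElementTree. This is a stand-in chosen so the example runs without calibre; it is not the Beautiful Soup API the recipe uses, and the sample markup and the helper name `move_to_end` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the print page structure.
SAMPLE = (
    '<div id="newsbody">'
    '<div id="newsblocks" class="fr tac">right column</div>'
    'left column text'
    '<p>more left column</p>'
    '</div>'
)


def move_to_end(parent, child):
    """Move `child` to the end of `parent`, leaving the text after it in place."""
    siblings = list(parent)
    index = siblings.index(child)
    # ElementTree stores text that follows an element on that element (.tail),
    # so hand it to whatever comes before the child before detaching it.
    tail = child.tail or ''
    if index == 0:
        parent.text = (parent.text or '') + tail
    else:
        previous = siblings[index - 1]
        previous.tail = (previous.tail or '') + tail
    child.tail = None
    parent.remove(child)
    parent.append(child)


newsbody = ET.fromstring(SAMPLE)
newsblocks = newsbody.find("div[@id='newsblocks']")
move_to_end(newsbody, newsblocks)
result = ET.tostring(newsbody, encoding='unicode')
```

In the recipe the same steps would be applied to the soup object passed to preprocess_html; Beautiful Soup tags have extract() and append() methods that cover the detach and re-attach steps.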
