Old 11-03-2010, 04:47 AM   #1
BlonG
Member
BlonG began at the beginning.
 
BlonG's Avatar
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
Help with print_url and/or split_url

Since I'm a newbie, I try to learn from the examples I find here. I created a recipe, but I get an "unexpected indent" error in the print_version part.

The task is (or should be) simple: replace the article URL with the print-version URL.
  • FROM: ****://www.rtvslo.si/svet/republikanci-z-vecino-v-predstavniskem-domu-senat-ostaja-demokratom/243020
  • TO: ****://www.rtvslo.si/index.php?c_mod=news&op=print&id=243020

When I try to add/update recipe, I get error mentioned above in line 56 (It's the: return print_url).

Can someone please take a look and help me out, please.

Code:
__license__ = 'GPL v3'
__copyright__ = '2010, BlonG'
'''
www.rtvslo.si
'''
from calibre.web.feeds.news import BasicNewsRecipe
class MMCRTV(BasicNewsRecipe):
  title = u'MMC RTV'
  __author__ = u'BlonG'
# 10
  description = u"Prvi interaktivni multimedijski portal, MMC RTV Slovenija"
  oldest_article = 3
  max_articles_per_feed = 20
  encoding = 'cp1250'
  language = 'sl'
  no_stylesheets = True
  use_embedded_content = False

  cover_url = 'http://img.rtvslo.si/_static/images/rtvportal_logo.png'
# 20
  extra_css = '''
	h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
	h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
	p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
	body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
	'''

  html2lrf_options = ['--base-font-size', '10']

# 30
# keep_only_tags = [
# 	dict(name='div', attrs={'id':'contents'}),
#	dict(name='div', attrs={'class':'entry-content'}),
#	]

#  remove_tags = [
#	dict(name='div', attrs={'class':'fb_article_top'}),
#	dict(name='div', attrs={'class':'related'}),
#	dict(name='div', attrs={'class':'fb_article_foot'}),
# 40
#	dict(name='div', attrs={'class':'spreading'}),
#	dict(name='dl', attrs={'class':'ad'}),
# 	dict(name='p', attrs={'class':'report'}),
#	dict(name='div', attrs={'class':'hfeed comments'}),
#	dict(name='dl', attrs={'id':'entryPanel'}),
#	dict(name='dl', attrs={'class':'infopush ip_wide'}),
#	dict(name='div', attrs={'class':'sidebar'}),
#	dict(name='dl', attrs={'class':'bottom'}),
#	dict(name='div', attrs={'id':'footer'}),
# 50
#	]

    def print_version(self, url):
	split_url = url.split("/")
	print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' +  split_url[1]
	return print_url

    feeds = [
	(u'Vse novice', u'http://www.rtvslo.si/feeds/00.xml')
	,(u'Okolje', u'http://www.rtvslo.si/feeds/12.xml')
	,(u'Znanost in tehnologija', u'http://www.rtvslo.si/feeds/09.xml')
	,(u'Zabava', u'http://www.rtvslo.si/feeds/06.xml')
	]
BlonG is offline   Reply With Quote
Old 11-03-2010, 03:59 PM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by BlonG View Post
When I try to add/update recipe, I get error mentioned above in line 56 (It's the: return print_url).

Can someone please take a look and help me out, please.
You have mixed tabs and spaces. Delete all the tabs preceding the lines after def print_version(self, url): and replace with spaces. I use an editor (UltraEdit) that does this by default.
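If the editor does not show whitespace, the affected lines can also be located by scanning the recipe source for tabs in the leading whitespace. This is only a minimal sketch and not part of calibre; the helper name `find_tab_indents` and the sample text are made up for illustration.

```python
def find_tab_indents(source):
    """Return the 1-based numbers of lines whose indentation contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # The indentation is everything before the first non-whitespace character.
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            flagged.append(number)
    return flagged


# Hypothetical sample: the method header uses spaces, the body uses tabs.
sample = (
    "class Demo(object):\n"
    "    def print_version(self, url):\n"
    "\tsplit_url = url.split('/')\n"
    "\treturn split_url[-1]\n"
)
print(find_tab_indents(sample))
```

Each reported line can then be re-indented with spaces so that the whole recipe uses one indentation style.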
Starson17 is offline   Reply With Quote
Old 11-04-2010, 03:05 AM   #3
BlonG
Member
BlonG began at the beginning.
 
BlonG's Avatar
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
Thank you very much! (I'm a step closer to the final version of the recipe.)

I have run into the next two "challenges", but let's take it step by step.

In the recipe I use this code:
Code:
def print_version(self, url):
    split_url = url.split("/")
    print 'URL1= ', split_url[1]
    print 'URL2= ', split_url[2]
    print 'URL3= ', split_url[3]
    print 'URL4= ', split_url[4]
    print 'URL5= ', split_url[5]
    print 'URL6= ', split_url[6]
    print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[6]
    print 'THIS URL WILL PRINT: ',print_url
    return print_url
With this I found that the article ID (a six-digit number), which is needed for the print version, is not always in the same place. Usually it is at split_url[6], but sometimes it is at split_url[5].

So, the links to articles are like either this:
  • http://something1/something2/something3/something4/something5/###### (ID is sixth segment)
  • http://something1/something2/something3/something4/###### (ID is fifth segment)

I looked through the forum but couldn't find a similar problem or solution. I understand the logic but don't know the syntax for it, so please help me:
if split_url[6] is empty (meaning that link has only 5 segments. Or - another idea - maybe somehow to check if split_url[6] is number?)
print_url = ‘http://…id=’ + split_url[5] (create link to print version of article with 5th segment)
or else
print_url = ‘http://…id=’ + split_url[6] (create link to print version of article with 6th segment)
return print_url
I hope this is possible, I just don't know how.
BlonG is offline   Reply With Quote
Old 11-04-2010, 04:18 AM   #4
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
try using print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[-1]

that will give you the last split every time...
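A minimal sketch of that idea as a standalone function, so it can be tried outside calibre. The helper name `build_print_url`, the digit check and the fallback to the original URL are assumptions added for illustration; in the recipe, print_version(self, url) would simply return the same value.

```python
# Base of the print-version URL, taken from the first post in this thread.
PRINT_BASE = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id='


def build_print_url(url):
    """Return the print-version URL for an article URL that ends in a numeric ID."""
    # rstrip('/') guards against a trailing slash, which would make the last piece empty.
    article_id = url.rstrip('/').split('/')[-1]
    if not article_id.isdigit():
        # Assumption: when no numeric ID is found, fall back to the original URL.
        return url
    return PRINT_BASE + article_id


print(build_print_url(
    'http://www.rtvslo.si/svet/republikanci-z-vecino-v-predstavniskem-domu-senat-ostaja-demokratom/243020'
))
```

Because a negative index counts from the end of the list, it does not matter whether the ID sits in the fifth or the sixth segment.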
marbs is offline   Reply With Quote
Old 11-04-2010, 04:43 AM   #5
BlonG
Member
BlonG began at the beginning.
 
BlonG's Avatar
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
Marbs - thank you! I didn't know that it can be that simple...

I busted my head with this:
Code:
    def print_version(self, url):
        split_url = url.split("/")
        if len(split_url) > 5 and len(split_url[5]) == 6:  # Check if segment 5 contains the six-digit article ID
            print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[5]
            print('URL 5 Print: ' + print_url)  # show the print-version URL in the log file
        elif len(split_url) > 6 and len(split_url[6]) == 6:  # If segment 5 has no ID, check segment 6
            print_url = 'http://www.rtvslo.si/index.php?c_mod=news&op=print&id=' + split_url[6]
            print('URL 6 Print: ' + print_url)  # show the print-version URL in the log file
        else:  # Neither segment contains an ID: report it and fall back to the original URL
            print_url = url
            print('URL error: ' + url)
        return print_url
I used your suggestion - much "cleaner" recipe.

Last edited by BlonG; 11-04-2010 at 04:53 AM.
BlonG is offline   Reply With Quote
Old 11-04-2010, 04:55 AM   #6
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
i'm happy i can help, its not like i know that much.
marbs is offline   Reply With Quote
Old 11-04-2010, 07:30 AM   #7
BlonG
Member
BlonG began at the beginning.
 
BlonG's Avatar
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
Order of tags inside HTML

@marbs: we're all learning... I'm just starting with building recipes and I'm happy when more experienced users help me.



OK, my final challenge for this recipe (hopefully).

The print version of the article is formatted like this: http://www.rtvslo.si/index.php?c_mod...rint&id=243073
  1. top left: Category
  2. top right: "Print" button
  3. left column: title, date of publishing, article/text
  4. right column (in a box): additional images, quotes, etc.
Please follow link above to see what I mean.

In Kindle (or Calibre viewer) order is:
  1. Category
  2. "Print" button
  3. additional images, quotes, etc.
  4. title, date of publishing, article/text

I don't mind the category and the print button much, but I'd like to read the left column first and the right column after it.

So far I figured out the HTML structure:
Code:
<div id="newsbody">
<div id="newsblocks" class="fr tac"...>
everything here is content of right column
</div> from here is the content of left column
</div>
Is there any way to move the part <div id="newsblocks" class="fr tac"...> </div> to the end of HTML when creating article?
BlonG is offline   Reply With Quote
Old 11-04-2010, 08:00 AM   #8
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
I ran into this once: keep_only_tags keeps tags in the order you write them. What I would do is "keep only" the tags that you want, hoping that keep_only_tags is stronger than remove_tags, and then remove the newsblocks inside newsbody. Something like this:
Code:
    keep_only_tags = [
        dict(name='div', attrs={'class':'title'}),
        dict(name='div', attrs={'id':'newsbody'}),
        dict(name='div', attrs={'id':'newsblocks'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'id':'newsblocks'}),
    ]
If that does not work, which it probably will not, you need to use postprocess_html.

Last edited by marbs; 11-04-2010 at 08:11 AM.
marbs is offline   Reply With Quote
Old 11-04-2010, 10:14 AM   #9
BlonG
Member
BlonG began at the beginning.
 
BlonG's Avatar
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
Hm... I was thinking of something similar.

Like this:
1. Take the original article and remove the 'newsblocks' tag (basically, remove the right column). Result: article[1]
2. Take the original article and remove everything but the 'newsblocks' tag (keep only the right column). Result: article[2]

Combine in this order: article[1] + article[2].

Now... how to do that?
BlonG is offline   Reply With Quote
Old 11-04-2010, 10:57 AM   #10
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
what did my idea do?
you can edit the soup.
read this
marbs is offline   Reply With Quote
Old 11-05-2010, 02:31 AM   #11
BlonG
Member
BlonG began at the beginning.
 
BlonG's Avatar
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
Quote:
Originally Posted by marbs View Post
what did my idea do?
I tested it today. It removes the right column completely.

It seems I'll have to edit the soup, so I will need more time, because I don't understand everything yet.
BlonG is offline   Reply With Quote
Old 11-05-2010, 05:45 AM   #12
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
you can edit the html if you know how to do that...
marbs is offline   Reply With Quote
Old 11-06-2010, 08:25 AM   #13
BlonG
Member
BlonG began at the beginning.
 
BlonG's Avatar
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
I have read the link you provided (the Beautiful Soup documentation), but couldn't come up with a solution. Maybe the answer is in this part. However, I don't understand this kind of programming (what tags are, or how to use them instead of text). In the HTML structure I posted above I see a "tag" only for the right column; the left column is not in a "tag".

The more I read, the more I have the feeling that the solution is to create two temporary articles (one without the right column and one with only the right column) and then combine them one after the other. Still, I don't know how to put this into commands in Calibre.

@marbs: I know a little bit about editing HTML, but I'm not sure what you have in mind.

Anyway, any help with this is much appreciated!
BlonG is offline   Reply With Quote
Old 11-06-2010, 02:14 PM   #14
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
I haven't had much luck with preprocess_html,

but try copying this as is into your code:

Code:
    def preprocess_html(self, soup):
        newsbody= soup.find('div',attrs={'id':'newsbody'})
        newsblocks=nesbody.find('div',attrs=['id':'newsblocks'])
        newsbody.insert(-1, newsblocks)
        return soup
Edit:
I thought about it again; you may want to try this instead:

Code:
    def preprocess_html(self, soup):
        newsblocks=soup.find('div',attrs=['id':'newsblocks'])
        soup.find('div',attrs={'id':'newsbody'}).insert(-1, newsblocks)
        return soup
tell me which one of them worked (if at all)

Last edited by marbs; 11-06-2010 at 04:08 PM.
marbs is offline   Reply With Quote
Old 11-08-2010, 02:57 AM   #15
BlonG
Member
BlonG began at the beginning.
 
BlonG's Avatar
 
Posts: 15
Karma: 10
Join Date: Oct 2010
Location: Slovenia
Device: Kindle 3G
I'm trying it, but I always get an "invalid syntax" error at line 59. The last statement, "return soup", is on line 58.

I tried adding a space, a tab, a paragraph break... without success. I guess the error is some missing "space".
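
The thread does not confirm the cause, but the snippets in post #14 contain problems of their own that are worth ruling out first. `attrs=['id':'newsblocks']` uses square brackets where a dictionary needs braces, which Python rejects as a syntax error. `nesbody` is a misspelling of `newsbody`. And `insert(-1, ...)` may place the element before the last child rather than after it. Below is a hedged sketch of the intended reordering, assuming the soup object offers `find`, `extract` and `append` as Beautiful Soup tags do; the function name `move_newsblocks_to_end` is made up for illustration.

```python
def move_newsblocks_to_end(soup):
    """Move <div id="newsblocks"> to the end of <div id="newsbody">."""
    newsbody = soup.find('div', attrs={'id': 'newsbody'})
    if newsbody is None:
        return soup  # the page layout differs; leave it untouched
    newsblocks = newsbody.find('div', attrs={'id': 'newsblocks'})
    if newsblocks is not None:
        newsblocks.extract()         # detach the right column from its current position
        newsbody.append(newsblocks)  # re-attach it after the left-column content
    return soup
```

Inside the recipe class, preprocess_html(self, soup) could then return move_newsblocks_to_end(soup), with every line of the method indented with spaces only.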
BlonG is offline   Reply With Quote