Old 09-29-2010, 06:46 AM   #1
kinurev
Junior Member
kinurev began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Sep 2010
Location: Brisbane, AU
Device: Kindle
Dealing with double quotes " in URL

Hi guys,

I am totally new to recipes. Last night I tried to create a recipe to fetch Vietnamese news from this website http://tuoitre.vn/Rss/Index.html

I think the recipe works fine until:

Quote:
Could not fetch link http://tuoitre.vn/Van-hoa-Giai-tri/4...u-thuong”.html
Traceback (most recent call last):
File "site-packages/calibre/web/fetch/simple.py", line 422, in process_links
File "site-packages/calibre/web/fetch/simple.py", line 221, in fetch_url
FetchError: Bad Request
My guess is that the culprit is the double-quote character in the URL. Can any of you please help me with this? Thanks a lot.

Below is my recipe:

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class AdvancedUserRecipe1285594488(BasicNewsRecipe):
	title = u'Tuoi Tre News'
	__author__             = 'kinurev'
	description = 'News from Tuoitre in Vietnamese. '
	timefmt = ' [%a, %d %b, %Y]'
	oldest_article = 7
	max_articles_per_feed = 20
	no_stylesheets         = True
	#delay                  = 1
	use_embedded_content   = False
	encoding               = 'utf8'
	publisher              = 'Tuoitre'
	category               = 'news, Vietnam'
	language               = 'vi'
	publication_type       = 'newsportal'
	extra_css              = 'body{font-family: Verdana, Helvetica, Arial, sans-serif} .pHead{ font-size: medium; color: #5F5F5F; font-weight: bold } .pTitle{ font-size: large; font-weight: bold; margin-top: 0 }'
	preprocess_regexps = [
							(re.compile(r'<P class=pBody>------------------------------.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</body>'),
						]
	remove_tags_before = dict(id='divContent')
	remove_tags_after = dict(id='divContent')
	remove_attributes = ['width','height']

	feeds          = [
						(u'Ch\xednh tr\u1ecb  - X\xe3 h\u1ed9i', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=3'), 
						(u'Th\u1ebf gi\u1edbi', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=2'), 
						(u'Nh\u1ecbp s\u1ed1ng tr\u1ebb', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=7'), 
						(u'Gi\xe1o d\u1ee5c', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=13'), 
						(u'Th\u1ec3 thao', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=14'), 
						(u'V\u0103n h\xf3a  - Gi\u1ea3i tr\xed', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=10'), 
						(u'Nh\u1ecbp s\u1ed1ng s\u1ed1', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=16')
					]
Old 09-29-2010, 07:10 AM   #2
Mike L
Wizard
Mike L ought to be getting tired of karma fortunes by now.
Posts: 1,417
Karma: 3818575
Join Date: Apr 2009
Location: Edinburgh, Scotland
Device: Kindle 3, Nexus 7
The usual way of dealing with it would be to use &quot; in place of the double-quote. But I'm not a Calibre expert, so can't be sure if it would work in this case.
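
For readers comparing the two escapes discussed in this thread, here is a minimal sketch, assuming Python 3 and only the standard library (the thread itself ran on the Python 2.6 bundled with calibre 0.7.20, where `quote` lives in `urllib`). The sample string is made up for illustration. `&quot;` is an HTML entity, meant for text inside an HTML page, while `%22` is the percent-encoded form used inside a URL.

```python
import html
from urllib.parse import quote

# Illustrative sample only: a fragment containing ASCII double quotes.
raw = '"canh-giu"'

# HTML entity escaping: for text placed inside an HTML document.
html_form = html.escape(raw)

# Percent-encoding: for characters placed inside a URL.
url_form = quote(raw)

print(html_form)  # &quot;canh-giu&quot;
print(url_form)   # %22canh-giu%22
```

A web server expects the second form in a request path, which is probably why `&quot;` is less likely to help when the problem is a failing fetch.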
Old 09-29-2010, 02:27 PM   #3
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Maybe this will work:
Code:
return url.replace('"', '%22')
Old 09-30-2010, 06:39 AM   #4
kinurev
Quote:
Originally Posted by TonytheBookworm View Post
Maybe this will work:
Code:
return url.replace('"', '%22')
Tried it, and got this:

Quote:
Python function terminated unexpectedly: ("'return' outside function", ('/var/folders/ry/ry9uM5AmH88g7YprtmDiOE+++TI/-Tmp-/calibre_0.7.20_tmp_id3zFW/calibre_0.7.20_O6WgL4_recipes/recipe0.py', 27, None, 'return url.replace(\'\\"\', \'\\%22\')\n'))
Traceback (most recent call last):
File "/Applications/calibre.app/Contents/Resources/Python/lib/python2.6/site.py", line 147, in main
return run_entry_point()
File "/Applications/calibre.app/Contents/Resources/Python/lib/python2.6/site.py", line 116, in run_entry_point
return getattr(pmod, func)()
File "site-packages/calibre/ebooks/conversion/cli.py", line 254, in main
File "site-packages/calibre/ebooks/conversion/plumber.py", line 832, in run
File "site-packages/calibre/customize/conversion.py", line 211, in __call__
File "site-packages/calibre/web/feeds/input.py", line 68, in convert
File "site-packages/calibre/web/feeds/recipes/__init__.py", line 47, in compile_recipe
File "/var/folders/ry/ry9uM5AmH88g7YprtmDiOE+++TI/-Tmp-/calibre_0.7.20_tmp_id3zFW/calibre_0.7.20_O6WgL4_recipes/recipe0.py", line 27
return url.replace('\"', '\%22')
SyntaxError: 'return' outside function
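
The `'return' outside function` error above means the line ended up directly in the class body instead of inside a method. A minimal sketch of one way to place it, assuming Python 3: `fix_quotes` and `RecipeSketch` are hypothetical names used only for illustration, and `RecipeSketch` stands in for the real `BasicNewsRecipe` subclass so the snippet runs without calibre. `print_version(self, url)` is a hook that calibre recipes can define to rewrite each article URL before it is fetched; whether that alone cures the Bad Request here is untested.

```python
def fix_quotes(url):
    # 'return' is legal here because it sits inside a function body.
    return url.replace('"', '%22')


class RecipeSketch(object):
    # Hypothetical stand-in for the recipe class; in a real recipe this
    # method would live inside the BasicNewsRecipe subclass.
    def print_version(self, url):
        # Hand back the URL that should actually be fetched.
        return fix_quotes(url)


print(RecipeSketch().print_version('http://example.com/a-"b"-c.html'))
```

In the recipe itself only the `print_version` method would be added, indented to the same level as the other class attributes.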
Old 09-30-2010, 04:26 PM   #5
TonytheBookworm
Maybe Kovid or Starson or someone else will chime in and answer this for you and me. I don't see why the code below wouldn't work, but that's not saying it does either.

Code:
def preprocess_html(self, soup):
	# Only visit anchors that actually carry an href attribute.
	for a in soup.findAll('a', href=True):
		# Plain string replacement: swap each literal " for its URL escape.
		a['href'] = a['href'].replace('"', '%22')
	return soup


Basically, the code above SHOULD look through all the anchor tags (links) in your soup and replace every " inside the href with %22, which is the URL percent-encoding for a double quote. Again, this may not work: I didn't have anything to test it on other than your recipe, and the recipe didn't generate any links containing ", so I wasn't really able to test it. Give it a shot and see what happens for you.
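
One thing worth noting: the failing link in the first post appears to end in a curly quote (”) rather than a plain ", so a replace on '"' alone would not touch it. Below is a minimal sketch of a more general escape, assuming Python 3 and the standard library only. `escape_url` and `URL_SAFE` are hypothetical names, and the example.com URLs are placeholders, not links from the site.

```python
from urllib.parse import quote

# Characters that already have a meaning in a URL and are left untouched.
# '%' is included so that existing escapes such as %22 are not encoded twice.
URL_SAFE = ":/?#[]@!$&'()*+,;=%"


def escape_url(url):
    """Percent-encode every character outside URL_SAFE and the unreserved set."""
    # quote() encodes non-ASCII characters as their UTF-8 bytes.
    return quote(url, safe=URL_SAFE)


plain = escape_url('http://example.com/news/a-"b"-c.html')
curly = escape_url('http://example.com/news/a-\u201db\u201d.html')

print(plain)  # http://example.com/news/a-%22b%22-c.html
print(curly)  # http://example.com/news/a-%E2%80%9Db%E2%80%9D.html
```

Inside a recipe, the same helper could be applied to each href in `preprocess_html` or to the article URL in `print_version`. Whether the server accepts the encoded form is something only a test against a real link can confirm.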
Old 10-01-2010, 09:36 AM   #6
Mike L
Did you try my suggestion of using &quot;? I don't know if it will work, but surely it's worth a try.
Old 10-03-2010, 09:57 AM   #7
kinurev
Thanks, TonytheBookworm, for helping me with this. The script seems to work for now, but like you said, just when I needed a URL with double quotes to test against, I could not find one.

Well, good news, while writing this I found out that the link

http://tuoitre.vn/Chinh-tri-Xa-hoi/4...-cay-canh.html

and the link

http://tuoitre.vn/Chinh-tri-Xa-hoi/403734/Kiem-lam-va-cong-an-"canh-giu"-doan-xe-tai-cho-cay-canh.html

both worked in my browser (Chrome), and that the script worked fine irrespective of the code you suggested. It seems that the problem solved itself (hopefully for good). I honestly don't know how it happened but thanks a lot for your help anyway. I'll still keep your code in the script, just in case.


@Mike L: thanks for your suggestion as well, but I have very little knowledge of Python, so I just don't know how to use &quot;.

