Old 09-29-2010, 06:46 AM   #1
kinurev
Dealing with double quotes " in URL

Hi guys,

I am totally new to recipes. Last night I tried to create a recipe to fetch Vietnamese news from this website http://tuoitre.vn/Rss/Index.html

I think the recipe works fine until:

Quote:
Could not fetch link http://tuoitre.vn/Van-hoa-Giai-tri/4...u-thuong”.html
Traceback (most recent call last):
File "site-packages/calibre/web/fetch/simple.py", line 422, in process_links
File "site-packages/calibre/web/fetch/simple.py", line 221, in fetch_url
FetchError: Bad Request
My guess is that the culprit is the double quote character in the URL. Can anyone please help me with this? Thanks a lot.

Below is my recipe:

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class AdvancedUserRecipe1285594488(BasicNewsRecipe):
	title = u'Tuoi Tre News'
	__author__             = 'kinurev'
	description = 'News from Tuoitre in Vietnamese. '
	timefmt = ' [%a, %d %b, %Y]'
	oldest_article = 7
	max_articles_per_feed = 20
	no_stylesheets         = True
	#delay                  = 1
	use_embedded_content   = False
	encoding               = 'utf8'
	publisher              = 'Tuoitre'
	category               = 'news, Vietnam'
	language               = 'vi'
	publication_type       = 'newsportal'
	extra_css              = 'body{font-family: Verdana, Helvetica, Arial, sans-serif} .pHead{ font-size: medium; color: #5F5F5F; font-weight: bold } .pTitle{ font-size: large; font-weight: bold; margin-top: 0 }'
	preprocess_regexps = [
							(re.compile(r'<P class=pBody>------------------------------.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</body>'),
						]
	remove_tags_before = dict(id='divContent')
	remove_tags_after = dict(id='divContent')
	remove_attributes = ['width','height']

	feeds          = [
						(u'Ch\xednh tr\u1ecb  - X\xe3 h\u1ed9i', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=3'), 
						(u'Th\u1ebf gi\u1edbi', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=2'), 
						(u'Nh\u1ecbp s\u1ed1ng tr\u1ebb', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=7'), 
						(u'Gi\xe1o d\u1ee5c', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=13'), 
						(u'Th\u1ec3 thao', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=14'), 
						(u'V\u0103n h\xf3a  - Gi\u1ea3i tr\xed', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=10'), 
						(u'Nh\u1ecbp s\u1ed1ng s\u1ed1', u'http://tuoitre.vn/RssFeeds.aspx?ChannelID=16')
					]
Old 09-29-2010, 07:10 AM   #2
Mike L
The usual way of dealing with it would be to use &quot; in place of the double-quote. But I'm not a Calibre expert, so can't be sure if it would work in this case.
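For reference, &quot; and %22 are two different escapes for the same character: the first is an HTML entity (used inside page markup), the second is URL percent-encoding (used inside the address sent to the server). A minimal sketch with the Python 3 standard library, not calibre-specific; the sample string is made up:

```python
import html
from urllib.parse import quote

sample = 'say "hi"'  # made-up sample text

# HTML escaping: for text placed inside an HTML document
as_html = html.escape(sample)

# Percent-encoding: for text placed inside a URL
as_url = quote(sample)

print(as_html)  # say &quot;hi&quot;
print(as_url)   # say%20%22hi%22
```

A "Bad Request" response is a reaction to the URL itself, so percent-encoding is probably the more relevant of the two forms for a fetch problem like this one.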
Old 09-29-2010, 02:27 PM   #3
TonytheBookworm
Maybe this will work:
Code:
return url.replace('\"', '\%22')
Old 09-30-2010, 06:39 AM   #4
kinurev
Quote:
Originally Posted by TonytheBookworm View Post
maybe this will work:
Code:
return url.replace('\"', '\%22')
Tried it, and got this:

Quote:
Python function terminated unexpectedly: ("'return' outside function", ('/var/folders/ry/ry9uM5AmH88g7YprtmDiOE+++TI/-Tmp-/calibre_0.7.20_tmp_id3zFW/calibre_0.7.20_O6WgL4_recipes/recipe0.py', 27, None, 'return url.replace(\'\\"\', \'\\%22\')\n'))
Traceback (most recent call last):
File "/Applications/calibre.app/Contents/Resources/Python/lib/python2.6/site.py", line 147, in main
return run_entry_point()
File "/Applications/calibre.app/Contents/Resources/Python/lib/python2.6/site.py", line 116, in run_entry_point
return getattr(pmod, func)()
File "site-packages/calibre/ebooks/conversion/cli.py", line 254, in main
File "site-packages/calibre/ebooks/conversion/plumber.py", line 832, in run
File "site-packages/calibre/customize/conversion.py", line 211, in __call__
File "site-packages/calibre/web/feeds/input.py", line 68, in convert
File "site-packages/calibre/web/feeds/recipes/__init__.py", line 47, in compile_recipe
File "/var/folders/ry/ry9uM5AmH88g7YprtmDiOE+++TI/-Tmp-/calibre_0.7.20_tmp_id3zFW/calibre_0.7.20_O6WgL4_recipes/recipe0.py", line 27
return url.replace('\"', '\%22')
SyntaxError: 'return' outside function
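The SyntaxError above comes from Python itself rather than from calibre: a return statement is only valid inside a function or method body, and here the line sat directly in the class body. A minimal sketch of one possible fix, assuming the replacement was meant to live in a recipe's print_version(self, url) hook; RecipeSketch is a hypothetical stand-in so the snippet runs without calibre, and the example.com URL is made up:

```python
class RecipeSketch(object):
    """Hypothetical stand-in for a BasicNewsRecipe subclass."""

    def print_version(self, url):
        # 'return' is legal here because it is inside a method body.
        # Replace each straight double quote with its percent-encoding.
        return url.replace('"', '%22')


recipe = RecipeSketch()
print(recipe.print_version('http://example.com/a-"quoted"-title.html'))
# http://example.com/a-%22quoted%22-title.html
```

Note that the replacement string here is '%22' with no backslash; in Python, '\%22' keeps the backslash and would insert it into the URL.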
Old 09-30-2010, 04:26 PM   #5
TonytheBookworm
Maybe Kovid or Starson or someone else will chime in and answer this for you and me. I don't see why the code below wouldn't work, but that's not to say it does.
Code:
def preprocess_html(self, soup):
	# Percent-encode double quotes in the href of every link
	for a in soup.findAll('a', href=True):
		a['href'] = a['href'].replace('"', '%22')
	return soup


Basically, the code above should go through all anchor tags (links) in your soup and replace every instance of " inside the href attribute with %22, which is the URL percent-encoding for a double quote. Again, this may not work: I had nothing to test it on other than your recipe, and the recipe didn't generate any links containing ", so I wasn't really able to test it. Give it a shot and see what happens for you.
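A slightly more general alternative to a single replace() call is to percent-encode the whole path, which also covers non-ASCII quotes such as the curly ” that appears in the failing URL in the first post. This is only a sketch using the Python 3 standard library (Python 2, which calibre 0.7.x shipped with, has similarly named functions in the urllib and urlparse modules, though unicode input may need encoding to UTF-8 first); escape_url_path is a hypothetical helper name, not part of calibre, and the example.com URLs are made up:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url_path(url):
    """Percent-encode unsafe characters in the path part of a URL."""
    parts = urlsplit(url)
    # Keep '/' as a separator and '%' so existing escapes are not doubled.
    safe_path = quote(parts.path, safe='/%')
    return urlunsplit((parts.scheme, parts.netloc, safe_path,
                       parts.query, parts.fragment))


print(escape_url_path('http://example.com/news/a-"quoted"-title.html'))
# http://example.com/news/a-%22quoted%22-title.html
print(escape_url_path('http://example.com/news/curly-\u201dquote.html'))
# http://example.com/news/curly-%E2%80%9Dquote.html
```

Inside preprocess_html, the same helper could be applied to each a['href'] in place of the replace() call.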
Old 10-01-2010, 09:36 AM   #6
Mike L
Did you try my suggestion of using &quot;? I don't know if it will work, but surely it's worth a try.
Old 10-03-2010, 09:57 AM   #7
kinurev
Thanks, TonytheBookworm, for helping me with this. The script seems to work for now but, like you said, just when I needed a URL with double quotes to test with, I could not find one.

Well, good news, while writing this I found out that the link

http://tuoitre.vn/Chinh-tri-Xa-hoi/4...-cay-canh.html

and the link

http://tuoitre.vn/Chinh-tri-Xa-hoi/403734/Kiem-lam-va-cong-an-"canh-giu"-doan-xe-tai-cho-cay-canh.html

both worked in my browser (Chrome), and that the script worked fine irrespective of the code you suggested. It seems that the problem solved itself (hopefully for good). I honestly don't know how it happened but thanks a lot for your help anyway. I'll still keep your code in the script, just in case.


@Mike L: thanks for your suggestion as well, but I have very little knowledge of Python so I just don't know how to use &quot;.

