MobileRead Forums - View Single Post

kiwidude · 02-27-2011, 07:13 AM

In my Goodreads metadata plugin, a user reported an issue which I traced to the scraping of a numeric value. On his machine, which has the OS set to English but with German number settings, the value returned via lxml and Calibre's 'browser' object has no period in it. In particular, a rating value of, say, "3.42" comes back from the html.tostring(node, 'text', encoding=unicode) call as "342".

Interestingly, when he views the web page in his internet browser and sends me the HTML from that, it displays the number as "3.42". So I suspect that either the html.tostring() call or the browser/lxml libraries are responsible for the different result, not some sort of regionalisation on the Goodreads website.

I have a crude workaround, but I'm sure there must be a "proper" way of handling this that would also cater for other regional number settings, such as commas instead of periods?
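For illustration only, here is a minimal sketch of one locale-independent way to parse a scraped rating. It is not the workaround mentioned above and not part of calibre or lxml. The helper name parse_scraped_number is hypothetical, and it assumes that the last '.' or ',' in the number is the decimal mark and that any earlier ones are grouping separators. It deliberately avoids the process locale, so the OS number settings should not affect the result.

```python
import re

# Hypothetical helper for illustration; not part of calibre or lxml.
# Matches digits optionally interleaved with '.' or ',' separators.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def parse_scraped_number(text):
    """Return the first number found in text as a float, or None.

    Assumption: the last '.' or ',' is the decimal mark and any earlier
    separators are grouping marks. No locale settings are consulted.
    """
    match = _NUMBER.search(text)
    if match is None:
        return None
    token = match.group(0)
    # Position of the last separator, or -1 if the token is all digits.
    pos = max(token.rfind("."), token.rfind(","))
    if pos == -1:
        return float(token)
    # Drop grouping separators from the integer part, keep the fraction.
    integer_part = re.sub(r"[.,]", "", token[:pos])
    fraction_part = token[pos + 1:]
    return float(integer_part + "." + fraction_part)


if __name__ == "__main__":
    for sample in ("3.42", "3,42", "1.234,56", "avg rating 3.42"):
        print(sample, "->", parse_scraped_number(sample))
```

Two caveats: a value like "1,234" is ambiguous and this sketch reads it as 1.234, which is acceptable for a rating but not for a count. It also cannot recover a value that has already lost its separator, so "342" parses as 342.0; that case would still need to be fixed at the point where the text is extracted.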

#1 · kiwidude (Calibre Plugins Developer) · Handling of region specific web scraping