Old 02-27-2011, 07:13 AM   #1
kiwidude
calibre/Sigil Developer
Handling of region-specific web scraping

A user of my Goodreads metadata plugin reported an issue that I traced to the scraping of a numeric value. His machine has the OS language set to English but uses German number settings, and the value returned via lxml and Calibre's 'browser' object has no period in it. Specifically, a rating of, say, "3.42" comes back from the html.tostring(node, 'text', encoding=unicode) call as "342".

Interestingly, when he views the same page in his own web browser and sends me the HTML from that, the number appears as "3.42". So I suspect that either the html.tostring() call or the browser/lxml libraries are responsible for the different result, rather than some sort of regionalisation on the Goodreads website.
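One way to narrow this down might be to have the user report which numeric settings the Python process itself is running with. The snippet below is only a rough sketch using the standard locale module; numeric_locale_report is a made-up helper name, and it says nothing about what lxml or the browser object do internally.

```python
import locale


def numeric_locale_report():
    """Return the numeric-formatting settings the current process reports.

    Hypothetical diagnostic helper, not part of calibre or lxml.
    """
    conv = locale.localeconv()
    return {
        # Locale currently active for number formatting in this process
        "lc_numeric": locale.getlocale(locale.LC_NUMERIC),
        # Characters the C library would use when formatting numbers
        "decimal_point": conv["decimal_point"],
        "thousands_sep": conv["thousands_sep"],
    }


report = numeric_locale_report()
print(report)
```

If decimal_point comes back as a comma on the affected machine, that would at least suggest the process has picked up the German number settings.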

I have a crude workaround, but I'm sure there must be a "proper" way of handling this that would also cater for other regional number settings, such as commas instead of periods?
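To illustrate the kind of handling I mean, here is a minimal sketch that accepts either a period or a comma as the decimal separator without depending on the machine's regional settings. It is not the workaround currently in the plugin; parse_rating is a hypothetical helper, and it assumes the scraped text contains a single rating with no thousands separators.

```python
import re

# One run of digits, optionally followed by '.' or ',' and more digits
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")


def parse_rating(text):
    """Parse a scraped rating such as '3.42' or '3,42' into a float.

    Hypothetical helper. Returns None when no number is found.
    """
    match = _NUMBER_RE.search(text)
    if match is None:
        return None
    # float() always expects '.', regardless of OS regional settings,
    # so normalise a decimal comma before converting.
    return float(match.group(0).replace(",", "."))


print(parse_rating("3.42"))
print(parse_rating("3,42 avg rating"))
```

Note that this cannot recover a separator that has already been lost: "342" simply parses as 342.0. Since a Goodreads rating should never exceed 5, a range check on the result could at least flag that case.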