02-27-2011, 06:13 AM | #1 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Handling of region specific web scraping
In my Goodreads metadata plugin I had a user report an issue which I traced to the scraping of a numeric value. On his machine, which has the OS set to English but uses German number settings, the value returned via lxml and Calibre's 'browser' object has no period in it. In particular, a rating of, say, "3.42" comes back from the html.tostring(node, 'text',encoding=unicode) call as "342".
Interestingly, when he views the web page in his internet browser and sends me the html from that, the number is shown as "3.42". So I suspect that either the html.tostring() call or the browser/lxml libraries are responsible for the different result, rather than some sort of regionalisation on the Goodreads website. I have a crude workaround, but I'm sure there must be a "proper" way of handling this which would also cater for other regional number settings, such as commas instead of periods? |
02-27-2011, 06:47 AM | #2 |
Grand Sorcerer
Posts: 11,741
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
None that I know of. I have been frustrated by this before. For example, in France, the number "1,234.56" is written as "1 234,56".
Is the number perhaps being stored as an integer multiplied by 100? That would let Goodreads avoid all the problems of converting back and forth. They get the number, divide by 100, then hand it to localization. |
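For illustration, the French example above could be parsed with a small helper where the caller states the separators explicitly. This is only a sketch: `parse_localised_number` and its parameters are hypothetical names, not part of calibre or lxml, and it assumes you already know which convention the text uses.

```python
def parse_localised_number(text, decimal_sep=',', group_seps=(' ', '\u00a0', '.')):
    """Parse a number written with known decimal and grouping separators.

    Hypothetical helper: the defaults assume a French/German style layout
    such as "1 234,56". Pass other separators for other conventions.
    """
    cleaned = text.strip()
    # Drop every grouping separator (but never the decimal separator itself).
    for sep in group_seps:
        if sep != decimal_sep:
            cleaned = cleaned.replace(sep, '')
    # Normalise the decimal separator to the period that float() expects.
    cleaned = cleaned.replace(decimal_sep, '.')
    return float(cleaned)


french = parse_localised_number("1 234,56")
english = parse_localised_number("1,234.56", decimal_sep='.', group_seps=(',',))
```

It does not guess the convention from the text, so it would not by itself help with a value like "342" where the separator has already been lost.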
02-27-2011, 07:06 AM | #3 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Yeah, that France example is another case where it could go wrong.
I don't know where in the chain it is breaking down. What I don't understand is why our web browsers render exactly the same result of "3.42" on both machines, yet when the html is scraped via the Python libraries it ends up as "342" on his but "3.42" on mine. If there isn't an obviously clever way of handling this I'll just stick with a crude approach such as stripping out all non-numeric characters and then dividing by 100. |
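The crude approach mentioned above might look roughly like this. It is a sketch only: `parse_rating_crude` is a made-up name, and it assumes the site always shows the rating with exactly two decimal places, so a value shown as "4.0" or "3.4" would come out wrong.

```python
import re


def parse_rating_crude(text):
    """Strip every non-digit character, then assume two decimal places.

    Assumption: the rating is always rendered as d.dd (e.g. "3.42").
    Returns None if the text contains no digits at all.
    """
    digits = re.sub(r'\D', '', text)
    if not digits:
        return None
    return int(digits) / 100.0
```

With that assumption, "3.42", "3,42" and "342" all give the same value, which is the whole point of the workaround.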
02-27-2011, 09:45 AM | #4 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Try sending an Accept-Language header. In this case you can also work around the problem by checking the number: if it is between 10 and 100, divide it by 10, and if it is between 100 and 1000, divide it by 100.
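A rough sketch of both suggestions, using only the standard library so it can be run on its own. The function names and the header value 'en-US,en;q=0.8' are illustrative assumptions; with calibre's browser object the header would have to be set through whatever that object provides, which is not shown here.

```python
from urllib.request import Request


def build_request(url):
    """Build a request that asks the server for English-formatted content.

    Assumption: the server honours Accept-Language when formatting numbers.
    """
    return Request(url, headers={'Accept-Language': 'en-US,en;q=0.8'})


def normalise_rating(value):
    """Rescale a rating whose decimal separator was lost.

    Ratings are expected to be below 10, so anything in [10, 100) is
    treated as having lost one decimal place and anything in [100, 1000)
    as having lost two.
    """
    if 10 <= value < 100:
        return value / 10.0
    if 100 <= value < 1000:
        return value / 100.0
    return float(value)


# example.invalid is a placeholder host; no network call is made here.
req = build_request('http://example.invalid/book/show/1')
```

The range check only works because a valid rating can never legitimately be 10 or more, so it would not carry over to arbitrary numbers.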
|