MobileRead Forums > E-Book Software > Calibre > Development

02-27-2011, 07:13 AM   #1
kiwidude
calibre/Sigil Developer
Posts: 4,230
Karma: 1334002
Join Date: Oct 2010
Location: London, UK
Device: Kindle Paperwhite 3G, iPad 3, iPad Air
Handling of region specific web scraping

A user of my Goodreads metadata plugin reported an issue that I traced to the scraping of a numeric value. On his machine, where the OS language is English but the number format is set to German, the value returned via lxml and calibre's 'browser' object has no period in it. In particular, a rating of, say, "3.42" comes back from the html.tostring(node, method='text', encoding=unicode) call as "342".

Interestingly, when he views the page in his web browser and sends me the HTML from that, it shows the number as "3.42". So I suspect that either the html.tostring() call or the browser/lxml libraries are responsible for the different result, rather than some sort of regionalisation on the Goodreads website.

I have a crude workaround, but I'm sure there must be a "proper" way of handling this that would also cater for other regional number settings, such as commas instead of periods?
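One possible shape for such a "proper" approach is sketched below: a parser that accepts either a comma or a period as the decimal separator. This is only an illustration, not the plugin's actual code; the function name parse_decimal is hypothetical, it is written for Python 3, and it assumes that a lone separator is a decimal point (reasonable for ratings, wrong for a value like "1,234"). It cannot recover a value whose separator has already been lost, such as "342".

```python
import re

def parse_decimal(text):
    """Parse a number that may use ',' or '.' as its decimal separator.

    Hypothetical helper. Assumption: when only one separator appears,
    and it appears once, it is the decimal separator.
    """
    # Keep only digits, separators and a minus sign (drops spaces, NBSP, etc.).
    cleaned = re.sub(r"[^\d.,-]", "", text)
    last_dot = cleaned.rfind(".")
    last_comma = cleaned.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both present: whichever comes last is the decimal separator.
        decimal_sep = "." if last_dot > last_comma else ","
    elif cleaned.count(",") == 1:
        decimal_sep = ","
    elif cleaned.count(".") == 1:
        decimal_sep = "."
    else:
        # No separator, or one repeated: treat any separators as grouping.
        decimal_sep = None
    if decimal_sep is None:
        return float(re.sub(r"[.,]", "", cleaned))
    whole, frac = cleaned.rsplit(decimal_sep, 1)
    whole = re.sub(r"[.,]", "", whole)
    return float(whole + "." + frac)
```

With this sketch, "3.42" and "3,42" both parse to 3.42, while "342" still parses to 342.0, so the missing-separator case needs a separate fix.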
02-27-2011, 07:47 AM   #2
chaley
"chaley", not "charley"
Posts: 5,783
Karma: 1212746
Join Date: Jan 2010
Location: France
Device: Many android devices
None that I know of. I have been frustrated by this before. For example, in France, the number "1,234.56" is written as "1 234,56".

Is the number perhaps being stored as an integer * 100? That would let Goodreads avoid all the problems of converting back and forth: they take the number, divide by 100, then hand it to localization.
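To make that guess concrete, here is a small sketch of how a value stored as an integer * 100 might be formatted per locale. Everything in it is an assumption for illustration: the SEPARATORS table, the locale names and the format_scaled function are hypothetical, and nothing here is known about how Goodreads actually stores or formats ratings. It handles non-negative values only.

```python
# Hypothetical (thousands separator, decimal separator) pairs for a few locales.
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
    "fr_FR": (" ", ","),
}

def format_scaled(value_times_100, locale_name):
    """Format a non-negative integer that stores value * 100."""
    thousands_sep, decimal_sep = SEPARATORS[locale_name]
    # Integer arithmetic only, so no float rounding is involved.
    whole, hundredths = divmod(value_times_100, 100)
    grouped = f"{whole:,}".replace(",", thousands_sep)
    return f"{grouped}{decimal_sep}{hundredths:02d}"
```

Under these assumptions, the stored value 123456 would be shown as "1,234.56" for en_US and as "1 234,56" for fr_FR, matching the France example above.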
02-27-2011, 08:06 AM   #3
kiwidude
Yeah, that France example is another case where it could go wrong.

I don't know where in the chain it is breaking down. What I don't understand is why our web browsers both render exactly the same result of "3.42", yet scraping the HTML via the Python libraries gives "342" on his machine but "3.42" on mine.

If there isn't an obviously clever way of handling this, I'll just stick with a crude approach such as stripping out all non-numeric characters and then dividing by 100.
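A minimal sketch of that crude approach, for Python 3. The function name rating_from_digits is hypothetical, and the sketch assumes the rating is always shown with exactly two decimal places; a value such as "3.4" would come out wrong (0.34).

```python
import re

def rating_from_digits(text):
    """Strip every non-digit character, then divide by 100.

    Assumption: the scraped rating always has exactly two decimal places.
    """
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError("no digits found in %r" % text)
    return int(digits) / 100
```

This gives the same result whether the scraped text is "3.42", "3,42" or "342".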
02-27-2011, 10:45 AM   #4
kovidgoyal
creator of calibre
Posts: 26,312
Karma: 5382313
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Try sending an Accept-Language header. In this case you can also work around the problem by checking whether the number is between 10 and 100 and then dividing by 10, or, if it is between 100 and 1000, dividing by 100.
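Both suggestions can be sketched roughly as follows, for Python 3. This uses the standard library's urllib rather than calibre's browser object, and the URL, the header value and the names build_request and rescale_rating are all hypothetical. The range check assumes ratings never legitimately reach 10, and it treats the boundaries as 10 <= n < 100 and 100 <= n < 1000.

```python
import urllib.request

def build_request(url):
    """Build (but do not send) a request that asks for English content."""
    return urllib.request.Request(
        url,
        headers={"Accept-Language": "en-US,en;q=0.9"},  # assumed header value
    )

def rescale_rating(value):
    """Undo a lost decimal separator, assuming real ratings are below 10."""
    if 10 <= value < 100:
        return value / 10    # e.g. 34 is taken to mean 3.4
    if 100 <= value < 1000:
        return value / 100   # e.g. 342 is taken to mean 3.42
    return value

# Placeholder URL for illustration only; nothing is fetched here.
request = build_request("https://www.example.com/book/show/1")
```

Whether the header changes what the site returns is untested here; the range check is the fallback for when it does not.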
All times are GMT -4.