Old 07-15-2012, 05:00 AM   #21
kiwidude
calibre/Sigil Developer
@Perkin - thanks for giving it a whirl and for the feedback. Yeah, I briefly mentioned my thoughts about other metadata fields above to ElizabethN - there are two issues with it. The first is the extra clutter it would add to the GUI for something that is so rarely available in a usable fashion. The second is actually getting a quality source for it. From a CSV file that is no problem. From a web page, however, very few pages that display books in a list will put the series information in a reliable, structured fashion. Everything becomes very bespoke, and series data is ordinarily scraped from the individual page for a book (in fact my FF metadata plugin does not scrape the web page for it - it fires the same database query that FF uses to construct the page, which returns a JSON result). You can see just by looking at the FF page the difficulties involved - the series name is just placed in a <strong> tag that appears there "sometimes", and their HTML is not structured very nicely at all.
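
To illustrate the "sometimes there" problem, here is a rough sketch. The markup is made up and much tidier than the real thing (it is not FF's actual HTML), and `scrape_rows` is a hypothetical helper, not code from the plugin. It only shows that any list scrape has to treat the series element as optional:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified list markup - NOT the real FF page structure.
SAMPLE = """
<ul>
  <li><a>Book One</a> by Author A <strong>Some Series</strong></li>
  <li><a>Standalone Book</a> by Author B</li>
</ul>
"""

def scrape_rows(markup):
    """Return (title, series) per list row; series is None when the tag is absent."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall(".//li"):
        title = li.findtext("a")
        # The <strong> tag only appears for some rows, so this may be None.
        series = li.findtext("strong")
        rows.append((title, series))
    return rows
```

Even in this clean version there is nowhere obvious to get the series # from, which is the harder half of the problem.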

Edit - actually getting the series name is not that difficult (though I found a bug in the plugin while doing so) - it is series # that is difficult. Still experimenting...

Pubdate, on the other hand, would be easy to scrape and would at least give a reliable source instead of the all-too-frequent garbage dates we get from Worldcat through metadata download (at the cost of it only being a year - but at least it is the correct year!). However, if I were going to offer Pubdate I would "want" to do series as well.

I shall do some experimentation and see if I can figure out some new xpath combinations that would work generically for the FF screen. TBH that is probably about the only site this would work with, since most sites just list the series name/# as part of the book title, which means a regex to extract it (like on the clipboard tab) rather than xpath. That is a whole different level of additional UI complexity!
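
For the regex route, a minimal sketch might look like the following. It assumes titles of the form "Book Title (Series Name #3)", which is just one of many layouts sites use. The pattern and the `split_title` helper are hypothetical, not what the clipboard tab actually does:

```python
import re

# Assumed title layout: "Book Title (Series Name #3)" - other sites will differ.
SERIES_RE = re.compile(
    r"^(?P<title>.+?)\s*"              # the book title itself
    r"\((?P<series>[^()#]+?)\s*"       # series name inside the parentheses
    r"#(?P<index>\d+(?:\.\d+)?)\)\s*$" # series number, e.g. 3 or 2.5
)

def split_title(raw):
    """Return (title, series, series_index); series parts are None if no match."""
    m = SERIES_RE.match(raw)
    if m is None:
        return raw.strip(), None, None
    return m.group("title"), m.group("series"), float(m.group("index"))
```

Every site would need its own pattern, which is exactly where the extra UI complexity comes from.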

Last edited by kiwidude; 07-15-2012 at 05:15 AM.