07-02-2012, 09:57 PM | #16 |
reader, ebook junkie
Posts: 109
Karma: 436806
Join Date: Dec 2007
Location: western nebraska
Device: droid, kindle, kobo, eslick, sony
|
I was just thinking of the implications of importing CSV files; all the other uses mentioned would be very handy. I hadn't realized that this plugin would let us create lists from websites, what a time saver.
For example, when I find an author new to me, I'll go to their website first for their backlist. Amazingly, many authors don't list their backlist or haven't updated their website in years. So then I start compiling titles from research done with Goodreads and LibraryThing, supplemented by FictFact and Fantastic Fiction. Usually I have the calibre author profile open on one side of the screen and the pertinent website on the other side as I type away in calibre. I'm brain-dead at the moment, so I can't think of many compilation websites except those mentioned above.
Here are a few individual author sites that come to mind, as I remember typing lots of their books into calibre:
http://www.pinbeambooks.com/ebooks-y...niverse%C2%AE/
http://michellesagara.com/bibliography/
http://www.dendarii.com/inprint.html
http://www.jdrobb.com/books/allbooks.php
Some really nice authors have downloadable lists (thank you, JD Robb), but many only have titles and images. I'm assuming that if I wanted to import from a website and the book data couldn't be pulled by the plugin, the plugin would just give an error message, so that I would know that manual entry was needed. |
07-03-2012, 10:20 AM | #17 |
Calibre Plugins Developer
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@ElizabethN - thanks for the sites, I shall take a look at home where I won't be battling against work web filters. I will say that provided an author is at Fantastic Fiction then there isn't normally any need to use an author specific site. For the authors it covers the FF site is generally very good. And it should save you the data entry you mention.
Title/author is all the data it needs - in fact it is all the plugin currently allows you to extract, be it from clipboard, CSV or the web. It also supports "title only" matching, with the obvious downsides that will have. In theory I could add support for other metadata fields, but as this is not intended as a replacement for metadata download, and it would clutter the UI, title/author should be sufficient. You can let the plugin create the empty books, and then do a metadata download to get the rest of the data.
Just on your final point: if you point the plugin at a website that isn't bundled with it, then you almost certainly won't get any meaningful data from the page. Every website needs its own configuration, because every website displays different html which we scrape the data from, just like we have different metadata download plugins for different websites. The good news is that frequently a single website will use the same html template for all its pages, which is why a single configuration can cater for any author from Fantastic Fiction rather than needing a different one for each author. Creating the configuration for a website does require some xpath knowledge, so I don't anticipate every user out there diving in to do it, but the ability is there for those who want to create one and export it. The more websites I bundle with this plugin over time, the more generally useful it will be out of the box. |
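To illustrate what a per-website configuration could look like, here is a minimal sketch. It is not the plugin's actual settings format: the site name, the dictionary keys, the sample markup and the `extract_books` helper are all hypothetical. It also uses the limited XPath subset in Python's standard library on a well-formed snippet, whereas real pages would usually need a forgiving HTML parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site settings: one "row" expression that finds each book,
# plus title/author expressions evaluated relative to that row.
SITE_CONFIGS = {
    "example-bookstore": {
        "row": ".//li[@class='book']",
        "title": "./span[@class='title']",
        "author": "./span[@class='author']",
    },
}

# Made-up, well-formed markup standing in for a downloaded page.
SAMPLE_PAGE = """
<html><body><ul>
  <li class="book"><span class="title">Along Came a Spider</span>
      <span class="author">James Patterson</span></li>
  <li class="book"><span class="title">Kiss the Girls</span>
      <span class="author">James Patterson</span></li>
  <li class="advert">Not a book</li>
</ul></body></html>
"""


def extract_books(page_source, config):
    """Return (title, author) pairs for every row the config matches."""
    root = ET.fromstring(page_source)
    books = []
    for row in root.findall(config["row"]):
        title = row.findtext(config["title"], default="").strip()
        author = row.findtext(config["author"], default="").strip()
        if title:  # rows without a title are of no use for matching
            books.append((title, author))
    return books


books = extract_books(SAMPLE_PAGE, SITE_CONFIGS["example-bookstore"])
print(books)
```

Because the expressions live in data rather than code, the same function can serve any page that shares the site's template, which is the point made above about Fantastic Fiction.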
07-03-2012, 09:59 PM | #18 | |
reader, ebook junkie
|
Sites like Fantastic Fiction and Goodreads are usually my first choice for an author's backlist, as the info on an author's website can range from very detailed to just a book image. Sounds like another useful plugin, looking forward to its release. No rush though, as the problem that I've discovered with plugins is that each additional plugin increases the time I spend manipulating data, which then decreases the time I spend reading. If only I didn't feel the need to keep perfecting my library or have to sleep or work... |
|
07-14-2012, 02:05 PM | #19 |
Calibre Plugins Developer
|
v0.1 Beta
Here at last is a version for people to play with. I've updated the screenshots on the first page - quite a number of things have changed over the last month as I have refined things.
There's a lot of subtle, hidden behaviour that I won't bore people with at this point. Be sure to try right-clicking on the various grids to see some of it. Also, in most lists/grids, double-clicking on an item tends to shortcut a lot of the action.
I have also incorporated chaley's template language, so a number of the URLs, such as the Goodreads "Popular this Month" or "Popular this Year" lists, now dynamically resolve the current date into the URL.
You can customize which columns are displayed on the "Resolve" page of the wizard using the wizard's "Options" button (you must close and reopen the wizard for the change to take effect). Note that any columns you add will be read-only.
For a quick example of the workflow when using a predefined web page list setting: Spoiler:
Another example - getting books for an author via Fantastic Fiction: Spoiler:
If you want to load your own list of text from the clipboard (such as copied from a forum post or web page): Spoiler:
To load from a CSV file (such as a calibre CSV catalog file, a Goodreads export or whatever): Spoiler:
To scrape from a different website not already configured in this plugin: Spoiler:
Just have a play around and experiment - you can't harm your library in any way (at worst you will create some empty books if you choose to do so and click Finish on the last wizard page). It may be that you never actually "import" a list, and instead just use the plugin as a quick way to launch various websites from the category view of predefined sites. There are probably around 100 website pages preconfigured at this point, covering various types of lists, be they "popular", "bestsellers", "new releases", "top xxx" or indeed just bibliography style with Fantastic Fiction. I'll add to this over time - if you have a site/page not covered and want to see it included, just feel free to ask. I don't expect everyone to be bothered with figuring out xpath expressions, though it can be a fun challenge at times to do so if you are so inclined... Last edited by kiwidude; 07-15-2012 at 09:22 AM. Reason: Removing attachment as later version in this thread |
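As a rough picture of the CSV route mentioned above, the sketch below pulls only title and author out of a CSV file using Python's standard library. It is not the plugin's code. The `read_title_author` helper and the "Title"/"Author" column names are assumptions for illustration; a real export may name its columns differently, which is why they are parameters.

```python
import csv
import io

# Made-up sample data; the header names are an assumption, not a documented format.
SAMPLE_CSV = (
    "Title,Author,Rating\n"
    '"Along Came a Spider","James Patterson",4\n'
    '"Kiss the Girls","James Patterson",5\n'
    ',"Row With No Title",3\n'
)


def read_title_author(handle, title_col="Title", author_col="Author"):
    """Return (title, author) pairs, ignoring every other column."""
    rows = []
    for row in csv.DictReader(handle):
        title = (row.get(title_col) or "").strip()
        author = (row.get(author_col) or "").strip()
        if title:  # a row without a title cannot be matched to a book
            rows.append((title, author))
    return rows


csv_books = read_title_author(io.StringIO(SAMPLE_CSV))
print(csv_books)
```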
07-14-2012, 05:51 PM | #20 |
Guru
Posts: 655
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
Just gave it a try on a new author for my mother's library and found it worked very well (used the FantasticFiction route).
One extra function I'd request, even though I know it'd be difficult, would be for the import to include some other fields (Series, Series# and PubDate) as well as Author & Title, especially when getting info from FF - the info is on the page, so it's just a matter of being able to scrape and use it. Many thanks for your work on this; even as it is, it's a great time-saver. |
07-15-2012, 05:00 AM | #21 |
Calibre Plugins Developer
|
@Perkin - thanks for giving it a whirl and the feedback. Yeah I briefly mentioned my thoughts about other metadata fields above to ElizabethN - there are two issues with it. The first is the extra clutter it would add to the UI for something that is so rarely available in a usable fashion. The second is actually getting a quality source for it. From a CSV file, no problem. However, very few web pages that display books in a list will put the series information in a reliable structured fashion. Everything becomes very bespoke, and series data is ordinarily scraped from the individual page for a book (in fact my FF metadata plugin does not scrape the web page for it - it fires the same database query that FF uses to construct the page, which returns a JSON result). You can see just looking at the FF page the difficulties involved - series name is just placed in a <strong> tag that appears there "sometimes", and their HTML is not structured very nicely at all.
Edit - actually getting the series name is not that difficult (though I found a bug in the plugin while doing so) - it is series # that is difficult. Still experimenting... Pubdate on the other hand would be easy to scrape and would at least give a reliable source instead of the too frequent garbage dates we get from Worldcat through metadata download (at the cost of it only being a year - at least it is the correct year!). However if I was going to offer Pubdate I would "want" to do series as well. I shall do some experimentation and see if I can figure out some new xpath combinations that would generically work for the FF screen. TBH that is probably about the only site this would work with, since most sites will just list series name/# as part of the book title and then that means a regex to extract it (like on the clipboard tab) rather than xpath. Which is a whole different level of additional UI complexity! Last edited by kiwidude; 07-15-2012 at 05:15 AM. |
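For sites that fold the series into the title text, the regex route mentioned above might look something like this. It is only a sketch: the pattern assumes titles shaped like "Title (Series Name #2)", and both the pattern and the `split_title` helper are hypothetical rather than anything taken from the plugin.

```python
import re

# Assumed title shape: "Title (Series Name #2)". Other sites would need other patterns.
SERIES_IN_TITLE = re.compile(
    r"^(?P<title>.+?)\s*\((?P<series>[^()#]+?)\s*#(?P<index>\d+(?:\.\d+)?)\)\s*$"
)


def split_title(raw):
    """Return (title, series, series_index); series fields are None if absent."""
    text = raw.strip()
    match = SERIES_IN_TITLE.match(text)
    if match is None:
        return text, None, None
    return (
        match.group("title"),
        match.group("series").strip(),
        float(match.group("index")),
    )


print(split_title("Kiss the Girls (Alex Cross #2)"))
print(split_title("Along Came a Spider"))
```

A title with no bracketed series simply falls through unchanged, so the same function can be run over a mixed list.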
07-15-2012, 05:25 AM | #22 |
Calibre Plugins Developer
|
Ok, here is an xpath "challenge" for someone (I need to go do some other things so if someone solves it for me in the meantime I shall be happy!)... let's say you have this html:
Code:
<strong>Alex Cross</strong>
<br>
1. <a href="/p/james-patterson/along-came-spider.htm">Along Came a Spider</a>
<span class="year"> ( <a href="/years/1992.htm">1992</a> ) </span>
<br>
2. <a href="/p/james-patterson/kiss-girls.htm">Kiss the Girls</a>
<span class="year"> ( <a href="/years/1994.htm">1994</a> ) </span>
<br>
title: text()
pubdate: following-sibling::span[@class="year"]/a/text()
series name: ../strong/text()
series #: ???
For series number I thought I could do something like: preceding-sibling::text() but that doesn't give me any results. Any other suggestions?
Last edited by kiwidude; 07-15-2012 at 05:30 AM. |
07-15-2012, 07:31 AM | #23 |
Grand Sorcerer
Posts: 11,734
Karma: 6690881
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
The number is part of the parent, not a sibling, because <br> is self closing.
It isn't obvious to me how to isolate those numbers. If you know that there is a number for each title, and if the numbers are sequential, then you can do it by counting them, but I suspect there are too many 'if's involved. You might be able to do it by getting the text of the parent and counting lines. What does the parent html block look like? |
07-15-2012, 09:22 AM | #24 |
Calibre Plugins Developer
|
Hi Charles,
Yeah I had considered a fallback to auto-number the series index based on whether they have a series name. But that has a few problems - such as when FF lists a book in a series that is written with other authors and only shows the book written by that author - it would always make it "number 1" when it isn't. So it really needs the associated number off the page.
Here is the URL being parsed in the example above: http://www.fantasticfiction.co.uk/p/james-patterson/
The parent expression I am using to identify only the titles on the page that are of interest is:
//div[@class="sectionleft"]/a[contains(@href,".htm")]
You will see that unfortunately there is no true "parent" for each "row". There are just a number of div sections for each series or grouping of titles, with each title contained within an <a> tag. That is why I am using that <a> tag as my row identifier and then grabbing data relative to that.
I've attached a new version 0.2 below - this adds the Pubdate implementation and fixes a couple of bugs.
Last edited by kiwidude; 07-15-2012 at 04:41 PM. Reason: Removing attachment as later version in this thread |
07-15-2012, 09:57 AM | #25 |
Grand Sorcerer
|
The following seems to work, but I make no guarantees. It produces a list of numbers and a list of titles. The cruft in the middle is necessary to filter out ancillary text such as "aka". As far as I can tell from brief looks, the numbers and titles correspond until the numbers run out. The titles after the numbers run out seem to be anthologies or other "non-numbered" books.
This script runs with calibre-debug -e Code:
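from contextlib import closing

from lxml import html
from calibre import browser

url = 'http://www.fantasticfiction.co.uk/p/james-patterson/'
br = browser()
with closing(br.open(url, timeout=10)) as f:
    doc = html.fromstring(f.read())

# Each "sectionleft" div holds one series or grouping of titles.
for data in doc.xpath('//div[@class="sectionleft"]'):
    # Loose text nodes sitting directly inside the div; the series
    # numbers ("1. ", "2. ", ...) are among them.
    t = data.xpath('./text()')
    numbers = []
    for x in t:
        try:
            # Keep only text that parses as a number; anything else
            # (such as "aka") is skipped.
            numbers.append(int(float(x)))
        except (ValueError, OverflowError):
            pass
    # Title links that are direct children of the div.
    books = data.xpath('a[contains(@href,".htm")]/text()')
    print(len(numbers), len(books), numbers, books)
|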
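The script above collects numbers and titles as two separate lists and relies on them lining up. A different way to attack the same markup, sketched here with only the standard library, is to walk the section in document order and attach each number to the title link that follows it, so a series that shows only book 3 keeps the index 3. This is an illustration under assumptions, not the plugin's implementation: the `SectionParser` class, the sample markup and the "Example Shared Series" entries are made up, and only the Alex Cross fragment comes from the thread.

```python
import re
from html.parser import HTMLParser

# First div: the fragment quoted in the thread. Second div: hypothetical data.
SAMPLE_HTML = """
<div class="sectionleft">
<strong>Alex Cross</strong> <br>
1. <a href="/p/james-patterson/along-came-spider.htm">Along Came a Spider</a>
<span class="year"> ( <a href="/years/1992.htm">1992</a> ) </span> <br>
2. <a href="/p/james-patterson/kiss-girls.htm">Kiss the Girls</a>
<span class="year"> ( <a href="/years/1994.htm">1994</a> ) </span> <br>
</div>
<div class="sectionleft">
<strong>Example Shared Series</strong> <br>
3. <a href="/p/example-author/third-book.htm">Third Book</a> <br>
<a href="/p/example-author/unnumbered.htm">Unnumbered Book</a>
</div>
"""


class SectionParser(HTMLParser):
    """Collect (series, index, title) tuples in document order."""

    def __init__(self):
        super().__init__()
        self.books = []
        self.series = None
        self.pending_index = None  # number seen since the last <br>
        self.in_strong = False
        self.in_title_link = False
        self.span_depth = 0  # links inside <span> (the years) are ignored

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.series = None
            self.pending_index = None
        elif tag == "br":
            self.pending_index = None
        elif tag == "strong":
            self.in_strong = True
        elif tag == "span":
            self.span_depth += 1
        elif tag == "a" and self.span_depth == 0:
            href = dict(attrs).get("href") or ""
            self.in_title_link = href.endswith(".htm")

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False
        elif tag == "span":
            self.span_depth = max(0, self.span_depth - 1)
        elif tag == "a":
            self.in_title_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_strong:
            self.series = text
        elif self.in_title_link:
            self.books.append((self.series, self.pending_index, text))
            self.pending_index = None
        elif self.span_depth == 0:
            match = re.fullmatch(r"(\d+)\.", text)
            if match:
                self.pending_index = int(match.group(1))


parser = SectionParser()
parser.feed(SAMPLE_HTML)
parser.close()
for book in parser.books:
    print(book)
```

A title with no number in front of it comes out with an index of `None`, leaving it to the caller to decide what to do with unnumbered books.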
07-15-2012, 01:32 PM | #26 |
Calibre Plugins Developer
|
Hi Charles,
Thanks for that. And yeah it requires a bit of rejigging of the way I currently iterate through matches to try to accommodate it - since my previous "assumption" was that if a user specified a "row xpath" then there would only be one result for a title/author etc xpath. However on the FF site it does all have to be treated rather differently, and a "Row" is really a "section" of the document, with potentially multiple matches inside it. I'm hacking the code around to see if I can make it all work without breaking everything else, we shall see what falls out at the end... thanks again. |
07-15-2012, 01:57 PM | #27 |
Calibre Plugins Developer
|
Success...
Now I have to plumb in all the rest of the series support through the rest of the wizard... |
07-15-2012, 04:40 PM | #28 |
Calibre Plugins Developer
|
v0.3 Beta
Here is a new version with series name/index fully plumbed in along with pubdate. So you can now for instance import all the data available from a Fantastic Fiction page for an author into empty books and get "proper" publication years as well as the series information.
I've also fixed a few other bugs I found and some predefined settings that needed tweaking. This is probably close enough to a 1.0 release by my standards but I will let it sit as a beta for a while to see if anything else comes up in terms of feedback. Last edited by kiwidude; 07-15-2012 at 07:15 PM. Reason: Removing attachment as later version in this thread |
07-15-2012, 05:55 PM | #29 |
Guru
|
Reminder to users: after uninstalling older beta, restart calibre and then install the beta 0.3, then restart calibre again.
Without the restart(s) the xpath expressions weren't correctly filled in. @kiwidude, with regard to the FF import, I tried it on a few authors with mixed amounts of series/individual novels etc, and it worked perfectly. Most impressive. One major gripe: why couldn't you have done this last year, and saved me hours of tedious tracking down author lists? Gonna test it some more. Many thanks. |
07-15-2012, 06:18 PM | #30 |
Guru
|
Just found one problem: if the book is in a series but hasn't got a number, then no series or # is generated. Would it be possible in those cases to use a '0' for the # and still keep the series? (It may just be a tweak for the expression, probably not, but I thought I'd ask anyway, just in case.)
When I did a test on the page for Sir Arthur Conan Doyle, I noticed it didn't generate the series for the Gerard stories, as they have no numbering. |
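The fallback requested above could be as small as the sketch below: keep the series name and substitute a default index when the page gives no number. The `fill_missing_index` helper, the tuple layout and the sample entries are hypothetical, and whether the plugin would handle it this way is not something the thread confirms.

```python
def fill_missing_index(books, default_index=0):
    """Give series books without a number a default index; leave the rest alone."""
    filled = []
    for series, index, title in books:
        if series and index is None:
            index = default_index
        filled.append((series, index, title))
    return filled


# Made-up scraped rows in (series, index, title) form.
scraped = [
    ("Example Series", 1, "First Book"),
    ("Example Series", None, "Unnumbered Story"),
    (None, None, "Standalone Novel"),
]
filled_books = fill_missing_index(scraped)
print(filled_books)
```

Books with no series at all are left without an index, so standalone novels are not pushed into a series by accident.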
|