#1
Calibre Plugins Developer
Posts: 4,721
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Metadata scraper plugin api
One of the plugins I have been considering writing is mentioned in this post.
I like the idea of a plugin into which you can add scripts to scrape data from websites for populating the book metadata in Calibre. I know Kovid is working on rewriting the metadata download API, but I doubt (correct me if wrong!) that he is considering going to this extent.
The idea would be that users could right-click on a book and run one of their scripts, which would scrape the data and populate whatever metadata fields they chose, be it identifiers, standard metadata fields like series/tags, or custom columns. I believe this sounds very similar to the news recipe stuff in terms of basic infrastructure (scripts written in Python, able to be added by users, etc.).
Assuming you are still reading and haven't thought this a really bad idea, the first consideration I have is the API that users would have available in their scripts. I don't want to get too clever about restricting it to make it user friendly. At the end of the day the scripts will be written by people who will have to know Python, and any attempt to wrap stuff will invariably lead to restrictions that I will come to regret in future versions of Calibre. I like what Kovid did with the plugins API in terms of not limiting the sandbox you can play in, even though this means you have to get a bit dirty poring through the Calibre source code to learn how to do stuff.
With that said, the plugin is focused around scraping data for a book. So getting the user to write code that inherits from a base class and overrides a function that is passed a populated Metadata object for the current row seems sensible. On that object the user can get/set as they please all the standard metadata fields and identifiers. What I don't believe they can get access to (Charles will correct me if wrong!) is the custom column fields, which are on the db object. So I could also pass in a db object (which would give users more flexibility by also letting them do things like scrape covers).
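A minimal sketch of the base-class idea described above, not working plugin code. `BookMetadata` is a hypothetical stand-in for the populated Metadata object, and `ScraperScript`, `scrape` and the `db` argument are names invented here for illustration; none of them are part of the Calibre API. The "scraped" values are hard-coded so the sketch runs without a network connection.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BookMetadata:
    """Hypothetical stand-in for the Metadata object (not the real class)."""
    title: str
    authors: list = field(default_factory=list)
    tags: list = field(default_factory=list)
    series: Optional[str] = None
    identifiers: dict = field(default_factory=dict)


class ScraperScript:
    """Hypothetical base class that a user script would inherit from."""
    name = "Unnamed scraper"

    def scrape(self, mi, db=None):
        """Override this: read and update mi in place. db is optional."""
        raise NotImplementedError


class ExampleScraper(ScraperScript):
    name = "Example scraper"

    def scrape(self, mi, db=None):
        # A real script would fetch and parse a web page here.
        # These values are placeholders, not scraped data.
        mi.identifiers["example"] = "12345"
        if "Scraped" not in mi.tags:
            mi.tags.append("Scraped")
        return mi


book = BookMetadata(title="Some Book", authors=["A. Author"])
ExampleScraper().scrape(book)
print(book.identifiers, book.tags)
```

Making `db` an optional argument is just one way to keep simple scripts simple while still leaving room for custom columns or covers.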
All thoughts welcome - good/bad idea, other fields I should consider passing etc. It's just vapourware at this point.
EDIT: A potential technical challenge - is it possible to write a script that inherits from a class that exists only in a Calibre plugin zip file? Or would it require the base class to sit in the Calibre codebase, which could rather scupper the whole idea without Kovid's support?
I guess the other consideration is that you could argue this plugin overlaps functionality-wise with the standard metadata and cover download plugins - the difference being that the user would be able to choose at a granular level which to run and (perhaps) retrieve more data fields than is possible using the current API at least. Perhaps the whole idea does become redundant with Kovid's new API - I'm just guessing at this point.
Last edited by kiwidude; 03-06-2011 at 07:01 AM.
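On the question in the EDIT about inheriting from a class that lives only in the plugin: one generic Python approach (an assumption on my part, not something confirmed for Calibre's plugin loader) is to execute the user's script text in a namespace that already contains the base class, then collect the subclasses it defines. `ScraperBase`, `load_scrapers` and the example script are hypothetical names.

```python
class ScraperBase:
    """Hypothetical base class that would live inside the plugin."""

    def scrape(self, mi):
        raise NotImplementedError


def load_scrapers(source, filename="<user script>"):
    """Run user script text and return the ScraperBase subclasses it defines."""
    # The base class is injected, so the script needs no import statement.
    namespace = {"ScraperBase": ScraperBase}
    exec(compile(source, filename, "exec"), namespace)
    return [
        obj for obj in namespace.values()
        if isinstance(obj, type)
        and issubclass(obj, ScraperBase)
        and obj is not ScraperBase
    ]


USER_SCRIPT = '''
class PriceScraper(ScraperBase):
    def scrape(self, mi):
        mi["price"] = "9.99"  # placeholder value, no real scraping
        return mi
'''

scrapers = load_scrapers(USER_SCRIPT)
result = scrapers[0]().scrape({"title": "Some Book"})
print(scrapers[0].__name__, result)
```

Note that `exec` runs the script with no sandboxing at all, so this only suits scripts the user wrote or trusts.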
#2
Grand Sorcerer
Posts: 12,336
Karma: 8012652
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
I am not sure whether passing a db handle is a good idea or not. If you pass it, you are giving away the keys to the city. If you don't, then there are some things like plugin custom data that the subclass would be unable to get.
#3
creator of calibre
Posts: 45,190
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The only use I can see for this functionality is to have plugins that populate custom columns. And a better solution for that, IMO, is to use plugboard-type functionality that allows the user to tell a standard metadata download plugin to copy/move some metadata from a standard field to a custom field.
#4
Calibre Plugins Developer
Posts: 4,721
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Thanks chaley/Kovid for the replies.
If you can do that, then in theory you should be able to do most of what I want with a metadata download plugin (custom columns set on an mi returned by the API will get saved, right?). So I think it requires a few things:
(1) User control over which plugin to run. By user control I don't mean the chore of drilling into the plugin preferences dialog and enabling/disabling plugins. Instead, a simple right-click to execute a particular metadata download or cover download plugin that you have installed.
(2) Enhanced versions of the metadata download plugins to retrieve more data than just the "standard" fields they do currently. For instance, an example by the OP in the thread I referenced above was getting the price for a book. The enhanced metadata download plugin would need to scrape all the data it can that might be "interesting" for the user to choose from?
(3) Configuration to grab the data and assign it to columns for that plugin. As you have said, a plugboard-type approach could be used, but I don't think that just the "standard" metadata fields would be sufficient?
Any thoughts?
#5
creator of calibre
Posts: 45,190
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
It's easy enough to stuff non-standard metadata into a dict that the plugboard system can use to populate user specified custom columns.
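A rough sketch of how that might look, under stated assumptions: the plugin returns its non-standard values in a plain dict, and the user's plugboard is a mapping from dict key to destination column. The function name, the dict keys and the `#price`/`#pages`/`#awards` column names are all made up for illustration; this is not the existing plugboard code.

```python
def apply_plugboard(extra, plugboard):
    """Return {destination column: value} for each mapped key found in extra."""
    updates = {}
    for source_key, destination in plugboard.items():
        value = extra.get(source_key)
        # Skip keys the plugin did not supply or left empty.
        if value is None or value == "":
            continue
        updates[destination] = value
    return updates


# Hypothetical non-standard metadata produced by a download plugin.
extra = {"price": "9.99", "page_count": 320, "awards": ""}

# Hypothetical user configuration: source key -> custom column lookup name.
plugboard = {"price": "#price", "page_count": "#pages", "awards": "#awards"}

updates = apply_plugboard(extra, plugboard)
print(updates)
```

Keeping the mapping as user configuration means the download plugin itself never needs to know which custom columns exist in a given library.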
#6
Grand Sorcerer
Posts: 12,336
Karma: 8012652
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Putting aside the finding of interesting data ...
If, for example, a plugin assigns a value to mi.price, then the template engine can retrieve that value. Currently doing so requires using the raw_field function, but there is no overriding reason that normal template references couldn't work. The only issue is field-specific formatting, but given that the template engine cannot know what the field means, the only thing the engine can choose to do is produce a string.
Given the above, it is easy to imagine 'plugboards' that massage raw data and store it into arbitrary metadata fields. We have something similar to that function in metadata search/replace, which handles type and is_multiple mismatches.
As far as I can see, the creation and maintenance of the screen maps describing where 'interesting' information is and how it is to be scraped is the hard part. Especially maintenance, given that the page formats change on a regular basis. I did something like this for regression testing of an application. It turned out that maintaining the testing scraper was as hard as maintaining the application.
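To illustrate the kind of type and is_multiple massaging mentioned above, here is a small standalone sketch. It is not the search/replace code; `coerce_value`, its parameters and the datatype labels used here are assumptions chosen for the example.

```python
def coerce_value(raw, datatype="text", is_multiple=False, separator=","):
    """Convert a raw scraped value into the shape a destination field expects."""
    if is_multiple:
        # Destination holds a list: split strings, pass sequences through.
        if isinstance(raw, (list, tuple)):
            items = [str(item).strip() for item in raw]
        else:
            items = [part.strip() for part in str(raw).split(separator)]
        return [item for item in items if item]

    # Destination holds a single value: flatten sequences to one string.
    if isinstance(raw, (list, tuple)):
        raw = ", ".join(str(item) for item in raw)
    text = str(raw).strip()

    if datatype == "int":
        return int(text)
    if datatype == "float":
        return float(text)
    return text


price = coerce_value("9.99", "float")
genres = coerce_value("Fantasy, Epic ,", "text", is_multiple=True)
joined = coerce_value(["a", "b"], "text")
pages = coerce_value(" 320 ", "int")
print(price, genres, joined, pages)
```

A value that cannot be converted (say, `"n/a"` into an int column) raises `ValueError` here; a real plugboard would have to decide whether to skip the field or report the problem.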