Features Discussion

zenocon · 04-13-2010, 01:07 PM

Hi, I've been dabbling with building my own OSS ebook manager. I have a fairly large collection of PDF/CHM files that I use for reference material.

Considering that building your own application like this is a fairly large undertaking, I've also scanned the interwebs for other solutions out there.

Calibre has some great features, and is a great project -- kudos to the author(s). However, if I were going to build it, I'd do a couple things slightly differently, or I'd add a few things. I'd be curious to get the contributors' feedback on this wishlist, and whether each item is:

a) something that may be planned for a future release
b) something that can be contributed as a plugin
c) not an option

#1 Automatically finding ISBN from the text itself. I have software that does this with fairly high accuracy. Right now it handles only PDF and CHM, but technically, there isn't anything to prevent me from adding other formats. If the ISBN is in the text, my code will find it. It deals with malformed ISBNs, multiple ISBNs, and dupes. I use AWS to look up candidates on Amazon, and I have an algorithm that fuzzy-matches the results against the file name, so that when a text contains more than one ISBN I can tell which one is close to the file I'm looking at.

This feature is key for me. Using calibre now, I have to do this manually for every file, which is just going to take me waaaayy too long. I know the code is in Python/C++. My code is in Java. You can bridge via JNI. I would love to get this in there somehow. It is my number one showstopper.
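A minimal sketch of what an in-text ISBN search like the one described in #1 might look like, written in Python rather than the poster's Java. The pattern and the function names are assumptions made for illustration; this is not zenocon's code. It pulls ISBN-shaped digit runs out of extracted text, drops candidates that fail the ISBN-10 or ISBN-13 check digit (the "malformed" case), and removes repeats.

```python
import re

# Hypothetical candidate pattern: an optional 978/979 prefix, then ten
# ISBN characters, with single hyphens or spaces allowed between them.
# The lookarounds stop it from matching inside a longer run of digits.
CANDIDATE = re.compile(r"(?<!\d)((?:97[89][\s-]?)?(?:\d[\s-]?){9}[\dXx])(?!\d)")


def is_valid_isbn10(s):
    """Check digit test for ISBN-10: weighted sum (10..1) divisible by 11."""
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """Check digit test for ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return the valid ISBNs found in text, in order of appearance, without repeats."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[\s-]", "", match.group(1)).upper()
        if (is_valid_isbn10(digits) or is_valid_isbn13(digits)) and digits not in found:
            found.append(digits)
    return found
```

The check digit is what makes a plain text scan workable: most random digit runs that happen to have the right length fail it, so the pattern can afford to be loose.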

#2 Separate the library db itself from scratch/import area. The reason for this is fairly straightforward. I don't really want to import into my main library until after I have tagged everything correctly, and am certain I want to import the books into my library. The view, as it is now...shows everything. What if I dragged some new books into the library, and then sorted...it is hard to find them again, and they are in an inconsistent state w/o proper meta-info. So, I'm looking for a scratch area, where I can drag files in...lookup meta-info, and then select them or select-all, and import into the main library, which essentially just inserts them into SQLite, and moves the files for me.

#3 Different UI for Edit MetaInformation in Bulk. It would be easier/faster to have this form be on the side of the UI. If you select an individual file, the form is populated. If you select multiple files, the form is populated with any fields that are the same and for variance it can use something like <varies>. The user can simply type in a new string and tab or enter, and it commits it. This is much faster than having to right-click and navigate a menu to get to the form. For an example of how this works with MP3 files, take a look at

In the bottom-left pane, there is a tag form. If you have multiple items selected in the main pane, and you edit that form, it commits them to all items selected. This is what I'm looking for.
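The side-form behaviour described in #3 can be sketched in a few lines of Python. This is only an illustration of the idea, assuming each book is a plain dict with string values; `form_values`, `commit`, and the record layout are hypothetical names, not anything in calibre.

```python
VARIES = "<varies>"


def form_values(selected, fields):
    """Return what the form should show for each field of the current selection."""
    values = {}
    for field in fields:
        # Collect the distinct values of this field across the selection.
        distinct = {record.get(field) for record in selected}
        if not distinct:
            values[field] = ""          # nothing selected: empty form
        elif len(distinct) == 1:
            values[field] = distinct.pop()  # every selected book agrees
        else:
            values[field] = VARIES
    return values


def commit(selected, field, new_value):
    """Write the value typed into the form to every selected record."""
    for record in selected:
        record[field] = new_value
```

For example, selecting two books with the same publisher but different titles would show the publisher and `<varies>` for the title; typing a new publisher and pressing enter would call `commit` once for the whole selection.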

#4 Duplicate detection. There is none currently, as far as I can tell. Since all my books do have an ISBN, I'd like to be able to find/detect dupes and easily remove them.
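A rough sketch of ISBN-based duplicate detection as asked for in #4, assuming the library can be read as `(book_id, isbn)` pairs. That input shape and the function names are assumptions for illustration. Because the same edition can carry either an ISBN-10 or an ISBN-13, the sketch converts everything to ISBN-13 before grouping.

```python
from collections import defaultdict


def to_isbn13(isbn):
    """Strip separators and convert an ISBN-10 to its 978-prefixed ISBN-13 form."""
    digits = isbn.replace("-", "").replace(" ", "").upper()
    if len(digits) == 10:
        core = "978" + digits[:9]  # drop the old check digit
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(core))
        digits = core + str((10 - total % 10) % 10)  # recompute the check digit
    return digits


def duplicates_by_isbn(books):
    """Group book ids by normalized ISBN and keep only groups with more than one id."""
    groups = defaultdict(list)
    for book_id, isbn in books:
        if isbn:  # skip records with no ISBN at all
            groups[to_isbn13(isbn)].append(book_id)
    return {isbn: ids for isbn, ids in groups.items() if len(ids) > 1}
```

Note that this treats different editions of one title as different books, since each edition normally has its own ISBN.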

#5 Minor UI gripe: the main pane that shows the list of books should be draggable for re-size, so you can view the meta-info pane below it.

Regards,
Davis

itimpi · 04-13-2010, 01:28 PM

Just a thought - could not 2) be accomplished if during import you just had a tag such as "SCRATCH" added to all the books just imported? Perhaps implemented via a new preference that allows you to define tag(s) to be automatically added when books are added to the library. Then as you corrected/checked the metadata this tag would be removed, or you would delete the book if you decided you did not want to keep it.

This would seem to avoid the need for a special scratch area and all the code relating to handling it. It also seems to be something that Kovid would be likely to accept as a new feature as people have been looking for some time at an easy way to flag unread books and this could also satisfy that requirement.

Starson17 · 04-13-2010, 01:31 PM

Quote:

Originally Posted by zenocon

#1 Automatically finding ISBN from the text itself.

This isn't a priority for me, but I don't know why it can't be added to the code or by plugin. I can imagine a button on the single metadata fetch dialog screen that locates the ISBN in one or more formats. Note that Calibre has multiple formats stored under each ebook record, so I do have different ISBN stored together in a single record. I could keep them as separate ebook records, but don't. I assume you realize that ISBN has priority in metadata fetching for Calibre, otherwise it uses author/title.

Quote:

#2 Separate the library db itself from scratch/import area.

This is done with tags. There are lots of threads on this issue. With saved searches, it's even easier. This is of zero interest to me.

Quote:

#3 Different UI for Edit MetaInformation in Bulk. It would be easier/faster to have this form be on the side of the UI.

This might be nice. I do have space on my screen, but it's not an option I would care greatly about. It's the kind of thing I'd need to try out to decide if I liked it.

Quote:

#4 Duplicate detection. There is none currently,

This would be easy to implement. You'd have to define a "duplicate." I wrote code that does a fuzzy match on title when adding ebooks (to allow new formats to go directly into existing records). I did not do a fuzzy match on author. When author matched exactly and title fuzzy matched, I considered the two to be the same book and added the new book to the existing record.
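The rule described here (author must match exactly, title may match approximately) could be sketched as follows. This uses Python's standard `difflib`, which is an assumption; the post does not say how the fuzzy match was done, and the `0.9` threshold and the dict layout are likewise made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def same_book(a, b, threshold=0.9):
    """Treat two records as one book if authors are identical and titles are close."""
    if a["author"] != b["author"]:
        return False  # no fuzzy matching on author, as in the post
    ratio = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    return ratio >= threshold
```

Where the threshold sits decides what counts as a "duplicate": set it too low and sequels with near-identical titles get merged, too high and a stray subtitle defeats the match.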

If the only "duplicate" you locate is one with the same ISBN, it would not be of value to me. I have a command line search that will find duplicates in the SQL database, but usually, I just open metadata.db with SQLiteSpy, and call up my saved duplicate search to find dupes. I agree it's not as easy as locating them within Calibre. I considered asking Chaley, (or doing it myself) to add SQL searching into the search line. That would let you save a duplicate SQL search.
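As an illustration of the kind of saved duplicate search mentioned here, the sketch below finds titles that occur more than once, ignoring case. It assumes a table named `books` with `id` and `title` columns; check the actual schema of metadata.db before running anything like it against a real library, and work on a copy. The query text is an assumption, not Starson17's saved search.

```python
import sqlite3

# Titles that occur more than once, compared case-insensitively.
DUPLICATE_TITLES = """
    SELECT lower(title) AS t, group_concat(id) AS ids
    FROM books
    GROUP BY lower(title)
    HAVING count(*) > 1
    ORDER BY t
"""


def find_duplicate_titles(conn):
    """Return (title, [book ids]) for every title shared by two or more rows."""
    rows = conn.execute(DUPLICATE_TITLES).fetchall()
    return [(title, sorted(int(i) for i in ids.split(","))) for title, ids in rows]


def demo_connection():
    """Build a throwaway in-memory table standing in for the real database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO books (id, title) VALUES (?, ?)",
        [(1, "Dune"), (2, "dune"), (3, "Emma")],
    )
    return conn
```

The same `SELECT` can be pasted into a tool such as SQLiteSpy and saved there.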

Quote:

#5 Minor UI gripe: the main pane that shows the list of books should be draggable for re-size,

Funny, I thought it was.

zenocon · 04-13-2010, 01:48 PM

Hey, thanks for the quick reply!

Quote:

Originally Posted by Starson17

This isn't a priority for me, but I don't know why it can't be added to the code or by plugin. I can imagine a button on the single metadata fetch dialog screen that locates the ISBN in one or more formats. Note that Calibre has multiple formats stored under each ebook record, so I do have different ISBN stored together in a single record. I could keep them as separate ebook records, but don't. I assume you realize that ISBN has priority in metadata fetching for Calibre, otherwise it uses author/title.

This one is a big deal for me. Right now the process is:

1) Drag file into calibre
2) Open file in Acrobat or whatever
3) Visually search for the ISBN number
4) Copy to clipboard
5) Edit meta-info in calibre
6) Paste into ISBN field
7) Search for meta-info
8) Save

This is a long slow process, especially if you have thousands of files. I just won't go through it...I'd rather spend the time to write software to do it for me.

Quote:

This is done with tags. There are lots of threads on this issue. With saved searches, it's even easier. This is of zero interest to me.

The tags aren't easily extend-able, tho, right? As far as I can tell, the tags are fixed, which means I have to hi-jack one of them. Can I add my own custom field....that would be very nice, but I didn't see a way to do this.

Quote:

This might be nice. I do have space on my screen, but it's not an option I would care greatly about. It's the kind of thing I'd need to try out to decide if I liked it.

After working this way in the software I use for managing MP3, I have found it to be a huge UX bonus / time saver. I think it is one of those things that you won't realize how nice it is until after you try it, then you won't ever want to go back. The more you can reduce UI clicks / menu / nav to get the most common stuff done, the better.

Quote:

This would be easy to implement. You'd have to define a "duplicate." I wrote code that does a fuzzy match on title when adding ebooks (to allow new formats to go directly into existing records). I did not do a fuzzy match on author. When author matched exactly and title fuzzy matched, I considered the two to be the same book and added the new book to the existing record.

If the only "duplicate" you locate is one with the same ISBN, it would not be of value to me. I have a command line search that will find duplicates in the SQL database, but usually, I just open metadata.db with SQLiteSpy, and call up my saved duplicate search to find dupes. I agree it's not as easy as locating them within Calibre. I considered asking Chaley, (or doing it myself) to add SQL searching into the search line. That would let you save a duplicate SQL search.

Would love to have this, since I know I have a ton of dupes, and would like to clean it up. I know I can hack SQLite itself, but was hoping for something inside the app itself.

Quote:

Funny, I thought it was.

The main pane has a horizontal scrollbar. Below that, I don't see anything that allows me to re-size. If I open the cover-flow, same thing...no re-sizing.

The meta-info pane at the bottom shows the cover image, and has text with a vertical scrollbar, but if you were able to drag that and re-size it up, the vert scrollbar could go away, and make it easier to read.

itimpi · 04-13-2010, 02:11 PM

Quote:

Originally Posted by zenocon

The tags aren't easily extend-able, tho, right? As far as I can tell, the tags are fixed, which means I have to hi-jack one of them. Can I add my own custom field....that would be very nice, but I didn't see a way to do this.

Tags are completely customizable - in fact an empty library will have no tags! If you want a new one simply add it either as a new entry on the tags line (tags are comma separated here) under the edit metadata dialogs, or via the tag editor.

zenocon · 04-13-2010, 02:18 PM

Quote:

Originally Posted by itimpi

Tags are completely customizable - in fact an empty library will have no tags! If you want a new one simply add it either as a new entry on the tags line (tags are comma separated here) under the edit metadata dialogs, or via the tag editor.

My fault, terminology mixup...I meant field, not tag. I know what you are saying with the tags. It would be great, tho to have dynamic fields...apart from the static list of Published, Title, Author(s), Tags, Publisher, Rating, Size(MB), Date, Series.

Starson17 · 04-13-2010, 02:34 PM

Quote:

Originally Posted by zenocon

This one is a big deal for me. Right now the process is:

I'm not sure your process is optimal, but any improvements I'd suggest would still be manual. I'm not a PDF expert, but if PDFs have an ISBN field in the internal metadata, Calibre should read it on import. Why not just process your books before import to make sure they have the right ISBN? (I don't really know if PDFs have an ISBN field or if Calibre reads it, but that's what I'd check first.)

Quote:

Would love to have this, since I know I have a ton of dupes, and would like to clean it up. I know I can hack SQLite itself, but was hoping for something inside the app itself.

Kovid is very good about adding code you provide when you want to enhance Calibre. I've seen other requests for duplicate location, so this is a useful feature that others want. Personally, I'd like to see it added as an SQL based search that I could store, rather than a dedicated duplicate location button. You should be aware that Kovid has made it very easy to set up a development environment for Calibre if you want to get your feet wet in Python.

ficbot · 04-13-2010, 02:42 PM

Quote:

Originally Posted by zenocon

My fault, terminology mixup...I meant field, not tag. I know what you are saying with the tags. It would be great, tho to have dynamic fields...apart from the static list of Published, Title, Author(s), Tags, Publisher, Rating, Size(MB), Date, Series.

Kovid has mentioned this is planned for a future release. I remember discussing tags and remarking I was cluttering up my tag list with about 7 different options for where a book was obtained and how I would like a 'source' tag. Kovid said he is considering customizable fields in future releases.

zenocon · 04-13-2010, 02:45 PM

Quote:

Originally Posted by Starson17

I'm not sure your process is optimal, but any improvements I'd suggest would still be manual. I'm not a PDF expert, but if PDFs have an ISBN field in the internal metadata, Calibre should read it on import. Why not just process your books before import to make sure they have the right ISBN? (I don't really know if PDFs have an ISBN field or if Calibre reads it, but that's what I'd check first.)

There isn't a dedicated internal meta-field for this in the PDF spec, AFAIK, but even if there were, it can't be counted on that it would be valid/populated. I could do something where I batch process all PDFs, scan the text for ISBN, and insert a meta-field, but...it seems like if I already have the ISBN right there, why not use it to fetch the rest of the meta-info from web services?

And other formats may not have meta-info...for example, CHM is just compressed HTML. While one can obviously put meta-info in HTML, you can't count on the files themselves to conform to this....this is why I scan the text myself. Despite how bad that sounds, the ISBN text is almost always within the first 10 or so pages of the document, which means you can quit early.
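The "quit early" point can be sketched like this, assuming page text is available as a lazy iterable so that later pages are never extracted once a hit is found. The function name, the `max_pages` default, and the pattern are assumptions for illustration; the pattern is only a hint that looks for a labelled number and does no check digit validation.

```python
import re
from itertools import islice

# Loose hint: the word ISBN, an optional -10/-13 suffix, then 10 to 13
# ISBN characters with optional single hyphens or spaces between them.
ISBN_HINT = re.compile(r"ISBN(?:-1[03])?[:\s]*((?:\d[- ]?){9,12}[\dXx])", re.IGNORECASE)


def first_isbn(pages, max_pages=10):
    """Scan at most max_pages pages; return (page number, isbn) or None."""
    # islice pulls pages one at a time, so a generator that extracts text
    # on demand does no work for pages past the first match or the limit.
    for number, text in enumerate(islice(pages, max_pages), start=1):
        match = ISBN_HINT.search(text)
        if match:
            return number, re.sub(r"[- ]", "", match.group(1)).upper()
    return None
```

With a 500-page PDF whose copyright page is page 3, only three pages get read; a file with no ISBN in the front matter costs ten pages and then gives up.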

Quote:

Kovid is very good about adding code you provide when you want to enhance Calibre. I've seen other requests for duplicate location, so this is a useful feature that others want. Personally, I'd like to see it added as an SQL based search that I could store, rather than a dedicated duplicate location button. You should be aware that Kovid has made it very easy to set up a development environment for Calibre if you want to get your feet wet in Python.

Cool, I may look into it. I think I will build a quick prototype on the Air platform for a simple UI that can scan files, and there now exists a bridge between Air and Java http://www.merapiproject.net/ so I can use that to find the ISBN with my Java lib, look up its info with AWS and display a simple, editable grid view. Just curious to see how it would play out.

Starson17 · 04-13-2010, 02:55 PM

Quote:

Originally Posted by zenocon

There isn't a dedicated internal meta-field for this in the PDF spec, AFAIK, but even if there were, it can't be counted on that it would be valid/populated. I could do something where I batch process all PDFs, scan the text for ISBN, and insert a meta-field

That's sort of what I had in mind, but you are right, it's not suitable for all formats, and I do hate pre-processing books before import. I'm sure an ISBN locator to scan the various formats, find the ISBN and populate Calibre's ISBN field would be a desirable and useful bit of code.

Starson17 · 04-13-2010, 02:58 PM

Quote:

Originally Posted by Starson17

You should be aware that Kovid has made it very easy to set up a development environment for Calibre if you want to get your feet wet in Python.

See here.

zenocon · 04-13-2010, 03:01 PM

Quote:

Originally Posted by Starson17

That's sort of what I had in mind, but you are right, it's not suitable for all formats, and I do hate pre-processing books before import. I'm sure an ISBN locator to scan the various formats, find the ISBN and populate Calibre's ISBN field would be a desirable and useful bit of code.

I'll throw this java code up on github soon -- I need to rip it out into a more shippable library API, as I had it as part of a UI project I started and abandoned.

I think it would be useful as a standalone lib that anyone can use for this functionality. I'm probably not going to work on porting it over to another language anytime real soon, since I do have some deps on other OSS java libs like iText (PDF handling).

Once it is there, I'd be happy to look into how to integrate it if there is interest, or anyone can take it and look into a port.

Worldwalker · 04-13-2010, 06:33 PM

I like the idea of a duplicate finder. Very much so. Hopefully it would fix a problem which I have, where I've got several versions of the same book (for example, from the Baen Free Library and the Baen CDs) and they're showing up as separate entries. They are sometimes, though not always, different file types. I'd like to merge them, or at the very least get rid of the spares without having to hunt them down individually.

The way to handle your "scratch" area with tags is easy:

1. Select your whole library, tag it [processed]
2. Import your new books.
3. Select all books that don't have a [processed] tag
4. As you edit each book's metadata, give it a [processed] tag when you're done
5. Repeat 4 until you run out of unprocessed books.
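The steps above amount to using one tag as a flag and filtering on its absence. A tiny sketch, assuming each book is a dict with a `tags` set; the names are hypothetical and this is plain Python, not calibre's search syntax or API.

```python
PROCESSED = "processed"


def mark_processed(book):
    """Steps 1 and 4: flag a book whose metadata has been checked."""
    book["tags"].add(PROCESSED)


def unprocessed(library):
    """Step 3: the 'scratch area' is simply every book without the flag."""
    return [book for book in library if PROCESSED not in book["tags"]]
```

Newly imported books show up in `unprocessed` automatically because nothing has tagged them yet, which is what makes a separate import area unnecessary.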

By the way, zenocon, welcome to the nuthouse! Nice to meet you -- and it's great to have someone dropping in and, instead of saying "calibre doesn't do what I want, you must change it!" saying "I need calibre to do more things, here, want this code?"

kovidgoyal · 04-14-2010, 02:12 AM

1) adding isbn extraction from PDF files is possible. There's some partway code to do that in calibre already. I haven't completed it as I have higher priorities.

2) Sort by date to locate newly added books

3) select multiple books and click the edit meta information button to bring up the bulk edit dialog, there's no need to navigate menus. Since after initial import most people aren't going to be doing bulk metadata edits I don't really see this as particularly important, but I can be convinced otherwise if someone submits a patch. Personally, I'm not interested.

4) Some duplicate detection is in the works and again if you want more, submit a patch. This is not something I am personally interested in.

5) If you want to browse books with full details, just click the lower pane, at which point it becomes a standalone window with next and previous buttons. The same is true for the cover browser (via a preference).

EDIT: Oh and support for custom fields is happening right now

Starson17 · 04-14-2010, 09:28 AM

Quote:

Originally Posted by Worldwalker

I like the idea of a duplicate finder. Very much so. Hopefully it would fix a problem which I have, where I've got several versions of the same book (for example, from the Baen Free Library and the Baen CDs) and they're showing up as separate entries. They are sometimes, though not always, different file types. I'd like to merge them, or at the very least get rid of the spares without having to hunt them down individually.

Duplicates were one of the driving factors for me to write merge record code, which I submitted while Kovid was on vacation, but which I've been using for a month+ now.

Mostly, I had situations where: 1) I'd have two slightly dissimilar titles - just different enough that my fuzzy title matching code for Add Books didn't match, or 2) the books had been added to Calibre before I wrote that code, or 3) I'd have two slightly different author names, one with a middle initial and one without.

The Add books code (which has limited duplicate detection) is preventing me from getting too many duplicates, so I didn't feel very strongly about the need to improve the duplicate detection or provide a means of searching for duplicates. Even the best duplicate search algorithm is going to make mistakes. Instead, I wanted a quick and easy way to merge two records I consider to be duplicates, whenever I spotted them. That was the purpose of the merge record code.
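To make the idea of merging two records concrete, here is a small sketch. It is not the merge code Starson17 submitted; the record layout (a dict with `formats`, `tags`, and a few scalar fields) and the rule that the kept record wins on conflicts are assumptions for illustration.

```python
def merge_records(keep, other):
    """Fold `other` into a copy of `keep` and return the merged record."""
    merged = dict(keep)
    # Union of formats; if both records have the same format, keep's file wins.
    merged["formats"] = {**other["formats"], **keep["formats"]}
    # Tags from both records, without repeats.
    merged["tags"] = sorted(set(keep["tags"]) | set(other["tags"]))
    # Fill in blank scalar fields from the record being folded in.
    for field in ("isbn", "publisher"):
        if not merged.get(field):
            merged[field] = other.get(field)
    return merged
```

Keeping the title and author of the record the user chose to keep sidesteps the question of which spelling is right, which is the part no automatic duplicate search can settle.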

I will admit that I ran an SQL search on the database to find all books with identical titles. I then went through that list and used my merge code to merge records where appropriate, so I can see value in better duplicate detection. However, as Kovid says, it's something I'd only use during initial import of a large collection.

After initial import I've handled dupes as I import smaller groups of books with a combo of the Add books code (that puts duplicate books of a different format into existing records) and merge code (which lets me collapse two books with slightly different titles or authors into a single record).
