Old 05-06-2010, 07:31 PM   #1
Giuseppe Chillem
Groupie
Posts: 191
Karma: 134
Join Date: May 2010
Device: IREX DR1000
First Requests and POV of a Newcomer

Hello,
this is my second day using calibre.

Yesterday it crashed while importing my ebooks (over 1000). Today I split the library into 10 parts to avoid the crash. It worked!

Here is what I have learned from the first 36 hours of use:

1) There is no date-time of addition, only date. This is a problem: if you add files to a huge DB and then want to rename those that got a bad name on import, you have to scan the whole DB. With date and time together, you could sort on this field and all the latest entries would be at the top or the bottom of the list. This would make finding badly named books easier, IMHO.

2) As I have written, the program crashed while importing 1041 books. This creates a dangerous situation. Calibre seems to recognize the same file by book name. You can get duplicates in 2 situations. The first: the filename is the same. The second: the tags of the files are unset, and this leads to duplicates. If the program crashes and/or you reimport an already imported directory, you will get both kinds of duplicates. Instead, the program should be able to tell files that are physically the same apart from files that merely share an inferred title, and skip the former.
Physical sameness should be taken into account when importing files. It is easy to tell whether the same physical file has already been imported and so needs to be skipped: use a combination of CRC32 and SIZE. Add those fields to the DB. It will give 99.9999999% accuracy.
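
For anyone who wants to try the CRC32 + SIZE idea outside calibre, here is a minimal Python sketch. It is not how calibre detects duplicates. The names `file_fingerprint` and `split_new_and_seen` are hypothetical, and the plain set of known fingerprints only stands in for the extra DB fields proposed above.

```python
import os
import zlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, crc32) for the file at `path`.

    The file is read in chunks so that large books are never
    loaded into memory in one piece.
    """
    crc = 0
    size = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            size += len(chunk)
            crc = zlib.crc32(chunk, crc)
    return size, crc & 0xFFFFFFFF


def split_new_and_seen(paths, known_fingerprints):
    """Partition `paths` into (new, already_seen).

    `known_fingerprints` is a set of (size, crc32) pairs for files that
    were imported earlier (hypothetical stand-in for DB columns). It is
    updated in place, so a second copy inside the same batch is also
    reported as already seen.
    """
    new, seen = [], []
    for path in paths:
        fingerprint = file_fingerprint(path)
        if fingerprint in known_fingerprints:
            seen.append(path)
        else:
            known_fingerprints.add(fingerprint)
            new.append(path)
    return new, seen
```

After a crashed import, rerunning `split_new_and_seen` over the same directory with the saved fingerprint set would report every file that was already imported as seen, whatever it is named.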

3) Auto Tagging: Many of us have already created a directory structure to categorize books: one dir = one or more tags, the same for every book in it. During import it would be good to automatically assign one or more tags to the imported books.

Ok, that's all.

Giuseppe Chillemi
Old 05-06-2010, 08:41 PM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Giuseppe Chillem View Post
1) There is no date-time of addition, only date.
Calibre stores date and time, it just doesn't display the time. Sorting sorts by time as well as date.

Quote:
2) As I have written, the program crashed while importing 1041 books.
I added over 8,000 books in one shot without a crash.

Quote:
Calibre seems to recognize the same file by book name. You can get duplicates in 2 situations. The first: the filename is the same. The second: the tags of the files are unset, and this leads to duplicates. If the program crashes and/or you reimport an already imported directory, you will get both kinds of duplicates. Instead, the program should be able to tell files that are physically the same apart from files that merely share an inferred title, and skip the former.
If you turn on the option under Preferences |Add/Save| "If books with similar titles and authors found, merge" calibre will automatically skip the same book in the same format. If that option is off, it offers to add the book as a duplicate record.

Quote:
3) Auto Tagging: Many of us have already created a directory structure to categorize books: one dir = one or more tags, the same for every book in it. During import it would be good to automatically assign one or more tags to the imported books.
You can add tags to groups of ebooks by selecting them and bulk adding a tag.
Old 05-06-2010, 09:07 PM   #3
speakingtohe
Wizard
Posts: 4,812
Karma: 26912940
Join Date: Apr 2010
Device: sony PRS-T1 and T3, Kobo Mini and Aura HD, Tablet
Just curious: how do you tell that a file is physically the same? File name and size, or scanning the whole book? If file name and size, that is not foolproof, but it is pretty easy for a person to do manually.
Calibre seems to have a much more sophisticated approach, which does not always find every copy of the same book but finds a remarkable number.
AFAIK it has not caught all duplicates (impossible, I think), but I imported 47,000 files and ended up with 21,000 ebooks.
I had already sorted out the obvious duplicates based on file name/size (took about an hour). Needless to say, I was impressed at how much work Calibre did for me. And contrary to what you imply, it might have missed some, but it does not seem to have mismatched any. And Calibre never destroys your original copy. Nothing dangerous there that I can see.

I don't imagine it will ever perform magic tasks such as figuring out each individual user's file naming conventions and directory structures, but if you spend a little time using it, it will make it easier for you to do this yourself.

For instance, you could add the files from one directory or group of directories and use bulk edit to put in the appropriate tags.

BTW, Calibre crashed on my first import attempt, but it did import my 47,000 files without crashing when I used my spare laptop solely for that purpose. It took about a day, but it did it.

Helen
Old 05-06-2010, 09:45 PM   #4
Worldwalker
Curmudgeon
Posts: 3,085
Karma: 722357
Join Date: Feb 2010
Device: PRS-505
I had a problem with crashes when I was importing about 1000 files, but that was with a build from a few months back. It's behaved itself ever since for me, at least.

To the OP, here's a way to get the right tags on the right files:

1. Mark all your existing books with a tag like the one I use: [processed]
2. Import your new books.
3. Search for those books that don't have a [processed] tag (I keep this as a saved search)
4. Control-A to select all the non-[processed] books you've found.
5. Bulk edit and put in your tag(s) of choice for that group.
Repeat 2-5 until you've imported them all.

Though ... thinking about it ... we have options to set the book title, etc., from the file name ... when importing books from a single folder, or a tree of folders, it would kind of be nice to have an option to have it automatically set tags based on the folder name(s), starting with the one you selected. So if you start your import in \fiction, which has below it \mystery, \fantasy, and \sf, and below \sf you have \retro and \military, every book imported in that batch would get a "fiction" tag, plus, if relevant, "mystery", "fantasy", or "sf", and some of the "sf" books would get "retro" or "military" too.
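
As a rough illustration of the folder-to-tag idea, here is a small Python sketch. It is only a sketch of the mapping described above, not an existing calibre option; `tags_from_folders` is a hypothetical helper, and lowercasing the folder names is an assumption of mine.

```python
from pathlib import PurePath


def tags_from_folders(import_root, book_path):
    """Return one tag per folder, from `import_root` down to the book's folder.

    `import_root` is the folder selected for the import; `book_path` must
    lie somewhere below it. The file name itself never becomes a tag.
    """
    root = PurePath(import_root)
    relative = PurePath(book_path).relative_to(root)
    # The selected folder contributes the first tag, then each subfolder.
    folders = [root.name] + list(relative.parts[:-1])
    return [name.lower() for name in folders if name]
```

With the layout from the post, `tags_from_folders("/library/fiction", "/library/fiction/sf/military/book.epub")` gives `["fiction", "sf", "military"]`, while a book sitting directly in the fiction folder only gets `["fiction"]`.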

The idea of automatically assigning tags on import has been requested before. So how about options like:

--------------------------------------------------------------
IMPORT AUTO-TAGGING OPTIONS
[ ] strip all existing tags
[ ] assign the following tag string: [______________________]
[ ] assign tags by folder names
--------------------------------------------------------------

So if I'm importing a bunch of books from PG, I could pick the first and third options, and it would ditch PG's crappy LoC tags and assign each one a tag of gutenberg, so I know where it came from.

I think Giuseppe's idea, and mine (and several other people's in the past) might be worth following up on. It's less of an issue for those of us who have our collections in calibre already, but it would sure make life easier on someone with a few thousand books to import, and make the transition from filesystem-as-metadata to tags-as-metadata much easier for newer users. (and the latter might qualify as a "fewer users bugging Kovid" class of improvement)
Old 05-07-2010, 07:59 AM   #5
Starson17
Wizard
Quote:
Originally Posted by Worldwalker View Post
I had a problem with crashes when I was importing about 1000 files, but that was with a build from a few months back. It's behaved itself ever since for me, at least.
Crashes are most often associated with importing PDFs. If he can avoid those in large imports, it may help.

Quote:
The idea of automatically assigning tags on import has been requested before. So how about options like:
... I think Giuseppe's idea, and mine (and several other people's in the past) might be worth following up on.
There are basically two options - write it yourself (calibre has an easy to set up development environment) or post a request in the bug tracker and hope someone else agrees with you that it's useful and decides to write it for you.
Old 05-07-2010, 10:06 AM   #6
Worldwalker
Curmudgeon
Quote:
Originally Posted by Starson17 View Post
There are basically two options - write it yourself (calibre has an easy to set up development environment) or post a request in the bug tracker and hope someone else agrees with you that it's useful and decides to write it for you.
Hopefully a month or so from now, I'll have finished moving and be set up in my new place. Then I guess it'll be time to learn Python.
Old 05-07-2010, 11:05 AM   #7
Starson17
Wizard
Quote:
Originally Posted by Worldwalker View Post
Hopefully a month or so from now, I'll have finished moving and be set up in my new place. Then I guess it'll be time to learn Python.
You'll have fun. Python is easy to work in and contributions are always welcome.
Old 05-07-2010, 12:02 PM   #8
Giuseppe Chillem
Groupie
Quote:
Originally Posted by speakingtohe View Post
Just curious: how do you tell that a file is physically the same? File name and size, or scanning the whole book? If file name and size, that is not foolproof, but it is pretty easy for a person to do manually.
Helen
If you have a directory full of e-books, each ebook has a size and a CRC32 which are always the same. If you store the CRC32 in Calibre's DB and search for it (and the file size) during a new import, you can tell with 100% accuracy whether the file is a duplicate that has already been imported.
CRC32 and file size is the method that duplicate-finding software uses to find duplicate files on hard drives (a good program for this is CSPY, "Clone Spy").

Giuseppe Chillemi
Old 05-07-2010, 12:11 PM   #9
Giuseppe Chillem
Groupie
Quote:
Originally Posted by Starson17 View Post
Calibre stores date and time, it just doesn't display the time. Sorting sorts by time as well as date.
This is good news.

Quote:
Originally Posted by Starson17 View Post
I added over 8,000 books in one shot without a crash.
The fact you were lucky doesn't mean there isn't a bug inside the program :-)

Quote:
Originally Posted by Starson17 View Post
If you turn on the option under Preferences |Add/Save| "If books with similar titles and authors found, merge" calibre will automatically skip the same book in the same format. If that option is off, it offers to add the book as a duplicate record.
Here comes the problem. Many books I have imported, while different, have the same tag. For example, I have found a lot of "UNREGISTERE CHP Professional" duplicates.

CRC32 + FILESIZE gives a 100% accurate match on duplicates. It is very easy to implement: the file size is already there, and the CRC32 can be 1) calculated with a simple function available in Python, or 2) inherited from the file system, which should have the CRC32 stored in the file header (if I am not wrong).
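
On option 2, a hedged note: as far as I know, ordinary file systems do not expose a per-file CRC32 that a program can simply read, so option 1 is the general route. ZIP containers are different, because each entry's header records the CRC-32 of its uncompressed contents, and EPUB files are ZIP containers. The sketch below reads those stored values with Python's standard `zipfile` module; `zip_entry_crcs` is a hypothetical helper name, and nothing here is taken from calibre.

```python
import zipfile


def zip_entry_crcs(source):
    """Map each entry name in a ZIP container to the CRC-32 in its header.

    `source` may be a path or a binary file-like object. The values are
    read from the archive's directory, so no entry is decompressed.
    """
    with zipfile.ZipFile(source) as archive:
        return {info.filename: info.CRC for info in archive.infolist()}
```

Two EPUBs whose entry names and CRCs all match very likely hold the same content even if the containers differ byte for byte, for example because they were zipped with different compression settings. For PDF, CHM and other non-ZIP formats this shortcut does not apply.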

Quote:
Originally Posted by Starson17 View Post
You can add tags to groups of ebooks by selecting them and bulk adding a tag.
As time + date are already stored there is a simpler solution: sort the books by import date and then bulk add files.

Giuseppe Chillemi
Old 05-07-2010, 12:13 PM   #10
Starson17
Wizard
Quote:
Originally Posted by Giuseppe Chillem View Post
If you have a directory full of e-books, each ebook has a size and a CRC32 which are always the same. If you store the CRC32 in Calibre's DB and search for it (and the file size) during a new import, you can tell with 100% accuracy whether the file is a duplicate that has already been imported.
This is true, but it doesn't help you very much to know that it's 100% the same book. Maybe the user wants to add it anyway. I had lots of books that were CRC matched, but had different filenames. Each multiple author book was stored under both author names with the author name as part of the title. I wanted them added until I could edit the metadata and list the multiple authors on one copy, then delete the other.

It seems more useful to me to identify duplicates based on title and/or author, then ask. Most of my duplicates weren't 100% CRC duplicates anyway.
Old 05-07-2010, 12:22 PM   #11
Starson17
Wizard
Quote:
Originally Posted by Giuseppe Chillem View Post
The fact you were lucky doesn't mean there isn't a bug inside the program :-)
The only known crashing relates to reading metadata from pdf books. That's due to the library calibre uses for reading pdf metadata and/or malformed pdfs, not calibre. Turn off pdf metadata reading or just don't import them with other books.

Quote:
Here comes the problem. Many books I have imported, while different, have the same tag. For example, I have found a lot of "UNREGISTERE CHP Professional" duplicates.
I'm not quite sure what you are saying, but if you are saying that you get duplicate titles, or other duplicate metadata, that often comes from importing books with bad internal metadata. I usually read metadata only from the filename, unless I'm sure that the books being imported have good internal metadata.

Quote:
Quote:
You can add tags to groups of ebooks by selecting them and bulk adding a tag.
As time + date are already stored there is a simpler solution: sort the books by import date and then bulk add files.
It looks like you didn't understand what I was suggesting. I suggested that you add the books, sort them any way you want (by time added, author, etc.), select them and bulk edit the metadata to add the tags you want.
Old 05-07-2010, 01:00 PM   #12
Giuseppe Chillem
Groupie
Quote:
Originally Posted by Starson17 View Post
This is true, but it doesn't help you very much to know that it's 100% the same book. Maybe the user wants to add it anyway. I had lots of books that were CRC matched, but had different filenames. Each multiple author book was stored under both author names with the author name as part of the title. I wanted them added until I could edit the metadata and list the multiple authors on one copy, then delete the other.

It seems more useful to me to identify duplicates based on title and/or author, then ask. Most of my duplicates weren't 100% CRC duplicates anyway.
Believe me: if CRC32 + FILESIZE are the same for 2 books, these 2 books are identical! This is the technique used by all software that scans for duplicates on the hard drive.

There is only a 0.00000001% chance they are different. So "100%" really means 99.99999999%.
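
A back-of-the-envelope check of that figure, as a Python sketch. It assumes the CRC32 values of different files behave like uniformly random 32-bit numbers, which is an idealization, and it ignores the size comparison, which can only lower the odds further. The function names are hypothetical.

```python
import math

CRC32_VALUES = 2 ** 32  # number of distinct 32-bit checksums


def pair_collision_probability():
    """Chance that two different files share a CRC32 (uniform assumption)."""
    return 1 / CRC32_VALUES


def any_collision_probability(n_files):
    """Birthday approximation: chance that at least one pair among
    `n_files` different files shares a CRC32."""
    pairs = n_files * (n_files - 1) / 2
    # 1 - exp(-pairs / N), written with expm1 for accuracy at small values.
    return -math.expm1(-pairs / CRC32_VALUES)
```

Under that assumption a single pair collides with probability 2^-32, roughly 0.000000023%, which is the same order of magnitude as the figure quoted above. Across a whole library the chance that some pair collides grows with the number of pairs: for 1,041 different files the approximation gives roughly 0.013%, which is why comparing the size as well (or the full bytes of any candidate pair) is a sensible extra step.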

Giuseppe Chillemi
Old 05-07-2010, 01:22 PM   #13
kovidgoyal
creator of calibre
Posts: 45,598
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The problem isn't that files with the same hash will be different, the problem is that files with different hashes may be the same. For example they may have slightly different metadata or have stored annotations, and still be logically the same book.
Old 05-07-2010, 02:03 PM   #14
Starson17
Wizard
Quote:
Originally Posted by kovidgoyal View Post
The problem isn't that files with the same hash will be different, the problem is that files with different hashes may be the same.
Exactly. During my importing process, I found ebooks I'd obtained from the Gutenberg Project years ago, and later versions of the same book from GP that had been edited to fix scanning errors. Many of those near-duplicates I'd originally obtained in format 1 had been converted to formats 2 and 3 in one of my mass conversion efforts, which produced more near-duplicates.

I'm not saying it's useless information to know which have the same hash, but Calibre can't use that information to automatically do anything for me. It will still have to ask what I want done. Sometimes if the hash matches, I want it added anyway (multiple author situations) and other times, even with hash differences, I don't want it added (it's the same book, but an earlier version without my bookmarks or with scanning errors not yet corrected).
Old 05-07-2010, 03:19 PM   #15
Giuseppe Chillem
Groupie
Quote:
Originally Posted by kovidgoyal View Post
The problem isn't that files with the same hash will be different, the problem is that files with different hashes may be the same. For example they may have slightly different metadata or have stored annotations, and still be logically the same book.
You are right. Nice shot!

However, if you are in the early stage of importing books (for example, merging book collections) and you have not changed the metadata, CRC32 + SIZE gives you a 100% hit.

Thanks to your POV, I wish to change my request a little.

Here is the target scenario:

Calibre crashes during import. Some of the files have been imported. Some of these files have metadata identical to that of other files (I have found some CHM files with the same "Generated by Unregistered Version"). If you discard duplicates, you discard false duplicates too. This actually happens: I encountered this problem the very first time I used Calibre.

Here is the proposal:

A two-round check: the first is CRC32 + SIZE, the second is the current mechanism. This would give you 3 lists: 1) physical duplicates, 2) physical and metadata duplicates, 3) metadata duplicates.

Then you ask the user: DUPLICATES FOUND, what do you want to delete? "Same Physical Files; Same Physical Files + Metadata; Only Metadata; None"
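
The two-round check could be sketched in Python roughly as follows. This is an illustration of the proposal only, not calibre code: `Book`, `classify_duplicates`, and the `meta_key` field (for example a normalized title and author pair) are hypothetical, and the fingerprint is assumed to have been computed beforehand from file size and CRC32.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    path: str
    fingerprint: tuple  # (size_in_bytes, crc32), computed elsewhere
    meta_key: tuple     # e.g. (normalized title, normalized author)


def classify_duplicates(incoming, library):
    """Sort `incoming` books into the three proposed lists plus 'unique'.

    Round one compares fingerprints, round two compares metadata keys.
    Both rounds test against the library as a whole, so a book counts
    as 'physical_and_metadata' when each test finds some match.
    """
    known_fingerprints = {book.fingerprint for book in library}
    known_meta = {book.meta_key for book in library}
    result = {
        "physical": [],
        "physical_and_metadata": [],
        "metadata": [],
        "unique": [],
    }
    for book in incoming:
        same_file = book.fingerprint in known_fingerprints
        same_meta = book.meta_key in known_meta
        if same_file and same_meta:
            result["physical_and_metadata"].append(book)
        elif same_file:
            result["physical"].append(book)
        elif same_meta:
            result["metadata"].append(book)
        else:
            result["unique"].append(book)
    return result
```

In the crash scenario above, the files that were already imported land in "physical_and_metadata" and can be skipped safely, while different CHM files that merely share the "Generated by Unregistered Version" title land in "metadata", where the user can be asked before anything is discarded.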

What do you think about this proposal?

Giuseppe Chillemi
All times are GMT -4. The time now is 03:03 PM.

