First Requests and POV of a Newcomer

Giuseppe Chillem · 05-06-2010, 08:31 PM

Hello,
it is the second day I am using calibre.

Yesterday it crashed while importing my ebooks (over 1000). Today I have split the library into 10 parts to avoid the crash. It worked!

Here is what I have learned from the first 36 hours of use:

1) There is no date-time of addition, only date. This is a problem: if you add files to a huge DB and then want to rename those that have a bad import name, you have to scan the whole DB. With date and time together, you could sort on this field and all the latest entries would be at the top or at the bottom of the list. This would make finding bad book names easier, IMHO.

2) As I have written, the program crashed while importing 1041 books. This creates a dangerous situation. Calibre seems to recognize the same file by book name. You can get duplicates in two situations. The first: the filename is the same. The second: the tags of the files are unset, and this leads to duplicates. If the program crashes and/or you reimport an already imported directory, you will get both kinds of duplicates. Instead, the program should be able to tell files which are physically the same apart from files which merely have the same inferred title, and skip the former.
Physical sameness should be taken into account when importing files. It is easy to tell whether the same physical file has already been imported and so needs to be skipped: use a combination of CRC32 and size. Add those fields to the DB. It will give 99.9999999% accuracy.

3) Auto Tagging: Many of us have already created some directory structure to categorize books: one dir = one or more tags, all the same. During import it would be good to automatically select one or more tags to be set on the imported books.

Ok, that's all.

Giuseppe Chillemi

Starson17 · 05-06-2010, 09:41 PM

Quote:

Originally Posted by Giuseppe Chillem

1) There is no date-time of addition, only date.

Calibre stores date and time, it just doesn't display the time. Sorting sorts by time as well as date.

Quote:

2) As I have written, the program crashed while importing 1041 books.

I added over 8,000 books in one shot without a crash.

Quote:

Calibre seems to recognize the same file by book name. You can get duplicates in two situations. The first: the filename is the same. The second: the tags of the files are unset, and this leads to duplicates. If the program crashes and/or you reimport an already imported directory, you will get both kinds of duplicates. Instead, the program should be able to tell files which are physically the same apart from files which merely have the same inferred title, and skip the former.

If you turn on the option under Preferences |Add/Save| "If books with similar titles and authors found, merge" calibre will automatically skip the same book in the same format. If that option is off, it offers to add the book as a duplicate record.

Quote:

3) Auto Tagging: Many of us have already created some directory structure to categorize books: one dir = one or more tags, all the same. During import it would be good to automatically select one or more tags to be set on the imported books.

You can add tags to groups of ebooks by selecting them and bulk adding a tag.

speakingtohe · 05-06-2010, 10:07 PM

Just curious. How do you tell that a file is physically the same? File name and size, or scanning the whole book? If it is file name and size, that is not foolproof, but it is pretty easy for a person to do manually.
Calibre seems to have a much more sophisticated approach, which does not always find every copy of the same book but does find a remarkable number.
AFAIK it has not caught all duplicates (impossible, I think), but I imported 47,000 files and ended up with 21,000 ebooks.
I had already sorted out the obvious duplicates based on file name/size (took about an hour). Needless to say, I was impressed at how much work Calibre did for me. And contrary to what you imply, it may have missed some duplicates, but it does not seem to have mismatched any. And Calibre never destroys your original copy. Nothing dangerous there that I can see.

I don't imagine it will ever perform magic tasks such as figuring out each individual user's file naming conventions and directory structures, but if you spend a little time using it, it will make it easier for you to do this yourself.

For instance, you could add the files from one directory or group of directories and use bulk edit to put in the appropriate tags.

BTW, Calibre crashed on my first import try, but it did import my 47,000 files without crashing when I used my spare laptop solely for that purpose. It took about a day, but it did it.

Helen

Worldwalker · 05-06-2010, 10:45 PM

I had a problem with crashes when I was importing about 1000 files, but that was with a build from a few months back. It's behaved itself ever since for me, at least.

To the OP, here's a way to get the right tags on the right files:

1. Mark all your existing books with a tag like the one I use: [processed]
2. Import your new books.
3. Search for those books that don't have a [processed] tag (I keep this as a saved search)
4. Control-A to select all the non-[processed] books you've found.
5. Bulk edit and put in your tag(s) of choice for that group.
Repeat 2-5 until you've imported them all.

Though ... thinking about it ... we have options to set the book title, etc., from the file name ... when importing books from a single folder, or a tree of folders, it would kind of be nice to have an option to have it automatically set a tag based on the folder name(s), starting with the one you selected. So if you start your import in \fiction, which has below it \mystery, \fantasy, and \sf, and below \sf you have \retro and \military, every book imported in that batch would get a "fiction" tag, plus, if relevant, "mystery", "fantasy", or "sf", and some of the "sf" books would get "retro" or "military" too.

The idea of automatically assigning tags on import has been requested before. So how about options like:

--------------------------------------------------------------
IMPORT AUTO-TAGGING OPTIONS
[ ] strip all existing tags
[ ] assign the following tag string: [______________________]
[ ] assign tags by folder names
--------------------------------------------------------------

So if I'm importing a bunch of books from PG, I could pick the first and third options, and it would ditch PG's crappy LoC tags and assign each one a tag of gutenberg, so I know where it came from.

I think Giuseppe's idea, and mine (and several other people's in the past) might be worth following up on. It's less of an issue for those of us who have our collections in calibre already, but it would sure make life easier on someone with a few thousand books to import, and make the transition from filesystem-as-metadata to tags-as-metadata much easier for newer users. (and the latter might qualify as a "fewer users bugging Kovid" class of improvement)
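
As a rough illustration of the folder-name option proposed above, here is a minimal Python sketch. It is not calibre code: the function names `tags_from_folders` and `apply_import_options`, the keyword arguments, and the example paths are all hypothetical, and the three keyword arguments simply mirror the three checkboxes in the mock-up. Forward-slash paths are assumed for portability.

```python
from pathlib import PurePosixPath


def tags_from_folders(import_root, book_path):
    """Derive tags from the folder names between the import root and the book.

    The import root's own name is included, matching the "fiction" example.
    """
    root = PurePosixPath(import_root)
    relative = PurePosixPath(book_path).relative_to(root)
    # Drop the last part, which is the file name itself.
    folders = [root.name] + list(relative.parts[:-1])
    return [name.lower() for name in folders if name]


def apply_import_options(existing, *, strip_existing=False,
                         fixed_tags=(), folder_tags=()):
    """Combine the three hypothetical auto-tagging options into one tag list."""
    tags = [] if strip_existing else list(existing)
    for tag in list(fixed_tags) + list(folder_tags):
        if tag not in tags:  # keep the first occurrence, avoid duplicates
            tags.append(tag)
    return tags


# A book found under fiction/sf/military gets all three folder tags.
print(tags_from_folders("library/fiction",
                        "library/fiction/sf/military/book.epub"))
# -> ['fiction', 'sf', 'military']
```

Keeping the folder walk separate from the option handling means the same folder tags could be previewed before anything is written to the library.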

Starson17 · 05-07-2010, 08:59 AM

Quote:

Originally Posted by Worldwalker

I had a problem with crashes when I was importing about 1000 files, but that was with a build from a few months back. It's behaved itself ever since for me, at least.

Crashes are most often associated with importing PDFs. If he can avoid those in large imports, it may help.

Quote:

The idea of automatically assigning tags on import has been requested before. So how about options like:
... I think Giuseppe's idea, and mine (and several other people's in the past) might be worth following up on.

There are basically two options - write it yourself (calibre has an easy to set up development environment) or post a request in the bug tracker and hope someone else agrees with you that it's useful and decides to write it for you.

Worldwalker · 05-07-2010, 11:06 AM

Quote:

Originally Posted by Starson17

There are basically two options - write it yourself (calibre has an easy to set up development environment) or post a request in the bug tracker and hope someone else agrees with you that it's useful and decides to write it for you.

Hopefully a month or so from now, I'll have finished moving and be set up in my new place. Then I guess it'll be time to learn Python.

Starson17 · 05-07-2010, 12:05 PM

Quote:

Originally Posted by Worldwalker

Hopefully a month or so from now, I'll have finished moving and be set up in my new place. Then I guess it'll be time to learn Python.

You'll have fun. Python is easy to work in and contributions are always welcome.

Giuseppe Chillem · 05-07-2010, 01:02 PM

Quote:

Originally Posted by speakingtohe

Just curious. How do you tell that a file is physically the same? File name and size, or scanning the whole book? If it is file name and size, that is not foolproof, but it is pretty easy for a person to do manually.
Helen

If you have a directory full of e-books, each ebook has a size and a CRC32 which are always the same. If you store the CRC32 in Calibre's DB and search for it (and the file size) during a new import, you can tell with 100% accuracy whether the file is a duplicate that has already been imported.
CRC32 and file size is the approach that duplicate-finding software uses to find duplicates on hard drives (a good program for this is CSPY, "Clone Spy").

Giuseppe Chillemi
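
A minimal sketch of the check described in this post, using Python's standard `zlib.crc32`. The helper names `file_fingerprint` and `find_already_imported` are made up for illustration and are not part of calibre; the set of known fingerprints stands in for the extra DB fields the post asks for.

```python
import zlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, crc32) for a file, reading it in chunks."""
    crc = 0
    size = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # continue the running checksum
            size += len(chunk)
    return size, crc & 0xFFFFFFFF


def find_already_imported(paths, known_fingerprints):
    """Split paths into (new, duplicates) against a set of known fingerprints.

    known_fingerprints is updated in place, so re-running an interrupted
    import skips the files that were already processed.
    """
    new, duplicates = [], []
    for path in paths:
        fingerprint = file_fingerprint(path)
        if fingerprint in known_fingerprints:
            duplicates.append(path)
        else:
            known_fingerprints.add(fingerprint)
            new.append(path)
    return new, duplicates
```

Reading in chunks keeps memory use flat even for large PDFs, and the check only answers "is this byte-for-byte the same file?", not "is this the same book?".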

Giuseppe Chillem · 05-07-2010, 01:11 PM

Quote:

Originally Posted by Starson17

Calibre stores date and time, it just doesn't display the time. Sorting sorts by time as well as date.

This is good news.

Quote:

Originally Posted by Starson17

I added over 8,000 books in one shot without a crash.

The fact that you were lucky doesn't mean there isn't a bug inside the program :-)

Quote:

Originally Posted by Starson17

If you turn on the option under Preferences |Add/Save| "If books with similar titles and authors found, merge" calibre will automatically skip the same book in the same format. If that option is off, it offers to add the book as a duplicate record.

Here comes the problem. Many books I have imported, while different, have the same tag. For example, I have found a lot of "UNREGISTERE CHP Professional" duplicates.

CRC32 + FILESIZE gives a 100% accurate match on duplicates. It is very easy to implement: file size is already there, and CRC32 can be 1) calculated with a simple function available in Python, or 2) inherited from the file system, which should have the CRC32 stored in the file header (if I am not wrong).

Quote:

Originally Posted by Starson17

You can add tags to groups of ebooks by selecting them and bulk adding a tag.

As time + date are already stored there is a simpler solution: sort the books by import date and then bulk add files.

Giuseppe Chillemi

Starson17 · 05-07-2010, 01:13 PM

Quote:

Originally Posted by Giuseppe Chillem

If you have a directory full of e-books, each ebook has a size and a CRC32 which are always the same. If you store the CRC32 in Calibre's DB and search for it (and the file size) during a new import, you can tell with 100% accuracy whether the file is a duplicate that has already been imported.

This is true, but it doesn't help you very much to know that it's 100% the same book. Maybe the user wants to add it anyway. I had lots of books that were CRC matched, but had different filenames. Each multiple author book was stored under both author names with the author name as part of the title. I wanted them added until I could edit the metadata and list the multiple authors on one copy, then delete the other.

It seems more useful to me to identify duplicates based on title and/or author, then ask. Most of my duplicates weren't 100% CRC duplicates anyway.

Starson17 · 05-07-2010, 01:22 PM

Quote:

Originally Posted by Giuseppe Chillem

The fact that you were lucky doesn't mean there isn't a bug inside the program :-)

The only known crashing relates to reading metadata from PDF books. That's due to the library calibre uses for reading PDF metadata and/or malformed PDFs, not calibre itself. Turn off PDF metadata reading or just don't import them with other books.

Quote:

Here comes the problem. Many books I have imported, while different, have the same tag. For example, I have found a lot of "UNREGISTERE CHP Professional" duplicates.

I'm not quite sure what you are saying, but if you are saying that you get duplicate titles, or other duplicate metadata, that often comes from importing books with bad internal metadata. I usually read metadata only from the filename, unless I'm sure that the books being imported have good internal metadata.

Quote:

You can add tags to groups of ebooks by selecting them and bulk adding a tag.

As time + date are already stored there is a simpler solution: sort the books by import date and then bulk add files.

It looks like you didn't understand what I was suggesting. I suggested that you add the books, sort them any way you want (by time added, author, etc.), select them and bulk edit the metadata to add the tags you want.

Giuseppe Chillem · 05-07-2010, 02:00 PM

Quote:

Originally Posted by Starson17

This is true, but it doesn't help you very much to know that it's 100% the same book. Maybe the user wants to add it anyway. I had lots of books that were CRC matched, but had different filenames. Each multiple author book was stored under both author names with the author name as part of the title. I wanted them added until I could edit the metadata and list the multiple authors on one copy, then delete the other.

It seems more useful to me to identify duplicates based on title and/or author, then ask. Most of my duplicates weren't 100% CRC duplicates anyway.

Believe me: if CRC32 + FILESIZE are the same for 2 books, these 2 books are identical! This is the technique used by all software that scans for duplicates on the hard drive.

There is only a 0.00000001% chance they are different. So "100%" is really 99.99999999%.

Giuseppe Chillemi
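
The percentage quoted in this post can be checked with a little arithmetic. The sketch below assumes that CRC32 values of unrelated files behave like uniformly random 32-bit numbers (an idealization) and ignores the extra size comparison, which can only lower the odds of a false match. The function names are illustrative, and the library-wide figure uses the standard birthday approximation.

```python
import math

CRC32_VALUES = 2 ** 32  # number of distinct 32-bit checksums


def chance_two_different_files_match():
    """Probability that two unrelated files share a CRC32 by accident."""
    return 1 / CRC32_VALUES


def chance_of_any_collision(file_count):
    """Birthday approximation: probability that at least one pair among
    file_count unrelated files shares a CRC32."""
    pairs = file_count * (file_count - 1) / 2
    return 1 - math.exp(-pairs / CRC32_VALUES)


print(f"one pair:     {chance_two_different_files_match() * 100:.9f}%")
print(f"1,041 files:  {chance_of_any_collision(1041) * 100:.4f}%")
print(f"47,000 files: {chance_of_any_collision(47000) * 100:.1f}%")
```

For a single pair the result is about 0.000000023%, the same order of magnitude as the figure above. Across a library of tens of thousands of files, the chance that some pair shares a CRC32 by accident is no longer negligible, which is why comparing the size as well matters.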

kovidgoyal · 05-07-2010, 02:22 PM

The problem isn't that files with the same hash will be different, the problem is that files with different hashes may be the same. For example they may have slightly different metadata or have stored annotations, and still be logically the same book.
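
A small illustration of this point, using made-up byte strings rather than real ebook files: the two "copies" below carry exactly the same text, but one has a stored annotation in its (hypothetical) metadata header, so their size and CRC32 no longer match.

```python
import zlib

# The "book" itself is identical in both copies.
book_text = b"Chapter 1. It was a dark and stormy night."

# Hypothetical metadata headers; real ebook formats store this differently.
copy_a = b"title=Example;annotations=none\n" + book_text
copy_b = b"title=Example;annotations=bookmark at page 3\n" + book_text


def fingerprint(data):
    """Return (size, crc32) for a bytes object."""
    return len(data), zlib.crc32(data)


print(fingerprint(copy_a) == fingerprint(copy_b))  # False: a hash check sees two files
print(copy_a.endswith(book_text) and copy_b.endswith(book_text))  # True: same book
```

A byte-level fingerprint can confirm that two files are the same, but it cannot confirm that two different files are different books.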

Starson17 · 05-07-2010, 03:03 PM

Quote:

Originally Posted by kovidgoyal

The problem isn't that files with the same hash will be different, the problem is that files with different hashes may be the same.

Exactly. During my importing process, I found ebooks I'd obtained from the Gutenberg Project years ago, and later versions of the same book from GP that had been edited to fix scanning errors. Many of those near-duplicates I'd originally obtained in format 1 had been converted to formats 2 and 3 in one of my mass conversion efforts, which produced more near-duplicates.

I'm not saying it's useless information to know which have the same hash, but Calibre can't use that information to automatically do anything for me. It will still have to ask what I want done. Sometimes if the hash matches, I want it added anyway (multiple author situations) and other times, even with hash differences, I don't want it added (it's the same book, but an earlier version without my bookmarks or with scanning errors not yet corrected).

Giuseppe Chillem · 05-07-2010, 04:19 PM

Quote:

Originally Posted by kovidgoyal

The problem isn't that files with the same hash will be different, the problem is that files with different hashes may be the same. For example they may have slightly different metadata or have stored annotations, and still be logically the same book.

You are right. Nice shot!

However, if you are in the early stage of book importing (for example, merging book collections) and you have not changed the metadata, with CRC32 + SIZE you have a 100% hit.

Thanks to your POV, I wish to change my request a little.

Here is the target scenario:

Calibre crashes during import. Part of the files have been imported. Some of these files have metadata equal to others (I have found some CHM files having the same "Generated by Unregistered Version"). If you discard duplicates, you discard false duplicates too. It does actually happen: I encountered this problem the very first time I used Calibre.

Here is the proposal:

A two-round check: the first is CRC32 + SIZE, the second is the current mechanism. This would give you 3 lists: 1) physical duplicates, 2) physical and metadata duplicates, 3) metadata duplicates.

Then you ask the user: DUPLICATES FOUND, what do you want to delete? "Same Physical Files; Same Physical Files + Metadata; Only Metadata; None"

What do you think about this proposal?

Giuseppe Chillemi
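
The two-round check proposed here could be sketched roughly as follows. This is an illustration only, not how calibre works: the `Book` record, the `metadata_key` comparison (exact title and author match, ignoring case and surrounding spaces) and the list names are all assumptions, and calibre's real title/author matching is more forgiving than this.

```python
from collections import namedtuple

# Hypothetical record: one entry per file, with its fingerprint and metadata.
Book = namedtuple("Book", "path size crc32 title author")


def metadata_key(book):
    """Simplified stand-in for a title/author comparison."""
    return (book.title.strip().lower(), book.author.strip().lower())


def classify(candidate, library):
    """Return which of the proposed lists a candidate file belongs to."""
    physical = metadata = False
    for existing in library:
        same_file = (existing.size, existing.crc32) == (candidate.size, candidate.crc32)
        same_meta = metadata_key(existing) == metadata_key(candidate)
        if same_file and same_meta:
            return "physical+metadata"
        physical = physical or same_file
        metadata = metadata or same_meta
    if physical:
        return "physical"
    if metadata:
        return "metadata"
    return "new"


def sort_into_lists(candidates, library):
    """Build the lists the user would be asked about, plus the new files."""
    lists = {"physical": [], "physical+metadata": [], "metadata": [], "new": []}
    for candidate in candidates:
        lists[classify(candidate, library)].append(candidate)
    return lists
```

With lists like these, the "false duplicates" from the crash scenario (different CHM files that all carry the same generated title) land in the metadata-only list, where the user can choose to keep them.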
