Quote:
Originally Posted by bigpallooka
So it is really just a merging of book records, book X & Y into book Z with the metadata of book Z being used for all 3 reducing the total number of books by 2? Or combining multiple book records under the metadata and entry number of a specifically chosen book?
I.E. Not a merge of books but a combining under the metadata of the chosen book?
When this occurs in the auto-merge option when adding multiple books are the newly added books combined into the original version contained in the library?
If you want to see exactly what it does, the code (for the manual merge at least) is in edit_metadata.py.
As a rough summary, using your example above: you click on Z first, then X, then Y, and do a merge. Z is the "master" record, so any metadata fields already populated in Z are retained. However, certain metadata fields present in X & Y that differ from the values in Z will be merged in sequence - first the data from X, then the data from Y if it adds anything new.
The list of fields merged includes title, authors, tags, cover, publisher, rating, series, comments and custom columns. For fields that can only contain a single value, like title, only the first value found in X & Y is used (and only when Z does not already have one). For fields that can contain multiple values, like tags, a complete union is built. The comments field is also compared, and if the incoming text differs it is appended.
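Very roughly, the precedence rules read like this (a plain-Python sketch of the rules as described above, not the actual edit_metadata.py code, and the field names are just for illustration):

Code:
def merge_metadata(master, others):
    # Sketch only: merge metadata dicts from the duplicates into the master,
    # in click order (X first, then Y).
    single_valued = ("title", "authors", "publisher", "rating", "series", "cover")
    for other in others:
        for field in single_valued:
            # Single-value fields: the first non-empty value found wins,
            # and only if the master does not already have one.
            if not master.get(field) and other.get(field):
                master[field] = other[field]
        # Multi-value fields like tags: build a complete union.
        master["tags"] = sorted(set(master.get("tags", [])) | set(other.get("tags", [])))
        # Comments: append the duplicate's text if it differs from the master's.
        comments = other.get("comments")
        if comments and comments != master.get("comments"):
            if master.get("comments"):
                master["comments"] = master["comments"] + "\n" + comments
            else:
                master["comments"] = comments
    return master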
The same logic applies to the book formats. Any formats found in X & Y that already exist in Z will be deleted. If X & Y both have a new format that Z does not, the one from X will be used and Y's thrown away.
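The format handling boils down to something like this (again only a sketch of the behaviour, not calibre's code):

Code:
def merge_formats(master_formats, duplicate_formats_in_order):
    # master_formats: e.g. {"EPUB": "/path/z.epub"}; duplicates in click order (X, then Y).
    kept = dict(master_formats)
    discarded = []
    for formats in duplicate_formats_in_order:
        for fmt, path in formats.items():
            if fmt in kept:
                discarded.append(path)   # format already present: thrown away
            else:
                kept[fmt] = path         # new format: the first copy found is adopted
    return kept, discarded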
Personally I don't like certain aspects of the current behaviour and, like you, would want some finer control - but someone has to write the code to do it. Given the complications and permutations, I suspect the only solution that would keep all of us happy is a dialog that displays the data from all the merge sources and lets the user pick what they want from each. The metadata sources unfortunately frequently contain utter rubbish: poor quality covers, incorrect publish dates, non-existent or garbage data, foreign language summaries, unrelated series, ridiculous tags etc. So I have to do a huge amount of manual editing and research to get my "Z" master exactly how I like it. Then, when I merge, I don't want anything but formats to be merged 99% of the time.
For instance I have three common reasons for merging:
(1) It is a duplicate newly added (no metadata) record containing only new format(s) that I want to add to my master record.
(2) It is a duplicate newly added (no metadata) record containing format(s) that overlap with those already in my master record, so I have to choose which copy to keep.
(3) It is a duplicate record found at a later date where both have metadata but some differing formats to merge.
Scenario (1) is handled pretty well by the existing behaviour. Where it is not quite perfect for me is that if you automatically add tags to newly added records (like "00New") then you have to clear these out before you merge. In this situation I want to merge nothing but formats into the master.
Scenario (2) can be a very painful process. Say Z has an EPUB, and X has an EPUB and a MOBI format. You have to individually open each EPUB first to compare and decide which you want to keep (remembering which row was which EPUB if you put them side by side!).
If the EPUB from Z is better, then it is as "perfect" as above - i.e. fine apart from the 00New tag merge. If, however, you prefer the EPUB from X, then you have three choices:
(2a) Remove the EPUB format from Z, then merge X into Z.
(2b) Delete Z and completely populate the metadata for X from scratch
(2c) Merge Z into X
I find (2a) is usually the desired choice, as it will not overwrite any data from Z (apart from the 00New tag handling). The "tricky" part is the comparisons, and making sure you remove the format from the right row and merge in the correct order. There is no "undo" in Calibre, so expect a visit to the recycle bin when you screw up.
For me (2b) is only an option if I haven't yet set up my metadata for Z. Otherwise it is too much of a gamble on the work required to get the data right for X.
I used to use (2c), but it suffers from a flaw: the publish date is not merged. I suspect the reason is that the (poor, imho) decision not to use nullable dates means the application cannot determine that X does not actually have a publish date set, and that it should therefore overwrite it with the value from Z. I need to look into this, but I wonder whether the placeholder value would match the "Date" column for newly added records, and based on that it could decide that a different value is safe to merge in? I so wish Calibre had nullable dates - I would much rather have blanks displayed where no publish date is known.
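Roughly the kind of check I mean, as a sketch only - the names are hypothetical, and whether calibre actually stores a comparable placeholder value would need checking in the source:

Code:
from datetime import timedelta

def pubdate_looks_unset(pubdate, date_added, tolerance=timedelta(days=1)):
    # Hypothetical heuristic: a publish date that is missing, or suspiciously
    # close to the record's "Date" (added) value, is treated as never having
    # been set by the user.
    return pubdate is None or abs(pubdate - date_added) <= tolerance

def merged_pubdate(master_pubdate, master_date_added, incoming_pubdate):
    # If the master's (X's) publish date looks like a placeholder, take the
    # value from the record being merged in (Z); otherwise leave it alone.
    if pubdate_looks_unset(master_pubdate, master_date_added) and incoming_pubdate:
        return incoming_pubdate
    return master_pubdate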
Finally there is scenario (3), which can be a real problem. First you have to go through and decide which formats win, as per (2), and settle on the remove-format/merge/delete strategy. However, if your titles or authors vary slightly, the two rows can end up with different metadata. For instance, the comments fields may differ only by some line feeds - but Calibre will combine them when you merge, and you end up with double summaries.
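A bit of normalisation before the comparison would avoid most of those double summaries - something along these lines (just a sketch of what I would like it to do, not what calibre currently does):

Code:
import re

def comments_equivalent(a, b):
    # Treat two comments as the same if they differ only in whitespace, line
    # feeds or simple HTML breaks, so the merge does not append a near-duplicate.
    def normalise(text):
        if not text:
            return ""
        text = re.sub(r"<(br|p|/p)\s*/?>", " ", text, flags=re.I)
        return re.sub(r"\s+", " ", text).strip().lower()
    return normalise(a) == normalise(b)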
The above describes the manual merging. Then you have the auto merging. I ended up writing my own program which does some "pre-processing" of all the files I am adding: it queries the Calibre database to divide the files into subdirectories and, in the case of duplicate formats, renames them. If a file is a new format of an existing Calibre book, the Calibre auto-merge works great and saves a lot of manual merging effort. However, if it is a duplicate format of an existing book, it needs manual inspection to decide which wins, and you don't want Calibre to just automatically discard it. I think it would be better if the auto-merge behaviour gave an option to only auto-merge new formats, but create duplicate rows (marked with a tag perhaps) where you have the same format.
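For what it's worth, the core of that pre-processing is nothing more sophisticated than this (a heavily reduced sketch - my real version does a lot more, and the search expression and directory names here are only examples):

Code:
import json, shutil, subprocess
from pathlib import Path

def classify_incoming(book_file, title, author):
    # Ask calibredb whether the book already exists and, if so, which formats it has.
    query = f'title:"={title}" and author:"={author}"'
    out = subprocess.run(
        ["calibredb", "list", "--for-machine", "--fields", "formats", "--search", query],
        capture_output=True, text=True, check=True,
    ).stdout
    matches = json.loads(out) if out.strip() else []

    incoming_ext = Path(book_file).suffix.lstrip(".").upper()
    if not matches:
        dest = Path("incoming/new_books")             # brand new: add normally
    else:
        existing = {Path(p).suffix.lstrip(".").upper()
                    for rec in matches for p in rec.get("formats", [])}
        if incoming_ext not in existing:
            dest = Path("incoming/new_formats")       # safe to auto-merge
        else:
            dest = Path("incoming/duplicate_formats") # needs manual inspection
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy2(book_file, dest)                     # renaming/flagging omitted
    return dest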
I have the same sorts of issues with the metadata download behaviour - in a perfect world you would have more granular control over which data gets overwritten. Unfortunately I don't have the Python coding skills to help improve it.