Old 06-25-2020, 06:56 AM   #736
BetterRed
null operator (he/him)
Posts: 21,740
Karma: 30237526
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@davidfor & dunhill - all done, I've removed the attachments from your posts.

take care

BR
Old 07-19-2020, 06:11 AM   #737
capink
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
Edit: This feature (metadata variation custom column support) is now added to the advanced mode. Look at next posts for more details.

Last edited by capink; 08-19-2020 at 05:28 AM. Reason: removing link - feature now part of advanced mode added in later posts
Old 08-09-2020, 11:17 AM   #738
capink
Here is my latest update to this plugin. It adds an Advanced Mode to the plugin. Before detailing the features of the advanced mode, I'd like to underline a couple of things:
  • The new update retains the old functionality of the plugin without modifying it at all. The advanced mode has separate dialogs, and even inside the plugin zip it is totally separated in its own directory (except for the few lines that create the menu entries).
  • The whole point of this update is to make the plugin more flexible and extensible. I don't currently use some of the features, but I tried to make the design as flexible as possible to meet any future needs.

Now, here is a list of the advanced mode features:
  1. The advanced mode, like the original plugin, has three dialogs: Book duplicates dialog, Cross library dialog, Metadata variations dialog.
  2. You can match by any column you want, whether standard or custom.
    • For book duplicates and cross library dialogs you can use any custom or standard column you want.
    • For Metadata variations you can use any standard or custom column as long as its datatype is text or series.
  3. Composite columns are supported in both book and cross library dialogs.
  4. There is no limit on how many columns you can match by. You are no longer bound by two columns. This is done in the spirit of making the plugin more flexible. However, I have not matched by more than three columns.
  5. Support for user defined algorithms through templates. You can use templates directly in the plugin itself, or indirectly through composite columns. We will show examples of how to do both later in this post. The most notable examples below are:
    • matching by alias instead of author
    • adding language specific modifications to the matching algorithm.
  6. The ability to use more than one matching algorithm for each column. This way you can supplement the builtin matching algorithms with your tailor made algorithms. We will show examples for this below, starting with the simple ones and moving gradually into more complex examples.
  7. Sort dialog enabling you to control how the books are sorted within the duplicate groups. This is complemented with marking the first and last book of each group. You can read more on this in this post.


Before you read examples on using builtin and user-defined template functions, you should visit this page for general info about templates in Calibre.

You do not have to know any programming language to use Calibre's builtin templates. User-defined template functions, however, are written in Python. You can use template functions written by others, or use any of the template functions in the examples below. They are written in a way that makes them easy to use and modify for a variety of contexts.

Example No. 1: Replicate original plugin behavior in the advanced mode (Similar title, Soundex author)

Spoiler:

This example is used to familiarize ourselves with the advanced mode. We will show more advanced uses in later examples.
  • First, from the Find duplicate menu select: Advanced Mode > Find Book Duplicates... (see attachment 1)
  • In the first match rule, select title field from the combobox (see attachment 2)
  • Press the add algorithms button to add the similar match algorithm (see attachment 3)
  • Press add match rules, and in the new match rule repeat the previous steps, only this time choosing the author field and the "soundex match". When you add the soundex algorithm, you will notice a settings button appearing next to it; click it to change the soundex length to 8. (see attachment 4)
  • Finally, click OK to start the duplicate search. After the plugin completes its search, you will be greeted with the familiar message detailing how many duplicates were found. The duplicates will also be listed in the main view exactly as in the normal mode.

Note: As soon as you choose the authors field from the combobox, you will notice the checkbox "match any of the items" appearing beneath the field. This should be left checked in most cases. We will explain what it does later.
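
To give a feel for what the soundex length setting changes, here is a minimal sketch of a classic Soundex encoder with a configurable code length. It is not the plugin's own code, which may differ in detail; the function name `soundex` and its `length` parameter are assumptions made for illustration.

```python
def soundex(name, length=4):
    """Classic Soundex code for `name`, padded or cut to `length` characters."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter
        if code and code != prev:
            result += code
        # 'h' and 'w' do not separate two letters that share a code
        if ch not in "hw":
            prev = code
    return (result + "0" * length)[:length]


print(soundex("Dostoyevsky", 8), soundex("Dostoevsky", 8))
```

A longer code keeps more consonants, so a length of 8 is stricter than the traditional 4 and produces fewer false matches between long names that merely start alike.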



Example No. 2: Use a builtin template to substitute the plugin matching algorithms

Spoiler:

This is another relatively simple example used to familiarize ourselves with templates. Here we will use the builtin template "transliterate". This template will remove diacritics from text, so that the following two books can match:
Code:
Les Misérables
Les Miserables
The plugin's "similar" match algorithm can already do this, but it also does other things, such as removing title articles, so we will substitute it with this simpler, slightly less aggressive template.
  • In the first match rule choose the title field from the combobox like we did in the previous example.
  • In the add algorithm dialog, instead of choosing one of the plugin's algorithms, click the add template button in the upper right corner, and type the following template in the template dialog (see attachment 5):

    Code:
    {title:transliterate()}
    Also to make the match case insensitive we will add this second template:

    Code:
    {title:lowercase()}
  • Repeat the same step for the authors field, but modify the templates to look like this:

    Code:
    {authors:transliterate()}
    {authors:lowercase()}
  • Press OK to proceed with the matching.
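
Outside Calibre, the combined effect of the two templates can be approximated with the standard library. This is only a sketch: Calibre's transliterate function uses its own conversion, and the helper name `match_key` is an assumption made for illustration.

```python
import unicodedata


def match_key(text):
    """Approximate {..:transliterate()} followed by {..:lowercase()}."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.lower()


print(match_key("Les Misérables") == match_key("Les Miserables"))
```

Two books are treated as duplicates when every match rule yields the same key for both of them, which is why the two titles above end up in the same group.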



Example No. 3: How to use custom (user defined) template functions to add language specific matching.

Spoiler:

In this example we will use a user-defined template function to add language specific modifications to our match algorithm. Suppose I want to remove French articles to improve the match algorithm for French books. Here is a general-purpose template function that removes selected words, which we will use for this purpose:

Code:
def evaluate(self, formatter, kwargs, mi, locals, val, col_name, transliterate=True):
    import re
    asci = lambda x: x
    if transliterate:
        from calibre.utils.filenames import ascii_text as asci
    # Whole words to drop, and prefixes to strip from the remaining words
    ar = ['le', 'la', 'les', 'au', 'aux', 'du', 'des']
    ar_rep = ['^l’']
    stop_words = set(asci(x) for x in ar)
    r = re.compile(r'({})'.format('|'.join([asci(x) for x in ar_rep])), re.I)
    # Separator of multi-valued columns (e.g. authors); empty for single-valued ones
    SEP = mi.metadata_for_field(col_name)['is_multiple'].get('list_to_ui', '')

    new_val = []
    if SEP:
        val_ = val.split(SEP)
    else:
        val_ = [val]
    for item in val_:
        tokens = [asci(tok.lower()) for tok in item.split()]
        tokens = [tok for tok in tokens if tok not in stop_words]
        tokens = [r.sub('', tok) for tok in tokens]
        new_val.append(' '.join(tokens))
    return SEP.join(new_val)
Note: I do not know French. I got these words from a Google search; modify the list of words to remove in this template to suit your own needs.
  • First, we have to add this template to Calibre by going to preferences > templates (see attachment 6)

    Copy the code above into the program's code box, and make sure the other settings for the template look like this:
    Code:
    Function: remove_tokens
    Argument count: 2
    Now, press create to permanently add this template to calibre.
  • For the first match rule select the title field from the combobox.
  • Add this template in the algorithm dialog:

    Code:
    {title:remove_tokens(title)}
    We already covered how to enter templates in the algorithms dialog in the previous example no. 2.
    And if you want, you can add another algorithm (Soundex, Similar or Fuzzy) to do more processing to the result of the first template.
  • Repeat the same for authors, but modify the template to be:

    Code:
    {authors:remove_tokens(authors)}

NOTE: The template defined above can be used for removal of other unwanted words not just language specific words. You can modify the list of words to suit your own needs. For example, the original plugin's "similar" algorithm removes words like: omnibus, anthology, edition, paperback, hardcover ...etc. If you think of any similar words to remove, you can modify the above template to achieve this.


NOTE: You can further refine the previous template by making it use conditional matching. Say there are certain actions that make sense in one language, but are counter productive in another. Since templates can see the whole book metadata, we can get the book language from the language field, and make actions based on language. We will see an example of conditional matching later. (See examples 6 & 7)
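
If you want to try out a word list before adding it to Calibre, the token removal logic can be exercised on plain strings. The sketch below is a standalone approximation, not the template function itself: `to_ascii` is a stdlib stand-in for Calibre's `ascii_text`, and the `stop_words`, `prefixes` and `sep` parameters are assumptions made for illustration.

```python
import re
import unicodedata


def to_ascii(text):
    # Stand-in for calibre's ascii_text: strip diacritics only
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_tokens(val, stop_words=("le", "la", "les", "au", "aux", "du", "des"),
                  prefixes=("^l’",), sep=""):
    stops = {to_ascii(w) for w in stop_words}
    prefix_re = re.compile("|".join(to_ascii(p) for p in prefixes), re.I)
    items = val.split(sep) if sep else [val]
    cleaned = []
    for item in items:
        tokens = [to_ascii(tok.lower()) for tok in item.split()]
        tokens = [tok for tok in tokens if tok not in stops]
        tokens = [prefix_re.sub("", tok) for tok in tokens]
        cleaned.append(" ".join(tokens))
    return sep.join(cleaned)


print(remove_tokens("Les Misérables"))
print(remove_tokens("L’Étranger"))
```

Passing `sep=" & "` mimics a multi-valued column such as authors, where each name is cleaned separately and the names are joined again afterwards.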



Example No. 4: How to match by alias instead of author.

Spoiler:

This is one of the most interesting examples on how to use templates to find duplicates. It assumes that you are already using user categories in Calibre to set up author aliases as this post by @BetterRed suggests (which is further illustrated here in this attachment).

After creating your aliases as explained above, we will use the template below, which substitutes the author name with the alias if it finds one in the user categories; otherwise it returns the author name unchanged. We could use this template directly as in previous examples, but this time it makes sense to use it to create a composite column containing aliases, which can be used for other things besides duplicate finding, like searching in Calibre by alias.
  • Add the code below to Calibre templates (we have already explained how to do this in example No. 3).

    Code:
    def evaluate(self, formatter, kwargs, mi, locals, val, col_name, user_cat_prefix):
        # Only proceed when mi exposes _proxy_metadata; otherwise return
        # the value unchanged.
        if not hasattr(mi, '_proxy_metadata'):
            return val
        SEP = mi.metadata_for_field(col_name)['is_multiple'].get('list_to_ui', '')
        if SEP:
            val_ = val.split(SEP)
        else:
            val_ = [val]
        # Map every item filed under a matching user category to its alias,
        # i.e. the category name with the prefix removed.
        prefix_length = len(user_cat_prefix)
        aliases = {}
        for user_cat, v in mi.user_categories.items():
            if user_cat.startswith(user_cat_prefix):
                for user_cat_item, src_cat in v:
                    if src_cat == col_name:
                        aliases[user_cat_item] = user_cat[prefix_length:]
        # Replace each item with its alias (if any), dropping repeats.
        # Building a new list avoids modifying val_ while looping over it.
        new_val = []
        for item in val_:
            repl = aliases.get(item, item)
            if repl not in new_val:
                new_val.append(repl)
        return SEP.join(new_val)
    The rest of the template settings should look like this:
    Code:
    Function: replace_with_category
    Argument count: 3
    Now, press create button to add the code to Calibre's templates.
  • Create a new composite column with these exact settings (see attachment 8):

    Code:
    Lookup name: alias
    Column heading: alias
    Column type: Column built from other columns, behaves like tags
    Template: {authors:replace_with_category(authors,Authors.Alias.)}
    You will have to restart Calibre for the new column to be effective.

    Note: If you don't choose "Column built from other columns, behaves like tags", the duplicate search will not work.

    Note: If you decide to use a different structure for your user categories, you have to replace Authors.Alias. with whatever user category hierarchy you are using.

    Note: In simple template mode, spaces are significant. Don't add any space after the comma.
  • Now, to start matching by alias choose the title field and whatever algorithm you want (I will choose similar match for the title).
  • Instead of using the authors column we will use the newly built alias column. As soon as you choose this column, a new "contains names" checkbox appears (see attachment 9); you have to check it for this duplicate search to work.

    This setting tells the plugin to split contents of the alias column using "&" as it does with the authors.

    Note that we don't need this option when dealing with the authors column, or any non-composite custom column, as the plugin can tell from their metadata that they contain names and acts accordingly. Unfortunately, Calibre does not give us an option when we create a composite column to indicate that it contains names, so we have to use this option to tell the plugin what to do.
  • Now choose an algorithm you want for the alias column (you can choose identical here) and click OK to start the find duplicate search.

Note: The above template can be used for purposes other than aliases. Any user categories you apply for series or authors (like nationality) can be added to a composite column appearing in books.
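
The alias substitution can also be tried on plain data before setting anything up in Calibre. The sketch below is a standalone approximation: the `user_categories` dictionary mimics the shape the template function iterates over (category name mapped to pairs of item and source column), and the sample authors and category names are made up for illustration.

```python
def replace_with_category(value, col_name, prefix, user_categories, sep=" & "):
    # Build item -> alias from every category whose name starts with prefix
    aliases = {}
    for cat_name, members in user_categories.items():
        if not cat_name.startswith(prefix):
            continue
        for item, src_col in members:
            if src_col == col_name:
                aliases[item] = cat_name[len(prefix):]

    # Substitute aliases, keeping the original order and dropping repeats
    result = []
    for item in value.split(sep):
        replacement = aliases.get(item, item)
        if replacement not in result:
            result.append(replacement)
    return sep.join(result)


# Hypothetical user categories, for illustration only
CATEGORIES = {
    "Authors.Alias.Stephen King": [("Richard Bachman", "authors")],
    "Authors.Nationality.French": [("Victor Hugo", "authors")],
}

print(replace_with_category("Richard Bachman & Peter Straub",
                            "authors", "Authors.Alias.", CATEGORIES))
```

Changing the prefix to `Authors.Nationality.` in this sketch would fill the result with nationalities instead of aliases, which is the reuse described in the note above.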



Example No. 5: How to use "match any of the items" option

Spoiler:

As we have noticed before, as soon as we choose a column that contains multiple items, like authors or tags, a "match any of the items" checkbox appears. In most cases this option should be checked, as it makes sure that a book with multiple authors (or items) will match other books containing at least one similar author or item, and not necessarily all of them. (This is the default behavior of the plugin in the normal mode.)

To illustrate this point let us look at these five books:

Code:
title: Brothers Kramazov | authors: Fyodor Dostoyevsky & David Mcduff
title: Brothers Kramazov | authors: David Mcduff & Fyódor Dostoyévsky
title: Brothers Kramazov | authors: Fyodor Dostoyevsky & David Mcduff (trans.)
title: Brothers Kramazov | authors: Fyodor Dostoyevsky & Larissa Volokhonsky
title: Brothers Kramazov | authors: Fyodor Dostoyevsky
If the "match any of the items" option is checked, and we use the similar algorithm, all books will match because they have at least one common author: Fyodor Dostoyevsky. The similar match algorithm will take care of slight differences like diacritics.

If you want to match only the books that share all authors (or items) and not just one of them, you have to un-check the option "match any of the items". When you run the search this time with this option unchecked, you will notice the following:
  • Only the first two books match. They share all the authors, not just one. Note that they have their authors in different order. This does not matter for the plugin, as it acts on each author separately and then concatenates them after sorting alphabetically.
  • The last two books predictably failed to match, because they do not share all the authors with any of the other books.
  • The third book surprisingly failed to match even though it shares all the authors with the first two. This is because the similar algorithm does not remove the text in parentheses for authors (it does for the title though).

Since having author roles like translators or editors enclosed in parentheses as part of the author name is a common occurrence, we can get around this problem by simply adding the following template before the similar match for the author column:
Code:
{authors:re(\(.+\),)}
This uses the builtin template function re to remove anything enclosed in parentheses (it is replaced with an empty string).
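
The difference between the two settings can be sketched with sets of normalized author names. This is an illustration under simplifying assumptions, not the plugin's code: `normalize` only strips diacritics and case, and the `strip_roles` flag plays the part of the template above.

```python
import re
import unicodedata


def normalize(name):
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(plain.lower().split())


def author_keys(authors, strip_roles=False):
    keys = set()
    for name in authors.split("&"):
        if strip_roles:
            name = re.sub(r"\(.+\)", "", name)  # same pattern as the template
        keys.add(normalize(name))
    return keys


def match_any(a, b, strip_roles=False):
    # "match any of the items" checked: one shared author is enough
    return bool(author_keys(a, strip_roles) & author_keys(b, strip_roles))


def match_all(a, b, strip_roles=False):
    # Option unchecked: the full sets of authors must be equal
    return author_keys(a, strip_roles) == author_keys(b, strip_roles)


BOOKS = [
    "Fyodor Dostoyevsky & David Mcduff",
    "David Mcduff & Fyódor Dostoyévsky",
    "Fyodor Dostoyevsky & David Mcduff (trans.)",
    "Fyodor Dostoyevsky & Larissa Volokhonsky",
    "Fyodor Dostoyevsky",
]

print(match_all(BOOKS[0], BOOKS[2]), match_all(BOOKS[0], BOOKS[2], strip_roles=True))
```

Comparing sets makes the author order irrelevant, which mirrors the sort-then-concatenate behaviour described above.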





Example No. 6: Using a user-defined template function for conditional matching

Spoiler:

One of the plugin's matching algorithms (fuzzy match) removes subtitles from book titles. So if a book has the following title:

Code:
Flow Down Like Silver: Hypatia of Alexandria
The algorithm will make it look like this before matching:

Code:
Flow Down Like Silver
This makes sense in some cases. But sometimes the subtitle contains the book name. Consider for example a book with this title:

Code:
Discworld: The Color of Magic
Here, the subtitle is the actual book title. So before we act on books with the plugin's builtin algorithm, we will write a template to deal with this situation. The code for the template is below:

Code:
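def evaluate(self, formatter, kwargs, mi, locals, val):
    # Titles without a colon have no subtitle to consider
    if val.find(':') == -1:
        return val
    title, sep, subtitle = val.partition(':')
    series = mi.series
    if not series:
        return val
    # The part before the colon is the series name: keep only the rest
    if title.strip().lower() == series.strip().lower():
        return subtitle.strip()
    return val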


This template function (saved as remove_series_from_title, the name used in the template below) checks whether the book has a series in the series column. If the series matches the part of the title before the colon, it removes the series name from the title before handing the sanitized title to the plugin's builtin algorithm.

Now we match by entering the following template in the title algorithm dialog:

Code:
{title:remove_series_from_title()}
After this we add the plugin's fuzzy match to act on the sanitized name.

The template above should theoretically work, but it has one serious flaw: it only works if the series name in the title exactly matches the series obtained from the series field, letter for letter. Any slight variation and it will fail. So, we will further improve this template in the next example to address this.
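
Both the intended behaviour and the flaw can be seen by running the same logic on plain strings. In this sketch the series is passed in as an argument instead of being read from the book metadata; that signature is an assumption made for illustration.

```python
def remove_series_from_title(title, series):
    if ":" not in title or not series:
        return title
    head, _, subtitle = title.partition(":")
    # Exact, case-insensitive comparison, as in the template function
    if head.strip().lower() == series.strip().lower():
        return subtitle.strip()
    return title


print(remove_series_from_title("Discworld: The Color of Magic", "Discworld"))
print(remove_series_from_title("Discworld: The Color of Magic", "Disc World"))
```

The second call returns the title untouched because "Disc World" is not letter for letter equal to "Discworld", which is exactly the weakness described above.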



Example No. 7: How to use the plugin's builtin functions inside our templates

Spoiler:

To deal with the flaw in the template in the previous example, we will import a function from the find duplicate plugin that helps us match series names with slight variations in them. We will modify the template to look like this:

Code:
def evaluate(self, formatter, kwargs, mi, locals, val):
    if val.find(':') == -1:
        return val
    title, sep, subtitle = val.partition(':')
    series = mi.series
    if not series:
        return val
    # Compare the normalized forms produced by the plugin's own series algorithm
    from calibre_plugins.find_duplicates.matching import similar_series_match
    if similar_series_match(title) == similar_series_match(series):
        return subtitle.strip()
    return val



Example No. 8: Tag management using similar algorithm and builtin templates:

Spoiler:


In this example we will use the metadata variations dialog to de-clutter our tags by getting rid of duplicate tags. We will use the similar match algorithm, which already does a good job of finding duplicates, and enhance it with templates to do an even better job.

The advanced metadata variations dialog is used in exactly the same way as the Find Book Duplicates dialog; the only difference is that we have a single match rule for a single column.

To start this example, we do the tag match using the similar match as we have done before. Doing this on my test library reveals some deficiencies that need to be addressed:
  1. The following pairs of tags do not qualify as duplicates:

    Code:
    analytics
    analytic
    Code:
    budget
    Budgeting
    Code:
    Cartooning
    cartoons
    To address this, we will add a template that uses builtin functions to remove 's' and 'ing' from the end of words in tags, so that the tags above can match. To do this we add the following template to the similar match algorithm (As we have demonstrated before):

    Code:
    {tags:re((e?s\b|ing\b),)}
  2. The similar match can match hierarchical tags regardless of separator, like the pair below:

    N.B. There is one separator that makes this fail, we will discuss it at the end and see how to correct it.

    Code:
    Fiction.Thrillers.Suspense
    Fiction ::: Thrillers ::: Suspense
    This is really useful, but it still leaves a lot to be desired. For example the following pairs of tags fail to match:

    Code:
    Crime::Mystery::Thriller
    Thrillers.Crime.Mystery

    Code:
    Crime & mystery
    Mystery & Crime
    The two pairs above have a different sort order, which the builtin similar match does not account for. We will correct this by using the template below, which sorts the tags before matching them:

    Code:
    {tags:list_sort(0, )}
    Note: space is used as list separator in the above template. We will explain why in the next point.

    Note: We add the above template to the similar match + the template we added before to match plural "thrillers" with singular "thriller".


  3. Even after adding the previous template, there is one case when hierarchical tags fail to match. Out of the three tags below, the first two match, while the third fails to match:

    Code:
    Thrillers.Crime.Mystery
    Crime / Mystery / Thriller
    Crime/Mystery/Thriller
    The problem here is not the sort order, which was taken care of in the previous point. The problem is that the slash is not processed like other separators by the similar match algorithm. To understand this better we need to know how the similar algorithm works, which is explained briefly below:

    Quote:
    The similar algorithm does four things:
    1. It removes some special characters.
    2. It replaces some other characters with a space.
    3. It concatenates multiple adjacent spaces into single one.
    4. It converts all characters to ascii lower case characters.

    Most separators (like dots and colons) are replaced with a space. The slash however, is removed without being replaced by a space. So, applying the rules above the tags will evaluate as follows:

    The first tag will evaluate to:

    Code:
    thrillers crime mystery
    The second will evaluate to:

    Code:
    crime mystery thriller
    The third will evaluate to:

    Code:
    crimemysterythriller
    The first two have the following differences:
    • One of them has the plural form "thrillers". The first template we wrote takes care of that.
    • They have a different sort order. The second template takes care of that as well.

    So, the first two will match.

    If you want the slash to be treated like other separators, you will have to add this template before the similar match acts on the tags:

    Code:
    {tags:re(/, )}
    The above template replaces any slash with a space.

    Note: The order here is important. You must add this template before the similar match algorithm. If you put it after it, it will not have any effect.
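
The whole tag pipeline from this example can be approximated in a few lines. This is a rough sketch and not the plugin's code: `similar_key` is a stand-in for the four steps quoted above with guessed character sets, and the ordering (slash replacement and suffix stripping before the similar match, sorting after it) is one reading of the notes in this example.

```python
import re


def similar_key(text):
    """Rough stand-in for the similar algorithm (character sets are guesses)."""
    text = re.sub(r"[/'\"]", "", text)        # 1. remove some special characters
    text = re.sub(r"[.:&,;_-]", " ", text)    # 2. replace others with a space
    text = re.sub(r"\s+", " ", text).strip()  # 3. collapse runs of spaces
    return text.lower()                       # 4. lower case (ASCII folding omitted)


def tag_key(tag):
    tag = re.sub(r"/", " ", tag)                 # {tags:re(/, )}
    tag = re.sub(r"(e?s\b|ing\b)", "", tag)      # {tags:re((e?s\b|ing\b),)}
    key = similar_key(tag)                       # the similar match itself
    return " ".join(sorted(key.split(" ")))      # {tags:list_sort(0, )}


for tag in ("Thrillers.Crime.Mystery", "Crime / Mystery / Thriller", "Crime/Mystery/Thriller"):
    print(tag_key(tag))
```

Without the first substitution the third tag collapses into a single word, which is why the slash template has to run before the similar match.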



TIP: Since you are no longer bound only by the mandatory title and author columns, you might have a situation where you exclusively use custom columns for matching. These columns can have no values in a lot of cases. So if you are matching books based on custom_column1 and custom_column2, and one of them has no value for certain books, you are effectively matching those books based on the column that has a value alone.

This situation can be avoided by using a virtual library as follows:

In the search bar type a search like this:

Code:
#custom_column1:true #custom_column2:true
And make a virtual library out of the above search by pressing (Ctrl + Shift + *). Now you can open find duplicates and it will only include the books that have values for both columns.
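
The effect of the virtual library can be pictured as a filter applied before grouping. The sketch below uses made-up book records and a naive exact-match grouping; it illustrates the idea and is not how the plugin stores or compares books.

```python
from collections import defaultdict

COLUMNS = ("#custom_column1", "#custom_column2")

# Hypothetical sample data: books 1 and 2 have no value in the second column
BOOKS = [
    {"id": 1, "#custom_column1": "A", "#custom_column2": ""},
    {"id": 2, "#custom_column1": "A", "#custom_column2": ""},
    {"id": 3, "#custom_column1": "A", "#custom_column2": "X"},
    {"id": 4, "#custom_column1": "A", "#custom_column2": "X"},
]


def has_all_values(book, columns=COLUMNS):
    # Equivalent of the search "#custom_column1:true #custom_column2:true"
    return all(book.get(column) for column in columns)


def duplicate_groups(books, columns=COLUMNS):
    groups = defaultdict(list)
    for book in books:
        groups[tuple(book.get(column, "") for column in columns)].append(book["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


print(duplicate_groups(BOOKS))
print(duplicate_groups([b for b in BOOKS if has_all_values(b)]))
```

In the unfiltered run, books 1 and 2 are grouped on the first column alone; after filtering, only the pair that agrees on both columns remains.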

Update: Using filters to sort results. see this post for details.



Final Notes:
  • In the normal mode, the plugin provides an author-only algorithm; this was probably added before the metadata variations feature existed. The advanced mode does not support an author-only algorithm, as it is better to use the metadata variations dialog for this kind of search. You can use a match rule containing only the author field, but it will use the same algorithm.
  • Using templates (either builtin or user-defined) slightly affects performance. This happens because whenever templates are used, the plugin must fetch the metadata object for every book, because that's how templates work. So the biggest performance hit happens when you add the first template. Adding other templates will not affect performance as much as the first template does.

    In my testing on an average laptop, this adds about 1 second per 1000 books, so it should hardly be noticeable on most libraries. If you have a huge library (tens of thousands of books), it will take more time to process duplicates, but even then it usually finishes in under a minute.
  • The normal mode has an option to add language to the title. This option is not needed and thus removed from the advanced mode, since you can add a match rule containing the language column.
  • There are some situations where the order of the algorithms matters (look at examples 6 & 7), which is why we have buttons to move the algorithms up and down.
  • I have no use for the cross library duplicate search; I added it because most of the work was already done in the book duplicate dialog. So this is the least tested part of the new updates.
  • Whenever you enter a template directly into the plugin, it evaluates the template and tries to catch any error and prevent the user from proceeding if the template is not valid. I tried to cover all possible errors, but I have yet to find a reliable way to make the template either produce a valid result or fail. Even with unsafe_format, it still produces errors without raising exceptions.

    This should not be a big problem; if it ever happens, it will lead to some false positives. The real concern for me is someone using only templates in a cross library match. If all templates fail, producing the same error message for every book, we might end up with tens of thousands of books in one library matching tens of thousands of books in the target library, which might freeze your PC. So it is better to test your templates and make sure they are working if you are using them exclusively in a cross library duplicate search.

Acknowledgements
  • Thanks to Kovid and the rest of the Calibre team for creating the most well designed, flexible piece of software I've come across.
  • Thanks to kiwidude for creating this plugin, and also for his other awesome plugins. I got much value from them and they made Calibre an even better program. And when I later started to work on his code, I learned more from it than from any other resource.
  • Thanks to chaley for creating templates and other interesting features in Calibre.
  • Thanks to davidfor for maintaining this plugin as well as kiwidude's other plugins. The same goes for JimmXinu.
  • Thanks to BetterRed for his idea on how to use user categories to add author pen names.
Attached Thumbnails: 1.jpg, 2.jpg, 3.jpg, 4.jpg, 5.jpg, 6.jpg, 7.jpg, 8.jpg, 9.jpg

Last edited by capink; 02-03-2022 at 08:13 AM. Reason: Adding link for new updates
Old 08-14-2020, 05:56 AM   #739
capink
Version 1.7.5
  • Fix: Change the way the advanced mode deals with algorithms that generate an additional reverse hash (similar, soundex), so they fit better with multiple algorithms working together when the option "match any of items" is turned off.
  • Update: Update Spanish translation. Thanks to @dunhill.
  • Remove the previously added custom column support in the Find metadata variations dialog as it is now included in the advanced mode.

Also, the previous post (#738) has been updated with a TIP on matching exclusively by custom columns and using a virtual library to exclude books with empty values.

Last edited by capink; 10-08-2020 at 10:08 AM. Reason: remove attachment. newer version available
Old 08-18-2020, 09:41 AM   #740
capink
Updated post 738 to use a builtin template:

Code:
{authors:re(\(.+\),)}
in example 5, instead of using a user-defined function to remove author roles.

Also added the following example:

Example No. 8: Tag management using similar algorithm and builtin templates:

Spoiler:


In this example we will use the metadata variations feature to de-clutter our tags by getting rid of duplicate tags. We will use the similar match algorithm, which already does a good job of finding duplicates, and enhance it with templates to do an even better job.

The advanced metadata variations dialog is used in exactly the same way as the Find Book Duplicates dialog; the only difference is that we have a single match rule for a single column.

To start this example, we do the tag match using the similar match as we have done before. Doing this on my test library reveals some deficiencies that need to be addressed:
  1. The following pairs of tags do not qualify as duplicates

    Code:
    analytics
    analytic
    Code:
    budget
    Budgeting
    Code:
    Cartooning
    cartoons
    To address this, we will add a template that uses builtin functions to remove 's', 'es' and 'ing' from the end of words in tags, so that the tags above can match. To do this, we add the following template to the similar match algorithm (as demonstrated before):

    Code:
    {tags:re((e?s\b|ing\b),)}
  2. The similar match can match hierarchical tags regardless of separator, like the pair below:

    N.B. There is one separator that makes this fail; we will discuss it at the end and see how to correct it.

    Code:
    Fiction.Thrillers.Suspense
    Fiction ::: Thrillers ::: Suspense
    This is really useful, but it still leaves a lot to be desired. For example the following pairs of tags fail to match:

    Code:
    Crime::Mystery::Thriller
    Thrillers.Crime.Mystery

    Code:
    Crime & mystery
    Mystery & Crime
    The two pairs above have a different sort order, which the builtin similar match does not account for. We will correct this by using the template below, which sorts the tags before matching them:

    Code:
    {tags:list_sort(0, )}
    Note: space is used as list separator in the above template. We will explain why in the next point.

    Note: We add the above template to the similar match in addition to the template we added before, which matches plural "thrillers" with singular "thriller".


  3. Even after adding the previous template, there is one case when hierarchical tags fail to match. Out of the three tags below, the first two match, while the third fails to match:

    Code:
    Thrillers.Crime.Mystery
    Crime / Mystery / Thriller
    Crime/Mystery/Thriller
    The problem here is not the sort order, which was taken care of in the previous point. The problem is that the slash is not processed like other separators by the similar match algorithm. To understand this better, we need to know how the similar algorithm works, which is explained briefly below:

    Quote:
    The similar algorithm does four things:
    1. It removes some special characters.
    2. It replaces some other characters with a space.
    3. It collapses multiple adjacent spaces into a single one.
    4. It converts all characters to ASCII lower case characters.

    Most separators (like dots and colons) are replaced with a space. The slash, however, is removed without being replaced by a space. So, applying the rules above, the tags will evaluate as follows:

    The first tag will evaluate to:

    Code:
    thrillers crime mystery
    The second will evaluate to:

    Code:
    crime mystery thriller
    The third will evaluate to:

    Code:
    crimemysterythriller
    The first two have the following differences:
    • One of them has the plural form "thrillers". The first template we wrote takes care of that.
    • They have a different sort order. The second template takes care of that as well.

    So, the first two will match.

    If you want the slash to be treated like other separators, you will have to add this template before the similar match acts on the tags:

    Code:
    {tags:re(/, )}
    The above template replaces any slash with a space.

    Note: The order here is important. You must add this template before the similar match algorithm. If you put it after, it will have no effect.
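
The whole chain from this example can be approximated in plain Python. This is a rough sketch, not the plugin's code: the similar() function below is a simplified stand-in that only handles the characters named in this post, the function names are invented, and the order (slash template, then similar, then the other two templates) is an assumption based on the notes above.

```python
import re

def replace_slash(tag):
    # Mirrors the template {tags:re(/, )}: every slash becomes a space.
    return re.sub(r"/", " ", tag)

def similar(tag):
    # Simplified stand-in for the similar algorithm, limited to the
    # characters this post mentions (an assumption, not the plugin's code).
    tag = tag.replace("/", "")              # the slash is removed outright
    tag = re.sub(r"[.:]", " ", tag)         # dots and colons become spaces
    tag = re.sub(r"\s+", " ", tag).strip()  # collapse runs of spaces
    return tag.lower()

def strip_suffixes(tag):
    # Mirrors {tags:re((e?s\b|ing\b),)}: drop trailing 's', 'es' and 'ing'.
    return re.sub(r"(e?s\b|ing\b)", "", tag)

def sort_words(tag):
    # Mirrors {tags:list_sort(0, )}: sort the space-separated words ascending.
    return " ".join(sorted(tag.split()))

def match_key(tag, fix_slash=True):
    # Optionally apply the slash template first, then the remaining steps.
    if fix_slash:
        tag = replace_slash(tag)
    return sort_words(strip_suffixes(similar(tag)))

tags = ["Thrillers.Crime.Mystery", "Crime / Mystery / Thriller", "Crime/Mystery/Thriller"]
print([match_key(t, fix_slash=False) for t in tags])
# ['crime mystery thriller', 'crime mystery thriller', 'crimemysterythriller']
print([match_key(t) for t in tags])
# ['crime mystery thriller', 'crime mystery thriller', 'crime mystery thriller']
```

With fix_slash=False the third tag collapses into a single word and cannot match; with the slash replaced first, all three tags produce the same key.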

Last edited by capink; 08-19-2020 at 05:39 AM. Reason: correcting typs
Old 08-30-2020, 11:05 AM   #741
jony08
Connoisseur
Posts: 91
Karma: 10
Join Date: Jun 2016
Device: Kobo Aura
Please add a function to automatically delete one of the duplicates if it has a certain format compared to the other. For example, I want to delete all PDF files automatically if another format is available.
Old 08-30-2020, 12:28 PM   #742
Tanjamuse
Wizard
Posts: 1,327
Karma: 5306
Join Date: Jan 2014
Device: none
Or any other columns? Word count is lowest or last-edited date?
Old 08-30-2020, 07:05 PM   #743
capink
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
Quote:
Originally Posted by jony08 View Post
Please add a function to automatically delete one of the duplicates if it has a certain format compared to the other. For example, I want to delete all PDF files automatically if another format is available.
Quote:
Originally Posted by Tanjamuse View Post
Or any other columns? Word count is lowest or last-edited date?


I think this issue has been addressed more than once by kiwidude. It is outside the scope of this plugin to decide which books to keep and which to delete. He also noted several times that this is better implemented in a separate plugin.

And even if one is to write a separate plugin to handle this, there are a lot of difficulties in implementing it; the most obvious is the question:

How to decide which book(s) to delete and which to keep? Every user has their own set of criteria, which makes it difficult to write a plugin that satisfies the needs of all users (short of writing a separate plugin or routine for each user).

So the best way is for each user to implement their own routine, by writing Python scripts and running them through calibre-debug. This has the obvious problem that most users don't code and cannot go down this path. But even for people who can code and want to write their own scripts to handle their unique individual needs, this can be challenging:
  • Take, for example, the first request: deleting books that have only a PDF format if other formats exist. A lot of the time you will find two duplicate entries, each containing both PDF and EPUB formats. Now you have to decide which one of them to delete. Do I delete one of them randomly? Or maybe I should implement another set of criteria for such occurrences.
  • One solution to this is to keep the last edited, as suggested in the second request. But now I have another problem: a lot of the time they will all have the same modification time, because calibre resets the modification date in a lot of situations (for example, whenever you add a custom column, calibre resets the modification date for all books in the library).
So now I have to decide on some additional criteria to determine which books to delete, which will lead me further down the rabbit hole, until I finally realize it is actually easier to manually choose which books to delete from the GUI.
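
To make the problem concrete, here is a minimal sketch of such a routine in Python. It works on made-up records rather than on the calibre database, and the function name pick_deletions and the two rules are only assumptions drawn from the two requests above.

```python
from datetime import datetime

# Hypothetical duplicate group; a real script would read these values
# from the calibre database, which this sketch does not touch.
group = [
    {"id": 10, "formats": {"PDF"},         "last_modified": datetime(2020, 8, 1)},
    {"id": 11, "formats": {"PDF", "EPUB"}, "last_modified": datetime(2020, 8, 1)},
    {"id": 12, "formats": {"PDF", "EPUB"}, "last_modified": datetime(2020, 8, 1)},
]

def pick_deletions(group):
    """Return (ids safe to delete, ids that still need a manual decision)."""
    has_other = [b for b in group if b["formats"] - {"PDF"}]
    pdf_only = [b for b in group if not (b["formats"] - {"PDF"})]
    # Rule 1: PDF-only entries go, but only if some entry offers another format.
    to_delete = [b["id"] for b in pdf_only] if has_other else []
    survivors = has_other if has_other else pdf_only
    # Rule 2: keep the most recently modified survivor, if that is unambiguous.
    newest = max(b["last_modified"] for b in survivors)
    tied = [b["id"] for b in survivors if b["last_modified"] == newest]
    if len(tied) == 1:
        to_delete += [b["id"] for b in survivors if b["id"] != tied[0]]
        return to_delete, []
    # Several entries share the newest timestamp: older ones can go,
    # but the tie itself cannot be broken automatically.
    to_delete += [b["id"] for b in survivors if b["last_modified"] != newest]
    return to_delete, tied

print(pick_deletions(group))  # ([10], [11, 12])
```

The PDF-only entry is easy, but books 11 and 12 have the same formats and the same modification time, so the script has to hand the decision back to the user.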

That being said, if someone can and wants to implement this feature, all power to them.
Old 08-31-2020, 12:36 PM   #744
Tanjamuse
Wizard
Posts: 1,327
Karma: 5306
Join Date: Jan 2014
Device: none
How about just an option for sorting the books?

Example: First by set of duplicate and then by a date column?

Then I would know automatically that the second book would always be the oldest?

Thanks so much in advance.
Old 08-31-2020, 01:52 PM   #745
theducks
Well trained by Cats
Posts: 31,076
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by Tanjamuse View Post
How about just an option for sorting the books?

Example: First by set of duplicate and then by a date column?

Then I would know automatically that the second book would always be the oldest?

Thanks so much in advance.
You can already sort the results. Simply right-click: sort-by: date (or any other column). That is a filtered view, so Calibre's usual sorting operations still apply.
Note: Date is (normally) the date the record was created, not necessarily that of the format within.
Old 08-31-2020, 02:26 PM   #746
capink
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
Quote:
Originally Posted by theducks View Post
You can already sort the results. Simply right-click: sort-by: date (or any other column). That is a filtered view, so Calibre's usual sorting operations still apply.
Note: Date is (normally) the date the record was created, not necessarily that of the format within.
The above will not work regardless of the method used to display duplicates (whether it is showing one group at a time, or all at once). This is because the plugin applies its own sort filters and refreshes them between results.

This means that if all duplicates are shown at once, sorting by date will mess up the groups because it overrides the plugin mechanism for showing them next to each other. On the other hand, if the plugin is set to show one group at a time, each time you move to the next group, the plugin will override whatever sort filter you applied in the previous group.
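
The first case can be illustrated with a few lines of Python. This is not how the plugin sorts internally; the rows and group numbers are invented, and the compound key is just the kind of ordering asked about earlier in the thread.

```python
from datetime import date

# Hypothetical result rows: (duplicate group number, title, date added).
rows = [
    (1, "Dune", date(2020, 5, 1)),
    (2, "Emma", date(2019, 1, 9)),
    (1, "Dune", date(2018, 3, 2)),
    (2, "Emma", date(2020, 7, 4)),
]

# Sorting on the date alone interleaves the groups.
by_date = sorted(rows, key=lambda r: r[2])
# Sorting on (group, date) keeps each group together, oldest first.
by_group_then_date = sorted(rows, key=lambda r: (r[0], r[2]))

print([r[0] for r in by_date])             # [1, 2, 1, 2]
print([r[0] for r in by_group_then_date])  # [1, 1, 2, 2]
```

A plain date sort scatters the members of each group, which is the effect described above when all duplicates are shown at once.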
Old 09-26-2020, 06:19 AM   #747
JSWolf
Resident Curmudgeon
Posts: 79,792
Karma: 146391129
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Can someone please fix Find Duplicates for Calibre 5? Thanks.
Old 09-26-2020, 06:38 AM   #748
capink
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
What exactly is not working in calibre 5?
Old 09-26-2020, 06:47 AM   #749
JSWolf
Resident Curmudgeon
Posts: 79,792
Karma: 146391129
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by capink View Post
What exactly is not working in calibre 5?
There's another thread where someone is saying that DeDRM and Find Duplicates are not working with Calibre 5.

DeDRM is a known issue that is easily solved with a 4.23 portable install.

I guess I should have looked at the update history before posting. Sorry.
Old 09-26-2020, 02:09 PM   #750
mbovenka
Wizard
Posts: 2,079
Karma: 14079267
Join Date: Oct 2007
Location: Almere, The Netherlands
Device: Kobo Sage
Quote:
Originally Posted by JSWolf View Post
There's another thread where someone is saying that DeDRM and Find Duplicates are not working with Calibre 5.
Find Duplicates works fine with Calibre 5.
Tags
cross library duplicates, in library duplicates


All times are GMT -4.