Old 03-24-2024, 03:26 PM   #61
lomkiri
Groupie
Posts: 173
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Replace a list of words by the words of another list (works with big numbers)

(See 2 messages below to get the last version of the function and of the json file)

In the thread Search and Replace from a List, moldy asked how he could replace a list of names by the names of another list, i.e.
["John", "Paul", "George", "Ringo"] by
["Mick", "Keith", "Ronnie", "Charlie"].

One way to do this with a generic regex and a function is to keep the word pairs in a dict stored in an external json file. Only the dict has to be adapted to each need, and it works with any number of words (even hundreds, which was the case for moldy).

With our example, put the file change_words.json in the config folder of calibre, with this content:
Code:
{
  "John": "Mick",
  "Paul": "Keith",
  "George": "Ronnie",
  "Ringo": "Charlie"
}
Code:
find : <body[^>]*>\K(.+)</body>
"dot all" must be checked
This regex selects the whole html page, i.e. all text in the <body>
The regex inside the function will avoid everything inside <>, so the html tags won't be scanned.
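To see how the tag-skipping lookahead behaves, here is a minimal sketch that can be run outside calibre. It uses the standard `re` module so that it is self-contained (the function in this post uses the `regex` module instead); the helper name `replace_outside_tags` and the sample string are only assumptions for illustration.

```python
import re

def replace_outside_tags(html: str, old: str, new: str) -> str:
    """Replace the whole word `old` by `new`, skipping anything inside <...>."""
    # (?![^<>]*>) rejects a match when the next angle bracket after the word
    # is a '>', i.e. when the word sits inside a tag.
    pattern = rf'\b{old}\b(?![^<>]*>)'
    return re.sub(pattern, new, html)

sample = '<p class="John" title="John Doe">John met Johnny.</p>'
result = replace_outside_tags(sample, 'John', 'Mick')
print(result)
# → <p class="John" title="John Doe">Mick met Johnny.</p>
```

The attribute values are left alone because they are followed by the closing `>` of the tag, and "Johnny" is left alone because of the trailing `\b`.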
The function, in its simplest form (without any counters), is:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Replace words using a dict in a json file (without counters)
    from calibre.utils.config import JSONConfig
    import regex

    # Put the file 'change_words.json' in the config-folder of calibre
    # If you choose another name for the json, change it here:
    fname = 'change_words.json'

    # Load the json only at the first passage
    # data retains its values throughout all passages when using "replace all"
    if number == 1:
        data['equiv'] = JSONConfig(fname)
        if not data['equiv']:
            print(f'Problem loading {fname}, no treatment will be done')

    # normal passage
    m = match.group()
    for key, val in data['equiv'].items():
        # Find key, excluding everything between <...>
        m = regex.sub(rf'\b{key}\b(?![^<>]*>)', val, m)
    return m
Note: Each replacement covers a whole page, so the number of changes reported by calibre is the number of pages scanned, even if the first page contains a hundred changes and the other pages none.
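One point that may be worth keeping in mind with the function above: the key is inserted into the pattern as it is, so a key containing regex metacharacters (a dot, for instance) is read as a regex. The sketch below shows one possible way to treat keys as literal text. It uses the standard `re` module for illustration, and the helper name `sub_literal` and the sample text are hypothetical, not part of the original function.

```python
import re

def sub_literal(text: str, key: str, val: str) -> str:
    """Replace `key` taken as literal text, outside of tags."""
    # re.escape() neutralises regex metacharacters found in the key.
    pattern = rf'\b{re.escape(key)}\b(?![^<>]*>)'
    # A function as replacement keeps any backslash in val literal as well.
    return re.sub(pattern, lambda _m: val, text)

text = '<p>v1.0 and v1x0</p>'

# Key used as it is: the dot matches any character, so both words change.
raw = re.sub(r'\bv1.0\b(?![^<>]*>)', 'v2', text)
print(raw)   # → <p>v2 and v2</p>

# Key escaped: only the literal "v1.0" changes.
safe = sub_literal(text, 'v1.0', 'v2')
print(safe)  # → <p>v2 and v1x0</p>
```

Whether escaping is wanted depends on the use case: if the keys are plain names, both forms behave the same.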

If we want a counter with the real number of changes, the code is the one below. Calibre will open a "debug window" at the end of the replace action with some counters. It is also possible to write a json file with the number of changes per word (enabled by default). Set the variables "do_count" and "count_by_name" as you need.

Counters will be more accurate with "replace all".
If replacements are made one by one, the counters are reset at each file.

Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Replace words using a dict in a json file, with possibility of counters
    from calibre.utils.config import JSONConfig
    import regex

    ### Parameters
    # Put the file 'change_words.json' in the config-folder of calibre
    # If you choose another name for the json, change it here:
    fname = 'change_words.json'

    # If do_count is True, will write the total of changes.
    # If count_by_name is also True, the function will write the counters by name in the
    # file "change_words_counters.json" (in the config-folder of calibre)
    do_count = True    # put False if you don't want any counter
    count_by_name = True
    counters_fname = 'change_words_counters.json'
    ### End Parameters


    # === Last passage: if counters were asked in the heading of this function
    if match is None:
        if data['total'] == 0:
            print('No occurrence found.\n'
                  f"We had to change a list of {len(data['equiv'])} words (in {fname})")
            return

        if count_by_name:
            json = JSONConfig(counters_fname)
            json.clear()
            json.update(data['counters'])
            json.commit()

        print(f"We had to change a list of {len(data['equiv'])} words (in {fname})\n"
              f"In this list, {len(data['counters'])} words had at least one occurrence\n"
              f"=== The total of all changes is: {data['total']} ===\n")
        # Mention the counters file only when it has actually been written.
        # (A trailing "if ... else ''" inside print() would apply to the whole
        # concatenated message and print nothing when count_by_name is False.)
        if count_by_name:
            print(f"The file {counters_fname} has been written with the counters by word")
        return

    # === First passage
    # Load the json file only at the first passage
    # data retains its values throughout all passages when using "replace all"
    if number == 1:
        data['equiv'] = JSONConfig(fname)
        if not data['equiv']:
            print(f'Problem loading {fname}, no treatment will be done')
        if do_count:
            replace.call_after_last_match = True    # Ask for last passage
            data['total'] = 0
            data['counters']= {}


    # === normal passage
    m = match.group()
    for key, val in data['equiv'].items():
        # Find key, excluding everything between <...>
        m, n = regex.subn(rf'\b{key}\b(?![^<>]*>)', val, m)
        if do_count and n:
            data['total'] += n
            data['counters'][key] = data['counters'].get(key, 0) + n
    return m
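The counting logic can be tried outside calibre with a small sketch. Here the pages are plain strings in a list, the standard `re` module stands in for `regex`, and the name `replace_and_count` is only an assumption for illustration; in calibre the totals live in the `data` dict between passages instead.

```python
import re

def replace_and_count(pages, equiv):
    """Apply every old -> new pair to every page and count the real substitutions."""
    total = 0
    counters = {}
    new_pages = []
    for page in pages:
        for key, val in equiv.items():
            # subn() returns the new text and the number of substitutions made
            page, n = re.subn(rf'\b{key}\b(?![^<>]*>)', val, page)
            if n:
                total += n
                counters[key] = counters.get(key, 0) + n
        new_pages.append(page)
    return new_pages, total, counters

pages = ['<p>John and Paul. John again.</p>', '<p>Nothing here.</p>', '<p>Paul</p>']
equiv = {'John': 'Mick', 'Paul': 'Keith'}
new_pages, total, counters = replace_and_count(pages, equiv)
print(total, counters)
# → 4 {'John': 2, 'Paul': 2}
```

Three pages are scanned but four changes are made, which is the difference between the count reported by calibre and the counters of the function.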

Last edited by lomkiri; 04-26-2025 at 05:06 AM.
Old 03-26-2024, 06:30 AM   #62
moldy
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2015
Device: Kindle
Thank you @lomkiri. The function, including the counters, works perfectly on my data file of 187 entries (and growing ever larger).

For my purposes I would have liked to have been able to add regex to the keys individually in the json file rather than globally in the line:
Code:
 m = regex.sub(rf'\b{key}\b(?![^<>]*>)', val, m)
This is in no way a criticism and the function is perfectly adequate for my needs as is.

Some of the entries in my data benefit from having the trailing '\b' removed while in others it causes errors. Some entries would be better with a little more regex added.

I can get round this by writing searches/replaces in a json with regex added and importing it into Saved Searches. Obviously this takes a little more time.
Old 03-26-2024, 09:59 AM   #63
lomkiri
Groupie
Posts: 173
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Replace a list of words by the words of another list (works with big numbers) (v2)

Quote:
Originally Posted by moldy
For my purposes I would have liked to have been able to add regex to the keys individually in the json file rather than globally
Modify your entries in the json in this way:
"old_name": ["new_name", "regex"]
If you don't provide a regex, the one from the function will be used. In that case, the entry can take any of these forms:
"old_name": ["new_name"]
or "old_name": "new_name"
or "old_name": ["new_name", ""]
Example :
Code:
{
  "John": ["Mike", "\\b{}(?![^<>]*>)"],
  "Paul": "Keith",
  "George": ["Ronnie",  "{}\\b(?![^<>]*>)"],
  "Ringo": ["Charlie"]
}
Inside the regex, {} will be replaced in the function by old_name using str.format()
You must double all the backslashes in the json
You cannot use single curly brackets in the regex (to match braces in the text or to write a quantifier), since format() will try to interpret them. If you need them, double them ({{ and }}): format() turns them back into single braces.
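The three rules above can be checked with a short sketch using only the standard library (`json` and `re`). The "Room" entry is a made-up example added here to show doubled braces; it is not part of the json of this post.

```python
import json
import re

# The json text as it would appear in the file: every backslash is doubled,
# and the quantifier {3} is written {{3}} because of str.format().
json_text = r'''
{
  "John": ["Mike", "\\b{}(?![^<>]*>)"],
  "Room": ["Suite", "\\b{}(?= \\d{{3}}\\b)"]
}
'''
entries = json.loads(json_text)

# What the function builds for each entry: {} receives the key.
patterns = {key: rgx.format(key) for key, (new, rgx) in entries.items()}
print(patterns['John'])  # → \bJohn(?![^<>]*>)
print(patterns['Room'])  # → \bRoom(?= \d{3}\b)

text = 'Johnny took Room 101, not Room 7.'
for key, (new, rgx) in entries.items():
    text = re.sub(rgx.format(key), new, text)
print(text)
# → Mikeny took Suite 101, not Room 7.
```

The "John" entry has no trailing `\b`, so it also changes the beginning of "Johnny"; this is the kind of per-entry control the second element of the list gives.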

The function is:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    """
    Replace words using a dict in a json file, with possibility of counters
    The regex for replacing each word may (or may not) be provided with the word
    May search with case sensitive or not
    May search also for the capitalized or/and uppercase of the word in the entry
    
    find : <body[^>]*>\K(.+)</body>
    "dot all" must be checked
    """
    
    ### Parameters
    # Put the file 'change_words.json' in the config-folder of calibre
    # If you choose another name for the json, change it here:
    fname = 'change_words.json'
    #
    # case_sensitive: search with case sensitivity. If False, do_capitalize and do_uppercase are ignored,
    # and john (as an entry) will also match JOHN, jOHn, etc. (the replacement is inserted exactly as written in the json):
    # if good -> bad is in the list, then good, GOOD, Good, gOOD will all be transformed into bad
    # You probably want case_sensitive = True
    case_sensitive = True
    #
    # do_capitalize: If True, search also for the capitalized form, e.g. if good -> bad is in the list, search also for Good -> Bad
    do_capitalize = True    # False => the case of the words in the list has to be strictly respected
    #
    # do_uppercase: Same as above for uppercase : if good -> bad is in the list, search also for GOOD -> BAD
    do_uppercase = False
    #
    # do_capitalize and do_uppercase don't apply if the search is case insensitive
    do_capitalize = case_sensitive and do_capitalize
    do_uppercase = case_sensitive and do_uppercase
    #
    # If do_count is True, will write the total of changes.
    do_count = True    # put False if you don't want any counter
    #
    # If count_by_name is also True, the function will write the counters by name in the
    # file "change_words_counters.json" (in the config-folder of calibre)
    count_by_name = True
    counters_fname = 'change_words_counters.json'
    #
    # The regex to be applied for the names that have no regex provided in the json:
    rgx_default = r'\b{}\b(?![^<>]*>)'  # Find the key, excluding everything between <...> :
    ### End Parameters

    from calibre.utils.config import JSONConfig
    import regex
    flags = 0 if case_sensitive else regex.IGNORECASE

    # === Last passage: if counters were asked in the heading of this function
    if match is None:
        print(f"{data['initial_lengh']} entries in the file '{fname}'")
        if not case_sensitive:
            print(f"Searching with case insensitive.")
        else:
            if do_capitalize:
                print(f"Searching also for the capitalized form of the words of the dict.")
            if do_uppercase:
                print(f"Searching also for the uppercase form of the words of the dict.")
            if do_capitalize or do_uppercase:
                print(f"{len(data['equiv'])} entries after adding the capitalized and/or uppercase form of the words.")

        if data['total'] > 0:
            print(f"In this dict, {len(data['counters'])} words had at least one occurrence\n"
                  f"=== The total of all changes is: {data['total']} ===\n")

            if count_by_name and counters_fname:
                json = JSONConfig(counters_fname)
                json.clear()
                json.update(data['counters'])
                json.commit()
                # Printed separately: a trailing "if ... else ''" inside the print()
                # above would apply to the whole message, not only to this line
                print(f"The file {counters_fname} has been written with the counters by word")
        else:
            print('No occurrence found.\n')
        return

    # === First passage
    # Load the json file only at the first passage
    # The dict "data" retains its values throughout all the passages when using "replace all"
    if number == 1:
        data['equiv'] = JSONConfig(fname).copy()    # .copy() prevents the json file from being reformatted or modified
        if not data['equiv']:
            print(f'Problem loading {fname}, no treatment will be done')
            return match.group()    # return the text unchanged rather than None

        # Prepare the dict for the treatment:
        d = data['equiv'].copy()
        for key, val in d.items():
            if isinstance(val, str):
                val = [val, rgx_default]
            elif len(val) == 1:
                val.append(rgx_default)
            elif not val[1]:
                val[1] = rgx_default
            data['equiv'][key] = val
            # If param asks to search also for the capitalized form:
            if do_capitalize and (key_cap := key.capitalize()) != key:
                data['equiv'][key_cap] = [val[0].capitalize(), val[1]]
            if do_uppercase and (key_up := key.upper()) != key:
                data['equiv'][key_up] = [val[0].upper(), val[1]]
                
        # Activate the counters:
        if do_count:
            replace.call_after_last_match = True    # Ask for last passage
            data['initial_lengh'] = len(d)  # differs from len(data['equiv']) if do_capitalize or do_uppercase is True
            data['total'] = 0
            data['counters']= {}

    # === normal passage
    m = match.group()
    for key, val in data['equiv'].items():
        [new_name, rgx] = val
        m, n = regex.subn(rgx.format(key), new_name, m, flags=flags)
        if do_count and n:
            data['total'] += n
            data['counters'][key] = data['counters'].get(key, 0) + n
    return m
Edit (22/02/2025): This function has been modified to take into account message 70 from maxthegold (thank you for the idea). By default it now also looks for the capitalized form of each word in the list (see the parameter "do_capitalize"). Thus, the modification proposed by maxthegold is included (in another way) and is no longer necessary. A new counter gives the number of entries after capitalization.

Edit (26/04/2025): This function has been modified to take into account message 83 and may now also search for the uppercase form of an entry (good -> bad will also find GOOD and transform it into BAD).
Furthermore, I added a parameter for case-insensitive search (good -> bad will transform "good GOOD Good" into "bad bad bad"). This parameter overrides the parameters do_capitalize and do_uppercase.
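The preparation of the dict (default regex, capitalized and uppercase forms) can be tried outside calibre with the sketch below. It mirrors the logic of the function but builds a new dict instead of updating `data['equiv']`, so the order of the keys may differ; the name `expand_entries` is an assumption for illustration.

```python
def expand_entries(entries, rgx_default, do_capitalize=True, do_uppercase=False):
    """Normalise every entry to [new_name, regex] and add the case variants."""
    expanded = {}
    for key, val in entries.items():
        if isinstance(val, str):
            val = [val, rgx_default]          # "old": "new"
        elif len(val) == 1:
            val = [val[0], rgx_default]       # "old": ["new"]
        elif not val[1]:
            val = [val[0], rgx_default]       # "old": ["new", ""]
        expanded[key] = val
        # Add the capitalized / uppercase forms only when they differ from the key
        if do_capitalize and (key_cap := key.capitalize()) != key:
            expanded[key_cap] = [val[0].capitalize(), val[1]]
        if do_uppercase and (key_up := key.upper()) != key:
            expanded[key_up] = [val[0].upper(), val[1]]
    return expanded

rgx = r'\b{}\b(?![^<>]*>)'
entries = {'good': 'bad', 'John': ['Mike', ''], 'big': ['small']}
expanded = expand_entries(entries, rgx)
print(list(expanded))
# → ['good', 'Good', 'John', 'big', 'Big']
```

Note that str.capitalize() lowercases everything after the first letter, so a key such as "McDonald" would get an extra entry "Mcdonald".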

Last edited by lomkiri; 05-04-2025 at 07:07 AM. Reason: more detailed help
Old 03-26-2024, 08:09 PM   #64
lomkiri
Groupie
Posts: 173
Karma: 1497966
Join Date: Jul 2021
Device: N/A
27/03: I have slightly modified the function so that the json file is no longer modified (JSONConfig natively reformats and sorts the json source).
An entry {"John": ["Mike", ""], ... is now accepted (the default regex applies)
Some cosmetic changes

28/03: fixed a new bug that could raise an exception in some cases.

Last edited by lomkiri; 03-28-2024 at 02:20 PM.
Old 03-28-2024, 07:01 AM   #65
moldy
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2015
Device: Kindle
Quote:
I have slightly modified the function so the json file is not modified anymore ....
Thanks lomkiri your latest installment of the function is a big improvement for me. I actually have two data files, one all lower case, and the other with a capitalised first letter. I have now merged these files into one.

The file now has 400 entries. Even with a file this size the function only takes a second or two to run.

My ultimate goal would be to have a function that would work as though it were possible to have a json something like this (similar to Saved Searches):

Code:
{"([P|p])aul([a-z]+)": "\1ol\2"}
I think that would find Paul paul Pauline pauline Paula paula and so on and invent several new names.

Last edited by moldy; 03-28-2024 at 07:07 AM. Reason: finger trouble
Old 03-28-2024, 08:05 AM   #66
lomkiri
Groupie
Posts: 173
Karma: 1497966
Join Date: Jul 2021
Device: N/A
In that case, I no longer see the point of my function: it would just mimic the saved searches feature with a json in a slightly different form, and then the proposal of Theducks applies. It's easy of course to modify the function (it just means adding an optional third element to the list of each json entry), but you can do that just as well with a json for saved searches and apply all searches at once.

If you don't want to mix those specific searches with your usual ones, you may either use a portable calibre for this specific task, or back up your different saved-search jsons and "delete all saved searches + import the right json".

What do you think? If you still want this feature, I'll adapt the function tonight, but IMO the saved searches fit your needs exactly.
Old 03-28-2024, 02:16 PM   #67
lomkiri
Groupie
Posts: 173
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Mmmh, I wrote the previous message on the train, and it was written too quickly :-).
No need to change anything in the function, just put this entry in your json:
Code:
"au": ["ol", "[Pp]\\K{}(?=[a-z]+\\b)"]
and it will do exactly what you've asked.

Explanation :
The function fills the {} of the regex with "au", so we'll apply:
[Pp]\Kau(?=[a-z]+\b)
Find [Pp] then forget this (start from after [Pp]) (because of \K)
Find "au", but only if there is [a-z]+\b after it (because of the positive lookahead)
So only "au" is selected, and it will be replaced by "ol"

BUT: get the new version of the function above; I've corrected a small bug that could raise an exception in some cases.

Last edited by lomkiri; 03-28-2024 at 07:14 PM.
Old 03-29-2024, 06:37 AM   #68
moldy
Enthusiast
Posts: 43
Karma: 10
Join Date: Oct 2015
Device: Kindle
Quote:
In that case, I don't understand the interest of my function anymore, it will just mimick the saved searches features with a json in a slightly different form,
You're absolutely correct in what you say, lomkiri. I had lost sight of my original goal, which was to automate the process as much as possible. The small number of entries that need further regex can be put into normal saved searches.

So, using yesterday's (28 March) version of the function and my data file of 411 entries (so far), every substitution was made correctly, the counters counted as expected and there were no exceptions.

I think your task on earth is now complete lomkiri. You are free to return to the mother ship.
Old 12-19-2024, 12:09 PM   #69
maxthegold
Member
Posts: 10
Karma: 10
Join Date: Mar 2011
Location: Weston-super-Mare, U.K.
Device: Kobo Libra 2
Rather than repeat all the words in the lookup file I have added some code to the search and replace. After the find and replace and incrementing of counts I put the following,
Code:
       # Do it again for capitalised words    
        m, n = regex.subn(rgx.format(key.capitalize()), new_name.capitalize(), m)
        if do_count and n:
            data['total'] += n
            data['counters'][key] = data['counters'].get(key) + n
This will add the count for the capitalised word to the count for the non capitalised word.

Other than that, thank-you lomkiri for a splendid piece of coding that does exactly what I was looking to do.

Last edited by maxthegold; 12-19-2024 at 12:14 PM.
Old 12-19-2024, 02:31 PM   #70
maxthegold
Member
Posts: 10
Karma: 10
Join Date: Mar 2011
Location: Weston-super-Mare, U.K.
Device: Kobo Libra 2
Slight problem with the code on the previous post, here is the corrected version.

Code:
        # Do it again for capitalised words    
        m, o = regex.subn(rgx.format(key.capitalize()), new_name.capitalize(), m)
        if do_count and o:
            data['total'] += o
            # .get(key, 0) starts from 0 when the key has no counter yet
            # and keeps the counts already accumulated on previous pages
            data['counters'][key] = data['counters'].get(key, 0) + o
Rookie error, not used to Python.
Old 02-13-2025, 01:34 PM   #71
Alinara
Member
Posts: 11
Karma: 10
Join Date: Nov 2023
Device: none
Need help to delete the first author in different books

I want to delete the first author in different books. The author's name differs from book to book, so I need to select the first value in the author list. But when I try a regex search, it always applies the expression to every value of the multi-value author column.

Author a b ::: c d

search (\w+)\s(\w+)
replace \2

result b ::: d

wished c d

Can someone help me?
Old 02-14-2025, 08:33 AM   #72
maxthegold
Member
Posts: 10
Karma: 10
Join Date: Mar 2011
Location: Weston-super-Mare, U.K.
Device: Kobo Libra 2
You could try,

search [\w--[\d]]+\s[\w--[\d]]+\W+([\w--[\d]]+\s[\w--[\d]]+)
replace \1
Old 02-14-2025, 08:37 AM   #73
maxthegold
Member
Posts: 10
Karma: 10
Join Date: Mar 2011
Location: Weston-super-Mare, U.K.
Device: Kobo Libra 2
Quote:
Originally Posted by Alinara
I want to delete the first author in different books. The author's name differs from book to book, so I need to select the first value in the author list. But when I try a regex search, it always applies the expression to every value of the multi-value author column.

Author a b ::: c d

search (\w+)\s(\w+)
replace \2

result b ::: d

wished c d

Can someone help me?
You could try,

search [\w--[\d]]+\s[\w--[\d]]+\W+([\w--[\d]]+\s[\w--[\d]]+)
replace \1
Old 02-15-2025, 11:59 AM   #74
Alinara
Member
Posts: 11
Karma: 10
Join Date: Nov 2023
Device: none
That does not work :-(

I don't get how the values are saved in the multi-author column, but you do not get them as normal strings with a separator... although the Preview shows them that way.
Old 02-15-2025, 03:52 PM   #75
lomkiri
Groupie
Posts: 173
Karma: 1497966
Join Date: Jul 2021
Device: N/A
Quote:
Originally Posted by Alinara
Can someone help me?
I've replied in a new thread, it seems to me that a new and dedicated thread is a better place, since this thread here is for general solutions that may apply to more people.