Old 03-26-2024, 09:59 AM   #63
lomkiri
Replace a list of words by the words of another list (works with big numbers) (v2)

Quote:
Originally Posted by moldy View Post
For my purposes I would have liked to have been able to add regex to the keys individually in the json file rather than globally
Modify your entries in the json this way:
"old_name": ["new_name", "regex"]
If you don't provide a regex, the default one from the function (rgx_default) is used. In that case, the entry may take any of these forms, as you prefer:
"old_name": ["new_name"]
or "old_name": "new_name"
or "old_name": ["new_name", ""]
Example:
Code:
{
  "John": ["Mike", "\\b{}(?![^<>]*>)"],
  "Paul": "Keith",
  "George": ["Ronnie",  "{}\\b(?![^<>]*>)"],
  "Ringo": ["Charlie"]
}
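As a rough sketch of how these entry forms end up equivalent once loaded: this is a standalone illustration using the standard json module rather than calibre's JSONConfig, and the helper name normalize_entries is made up for the example (the third entry is changed to use an empty regex, to show that form too).

```python
import json

RGX_DEFAULT = r'\b{}\b(?![^<>]*>)'  # same default as rgx_default in the function


def normalize_entries(raw, rgx_default=RGX_DEFAULT):
    """Return {old_name: [new_name, regex]} for every accepted entry form.

    Hypothetical helper, written only for this illustration.
    """
    normalized = {}
    for old_name, val in raw.items():
        if isinstance(val, str):        # "old": "new"
            val = [val, rgx_default]
        elif len(val) == 1:             # "old": ["new"]
            val = [val[0], rgx_default]
        elif not val[1]:                # "old": ["new", ""]
            val = [val[0], rgx_default]
        else:                           # "old": ["new", "regex"]
            val = list(val)
        normalized[old_name] = val
    return normalized


# Raw string, so the doubled backslashes reach the JSON parser untouched
text = r'''
{
  "John": ["Mike", "\\b{}(?![^<>]*>)"],
  "Paul": "Keith",
  "George": ["Ronnie", ""],
  "Ringo": ["Charlie"]
}
'''
entries = normalize_entries(json.loads(text))
for old_name, (new_name, rgx) in entries.items():
    print(old_name, '->', new_name, 'using', rgx)
```

Only "John" keeps its own regex here; the three other entries fall back to the default.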
Inside the regex, {} is replaced in the function by old_name, using str.format().
You must double all the backslashes in the json.
Literal curly brackets in the regex (including quantifiers such as {2,3}) clash with format(), which tries to interpret them. If you need them, a work-around must be used, such as doubling them ({{ and }}).
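To make the {} substitution and the curly-bracket limitation more concrete, here is a small sketch. It uses the standard re module as a stand-in (the function itself uses the third-party regex module), and the sample strings are invented for the illustration.

```python
import re  # stdlib stand-in; the function in this post uses the "regex" module

# The {} placeholder receives the key through str.format()
rgx = r'\b{}\b(?![^<>]*>)'
pattern = rgx.format('John')

# The lookahead (?![^<>]*>) rejects a match that sits inside a tag,
# and \b...\b keeps "Johnny" from matching
html = '<p class="John">John met Johnny</p>'
result = re.sub(pattern, 'Mike', html)

# Literal braces (here a quantifier) must be doubled to survive format()
quant = r'\b{}s{{0,1}}\b'.format('cat')
plural = re.sub(quant, 'dog', 'cat cats cattle')

# An undoubled pair is read by format() as a replacement field and fails
try:
    r'\b{}s{2}\b'.format('cat')
    failed = False
except (ValueError, IndexError, KeyError):
    failed = True

print(pattern)
print(result)
print(quant)
print(plural)
print(failed)
```

In the json file the doubled braces need no further escaping; only the backslashes do.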

The function is:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    """
    Replace words using a dict stored in a json file, with optional counters.
    A regex may be provided with each word; otherwise the default regex is used.
    The search may be case sensitive or not.
    The capitalized and/or uppercase form of each entry may also be searched.
    
    find : <body[^>]*>\K(.+)</body>
    "dot all" must be checked
    """
    
    ### Parameters
    # Put the file 'change_words.json' in the config-folder of calibre
    # If you choose another name for the json, change it here:
    fname = 'change_words.json'
    #
    # case_sensitive: Search case-sensitively. If False, do_capitalize and do_uppercase are ignored:
    # john (as an entry) will also match JOHN, jOHn, etc. (the replacement is inserted exactly as written in the json),
    # and if good -> bad is in the list, then good, GOOD, Good, gOOD will all be transformed into bad
    # You probably want case_sensitive = True
    case_sensitive = True
    #
    # do_capitalize: If True, search also for the capitalized form, e.g. if good -> bad is in the list, search also for Good -> Bad
    do_capitalize = True    # False => the case of the words in the list has to be strictly respected
    #
    # do_uppercase: Same as above for uppercase : if good -> bad is in the list, search also for GOOD -> BAD
    do_uppercase = False
    #
    # do_capitalize and do_uppercase don't apply if the search is case insensitive
    do_capitalize = case_sensitive and do_capitalize
    do_uppercase = case_sensitive and do_uppercase
    #
    # If do_count is True, will write the total of changes.
    do_count = True    # put False if you don't want any counter
    #
    # If count_by_name is also True, the function will write the counters by name in the
    # file "change_words_counters.json" (in the config-folder of calibre)
    count_by_name = True
    counters_fname = 'change_words_counters.json'
    #
    # The regex to be applied for the names that have no regex provided in the json:
    rgx_default = r'\b{}\b(?![^<>]*>)'  # Find the key as a whole word, excluding anything inside <...> tags
    ### End Parameters

    from calibre.utils.config import JSONConfig
    import regex
    flags = 0 if case_sensitive else regex.IGNORECASE

    # === Last passage: called once more after the last match, if counters were requested
    if match is None:
        print(f"{data['initial_lengh']} entries in the file '{fname}'")
        if not case_sensitive:
            print("Searching case-insensitively.")
        else:
            if do_capitalize:
                print("Searching also for the capitalized form of the words of the dict.")
            if do_uppercase:
                print("Searching also for the uppercase form of the words of the dict.")
            if do_capitalize or do_uppercase:
                print(f"{len(data['equiv'])} entries after adding the capitalized and/or uppercase form of the words.")

        if data['total'] > 0:
            # The totals are always printed; only the message about the file depends on count_by_name
            print(f"In this dict, {len(data['counters'])} words had at least one occurrence\n"
                  f"=== The total of all changes is: {data['total']} ===\n")

            if count_by_name and counters_fname:
                counters_file = JSONConfig(counters_fname)
                counters_file.clear()
                counters_file.update(data['counters'])
                counters_file.commit()
                print(f"The file {counters_fname} has been written with the counters by word")
        else:
            print('No occurrence found.\n')
        return

    # === First passage
    # Load the json file only at first passage
    # The dict "data" retains its values throughout all the passages of a "replace all"
    if number == 1:
        data['equiv'] = JSONConfig(fname).copy()    # .copy() avoids the json file being reformatted or modified
        if not data['equiv']:
            print(f'Problem loading {fname}, no treatment will be done')
            return match.group()    # give back the matched text unchanged

        # Prepare the dict for the treatment: every value becomes [new_name, regex]
        d = data['equiv'].copy()
        for key, val in d.items():
            if isinstance(val, str):
                val = [val, rgx_default]
            elif len(val) == 1:
                val.append(rgx_default)
            elif not val[1]:
                val[1] = rgx_default
            data['equiv'][key] = val
            # If param asks to search also for the capitalized form:
            if do_capitalize and (key_cap := key.capitalize()) != key:
                data['equiv'][key_cap] = [val[0].capitalize(), val[1]]
            # If param asks to search also for the uppercase form:
            if do_uppercase and (key_up := key.upper()) != key:
                data['equiv'][key_up] = [val[0].upper(), val[1]]

        # Activate the counters:
        if do_count:
            replace.call_after_last_match = True    # Ask for last passage
            data['initial_lengh'] = len(d)  # differs from len(data['equiv']) if do_capitalize or do_uppercase is True
            data['total'] = 0
            data['counters'] = {}

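    # === Normal passage: apply every entry of the dict to the matched text
    m = match.group()
    for key, val in data['equiv'].items():
        new_name, rgx = val
        # key is inserted as-is into the regex, so regex metacharacters in a key are interpreted
        m, n = regex.subn(rgx.format(key), new_name, m, flags=flags)
        if do_count and n:
            data['total'] += n
            data['counters'][key] = data['counters'].get(key, 0) + n
    return m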
Edit (22/02/2025): This function has been modified to take into account message 70 from maxthegold (thank you for the idea). By default, it now also looks for the capitalized form of each word of the list (see the parameter "do_capitalize"). The modification proposed by maxthegold is thus included (in another way) and is no longer necessary. A new counter gives the number of entries after capitalization.

Edit (26/04/2025): This function has been modified to take into account message 83 and may now also search for the uppercase form of an entry (good -> bad will also find GOOD and transform it into BAD).
Furthermore, I added a parameter for case-insensitive search (good -> bad will transform "good GOOD Good" into "bad bad bad"). This parameter overrides the parameters do_capitalize and do_uppercase.
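
To compare the case options side by side, here is a simplified sketch outside calibre. It uses the standard re module and a plain \b{}\b regex, and the helper names expand_case_variants and replace_all are invented for the illustration; it only mimics the behaviour described above.

```python
import re


def expand_case_variants(entries, do_capitalize=True, do_uppercase=False):
    """Add the capitalized and/or uppercase form of each entry (hypothetical helper)."""
    expanded = dict(entries)
    for old, new in entries.items():
        if do_capitalize and old.capitalize() != old:
            expanded[old.capitalize()] = new.capitalize()
        if do_uppercase and old.upper() != old:
            expanded[old.upper()] = new.upper()
    return expanded


def replace_all(entries, text, flags=0):
    """Apply every old -> new pair as a whole-word replacement (hypothetical helper)."""
    for old, new in entries.items():
        text = re.sub(r'\b{}\b'.format(old), new, text, flags=flags)
    return text


sample = 'good GOOD Good'
words = {'good': 'bad'}

# Strict case: only the exact entry is replaced
strict = replace_all(words, sample)
# do_capitalize: Good -> Bad is added
with_cap = replace_all(expand_case_variants(words), sample)
# do_capitalize and do_uppercase: GOOD -> BAD is added too
with_both = replace_all(expand_case_variants(words, do_uppercase=True), sample)
# Case-insensitive: every form becomes the replacement exactly as listed
insensitive = replace_all(words, sample, flags=re.IGNORECASE)

print(strict)
print(with_cap)
print(with_both)
print(insensitive)
```

Note that str.capitalize() lowercases everything after the first letter, so a key with inner capitals gets a capitalized form that may not be the one you expect.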

Last edited by lomkiri; 05-04-2025 at 07:07 AM. Reason: more detailed help