Search and Replace from a List - Page 2

lomkiri · 03-14-2024, 11:27 PM

I trie

Quote:

Originally Posted by moldy

ERROR: No replace function: You must create a Python function named replace in your code

It's because it's pure python, not a regex-function.
From your messages, I understood that you knew a little about python, so it was made to be executed in the python command line or in the calibre-debug prompt, by a python program (in a file) or called in interactive mode.

Create your json-file with the dict, then create a file extract.py, put this in it :

Code:

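def main():
    import json
    fname = '/data/temp/beastones.json'   # adapt this to your needs
    # Load the dict of search/replace pairs from the json file
    with open(fname, encoding='utf-8') as f:
        equiv = json.load(f)
    if not equiv:
        print(f'Problem loading {fname}')
        return
    # Print the keys joined by "|", ready to paste into the find field
    print('|'.join(equiv.keys()))

main()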

after adapting the path and name of the json file in the code

Then go to the command line, and type
[if you are on linux:] python3 your/path/extract.py
[if you are on windows:] calibre-debug your\path\extract.py

With your example, print() will display this on the command line:
John|Paul|George|Ringo
Then you can copy-paste it into the "find" field of the search that will make the substitution.

More help on calibre-debug with the option --help
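If some keys contain characters that are special in a regex (a dot, a parenthesis...), the joined string may not behave as expected in the find field. A minimal sketch of a safer join, assuming plain-text keys; the function name build_find_pattern is made up for this example and is not part of calibre:

```python
import re

def build_find_pattern(keys):
    """Join keys into one alternation, escaping regex metacharacters.

    Longer keys are placed first so that a short key cannot win over a
    longer key that starts with the same letters.
    """
    ordered = sorted(keys, key=len, reverse=True)
    return '|'.join(re.escape(k) for k in ordered)

print(build_find_pattern(['John', 'Paul', 'George', 'Ringo']))
```

The printed string can be pasted into the find field in the same way as before.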

moldy · 03-18-2024, 09:37 AM

Quote:

Originally Posted by lomkiri

I trie

It's because it's pure python, not a regex-function.
From your messages, I understood that you knew a little about python, so it was made to be executed in the python command line or in the calibre-debug prompt, by a python program (in a file) or called in interactive mode.

I only know a little Python and I couldn't get the code to run in the interpreter. Then I got a little er... confused.

Anyway; I discovered what was wrong (syntax error in the json) and managed to get the dict method to work perfectly.

However when experimenting with my actual working file I found the massive size of the data in the find field somewhat unwieldy to say the least.

In the end I decided upon a simpler solution that also works as expected.

Find field:

Code:

>[^<>]+<

Function:

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('John','Mick').replace('George','Keith').replace('Paul','Ronnie').replace('Ringo','Charlie')

My working data file is in 2 columns of text so using Notepad++ in column mode I can easily add all the other punctuation and then remove the superfluous spaces. It's also easy to add/remove/change data then copy and paste into the function.

Many thanks for your input Lomkiri. Your time wasn't totally wasted as I learned a lot from your suggestions.

lomkiri · 03-19-2024, 12:57 PM

Quote:

Originally Posted by moldy

However when experimenting with my actual working file I found the massive size of the data in the find field somewhat unwieldy to say the least.

You were the one who asked to put all the searched words in the find field :-).
With your new search string, you could use your json file in this way, avoiding the need for hundreds of Python replaces :

Code:

# your code : 
    # return match.group().replace('John','Mick').replace('George','Keith').replace('Paul','Ronnie').replace('Ringo','Charlie')

# Alternative code :
    # insert here the code to load the json file into the dict "equiv"
    m = match.group(0) 
    for key in equiv:
        m = m.replace(key, equiv[key])
    return m

With this code, the function is generic, you need to modify only the json file for another set of searched words.

If you're sure that none of the searched words is inside a tag (as "body", "span", or a class name, for example), you could even search the whole html page, much quicker :
find : <body[^>]*>\K(.+)</body> (with "dot all" checked)
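For the "insert here the code to load the json file" placeholder above, here is a minimal standalone sketch that can be tried outside calibre. The helper names load_equiv and apply_equiv are invented for this example, and the path passed to load_equiv is whatever json file you created:

```python
import json

def load_equiv(path):
    """Read the json file and return its content as a dict."""
    with open(path, encoding='utf-8') as fh:
        return json.load(fh)

def apply_equiv(text, equiv):
    """Replace every key found in text by its value (plain str.replace)."""
    for key, val in equiv.items():
        text = text.replace(key, val)
    return text

# Quick check with an in-memory dict instead of a file
sample = apply_equiv('>John and Ringo<', {'John': 'Mick', 'Ringo': 'Charlie'})
print(sample)
```

Inside the calibre function, the same loop would run on match.group(0) instead of a plain string.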

moldy · 03-20-2024, 12:15 PM

Quote:

In the end I decided upon a simpler solution that also works as expected.

Actually it didn’t work as I wanted. Using the example of John, George etc., there were matches for not only John but also Johnson, Johnjo, LongJohn and so on.
To counteract this I tried wrapping John in \b anchors in the function - no matches at all. After researching online I tried escaping the backslash \\b - no matches. After more reading I tried escaping the escape characters \\\\b - no matches. After even more research I tried the raw string solution r"\bJohn" - no matches.

I would like to go back to the dict method again (as described in lomkiri’s suggestion above). However I think I will probably have the same issue there when the pairs are passed to the function from the json file.

Is there another way around this?

lomkiri · 03-20-2024, 04:33 PM

Quote:

Originally Posted by moldy

To counteract this I tried wrapping John in \b anchors in the function

It should have worked (in a regex, but not with the python str.replace())
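A small illustration of the difference, using the standard re module rather than the regex module so that it runs anywhere (the sample text is made up):

```python
import re

text = 'Johnson, LongJohn and John'

# str.replace() takes its argument literally: it looks for a backslash,
# a "b", then "John", which is not in the text, so nothing changes.
literal = text.replace(r'\bJohn\b', 'Mick')

# Without anchors, str.replace() changes every occurrence, even inside words.
everywhere = text.replace('John', 'Mick')

# A regex substitution understands \b as a word boundary.
whole_word = re.sub(r'\bJohn\b', 'Mick', text)

print(literal)
print(everywhere)
print(whole_word)
```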

Quote:

I would like to go back to the dict method again (as described in lomkiri’s suggestion above).

Try this :

Code:

    # insert here the code to load the json file into the dict "equiv"
    # (see my post #12 for this code)
    import regex
    m = match.group() 
    for key in equiv:
        m = regex.sub(rf'\b{key}\b', equiv[key], m)
    return m

It works, I have tested it :
Johnson, Johnjo LongJohn and so on John and Ringo, and also john ==>
Johnson, Johnjo LongJohn and so on Mick and Charlie, and also john

Note: rf'\b{key}\b' is the same as r'\b{}\b'.format(key) and will be expanded to '\bJohn\b' if key == 'John'
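The equivalence of the two notations can be checked quickly. The second half of this sketch is an assumption beyond what was tested in this thread: if a key contains a regex metacharacter (the key 'P.J' here is hypothetical), escaping it with re.escape() keeps it literal. The standard re module is used so that it runs outside calibre:

```python
import re

key = 'John'
with_fstring = rf'\b{key}\b'
with_format = r'\b{}\b'.format(key)
print(with_fstring == with_format)

# Hypothetical key containing a dot, which means "any character" in a regex
key2 = 'P.J'
text = 'PxJ and P.J'
unescaped = re.sub(rf'\b{key2}\b', 'Z', text)
escaped = re.sub(rf'\b{re.escape(key2)}\b', 'Z', text)
print(unescaped)
print(escaped)
```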

It works with either <body[^>]*>\K(.+)</body> (with "dot all" checked) or >\K([^>]+)(?![^<>{}]*[>}]) (but the 1st form will be quicker, treating one whole html file at each iteration, with the condition, as I said above, that none of your keys will match something inside an html tag). The 2nd form will select the text between tags and avoid the part inside the tag.

moldy · 03-21-2024, 11:09 AM

I can't get this to work. The function considers that it has made just 1 replacement but actually it hasn't. Please view image at:

My find code is:

Code:

<body[^>]*>\K(.+)</body>

And my function is:

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    
    from calibre.utils.config import JSONConfig
    m = match[0]


    if number == 1:
        fname = 'beatstones.json'
        data['equiv'] = JSONConfig(fname)
        if not data['equiv']:
            print(f'Problem loading {fname}, no treatment will be done')
    
    return data['equiv'].get(m, m)

            
    import regex
    m = match.group() 
    for key in equiv:
        m = regex.sub(rf'\b{key}\b', equiv[key], m)
    return m

There are no errors reported. As far as I can see there are no problems from the json file and I can extract the keys from it. It must be a problem with the function but, with my limited knowledge, I can't find it.

moldy · 03-21-2024, 11:10 AM

https://imgur.com/a/21b0fFD

lomkiri · 03-21-2024, 12:19 PM

It's because you should have adapted the code.

1) The line "return data['equiv'].get(m, m)" is from the old code, it was not to be included.
2) In this code, the dict is loaded in data['equiv'], not in equiv, so you'll have to adapt the new code to this fact (the reason I've loaded it in data is that, doing this, it's necessary to load the json only once for all passages)
3) Since you're loading one whole page, it's normal that there is only one change. The regex system counts the times it takes an expression (a page, in this case). It will count a change even if there is no change in the page (it has no way to know if the "m" you return has been modified).
If you've got 5 pages, it will give you 5 changes, even if you have 100 changes in the 1st page, and none in the other 4 pages.
Click in "See modifications" to see the real changes.
4) If you want to know how many changes have been made, you'll have to use subn(), not sub(), and must increment a counter in data['counts'], and "print" this counter during the last passage (ask if you need it and you don't know how to do that).

The code (tested) is :

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    from calibre.utils.config import JSONConfig
    import regex

    # Load json only at first passage
    # data will retain its values throughout all passages when "replace all"
    if number == 1:
        fname = 'beastones.json'
        data['equiv'] = JSONConfig(fname)
        if not data['equiv']:
            print(f'Problem loading {fname}, no treatment will be done')
            
    # normal passage
    m = match.group() 
    for key, val in data['equiv'].items():
        m = regex.sub(rf'\b{key}\b', val, m)
    return m

The json file (beastones.json, in this case, change fname if you choose another filename) must be in the config folder of calibre, and must contain :

Code:

{
  "John": "Mike",
  "Paul": "Keith",
  "George": "Ronnie",
  "Ringo": "Charlie"
}
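For point 4 above (counting the real changes with subn()), here is a rough standalone sketch of the idea. It simulates the "data" dict and the pages outside calibre, uses the standard re module, and the function name replace_and_count is invented for this example. How and when to print the counter inside calibre is not shown:

```python
import re

def replace_and_count(text, equiv, data):
    """Apply every key -> value substitution to text and add the number
    of real replacements to data['counts']."""
    for key, val in equiv.items():
        # subn() returns the new text and the number of substitutions made
        text, n = re.subn(rf'\b{re.escape(key)}\b', val, text)
        data['counts'] = data.get('counts', 0) + n
    return text

equiv = {'John': 'Mick', 'Ringo': 'Charlie'}
data = {}   # stands in for the "data" argument of the calibre function
pages = ['John and Ringo met John.', 'Nothing to change here.']
results = [replace_and_count(page, equiv, data) for page in pages]
print(results)
print(data['counts'])
```

Here two "pages" are processed, but the counter reports the three words actually replaced.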

moldy · 03-22-2024, 11:15 AM

Thank you lomkiri; the function above works perfectly using my large data file. Both find statements work equally well for my purposes.

Thanks also for your perseverance and patience.

moldy · 03-23-2024, 06:31 AM

Looking at the function further @lomkiri.

Using Find: >\K([^>]+)(?![^<>{}]*[>}])

Because of the look-ahead I would have expected any text inside <> or {} to be ignored as part of the match. However using the example:

<p>John <George> {Paul} Ringo</p>

George is not matched but Paul is. Have I mis-understood how the look-ahead works?

lomkiri · 03-23-2024, 08:22 AM

[}>]\K([^>}]+)(?![^<>{}]*[>}])

(the curved brackets are here to avoid inline styles, if there are no such parts you can get rid of them)

lomkiri · 03-23-2024, 09:46 AM

Kindly proposed by EbookMakers, who is a master of regexes :-)

Excluding all that is inside <>
>\K([^<>]+)(?=<)

Excluding all that is inside <> and {}
[>}]\K([^<>{}]+)(?=[<{])
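The three expressions can be compared on the example from the question. This is only a sketch: the standard re module has no \K, so \K is dropped here and the capture group is read instead. As far as I can tell, the first expression selects {Paul} because [^>]+ is free to run across { and }, and the look-ahead is only tested at the place where the match ends.

```python
import re

sample = '<p>John <George> {Paul} Ringo</p>'

# Same expressions as in the thread, without \K (group 1 is read instead)
first_try = r'>([^>]+)(?![^<>{}]*[>}])'
tags_only = r'>([^<>]+)(?=<)'
tags_and_braces = r'[>}]([^<>{}]+)(?=[<{])'

found_first_try = re.findall(first_try, sample)
found_tags_only = re.findall(tags_only, sample)
found_tags_and_braces = re.findall(tags_and_braces, sample)

print(found_first_try)
print(found_tags_only)
print(found_tags_and_braces)
```

Only the last expression leaves out both George and Paul.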

lomkiri · 03-24-2024, 03:47 PM

I have posted in the pinned thread Saved Search/Regex Functions an enhanced version of this function.

The regex inside the function avoids the content of the html tags, so we are free now to scan the whole page, even if some class names are in the list.
It doesn't avoid anymore the text inside {} since the inline styles are not selected by the main regex (of the "find" field).

I have written also a longer version with counters (total of all changes, and (in a json file) counters by word)

lomkiri · 03-26-2024, 10:52 AM

Quote:

Originally Posted by moldy

Thank you @lomkiri. The function, including the counters, works perfectly on my data file of 187 entries (and growing ever larger).

You're very welcome. Glad it fits your needs.

A friend asked me what the practical use of this function would be, and I must say I was unable to answer :-) (apart from a Stalinist revision of history ebooks ;p)
Out of curiosity, how do you use it? I mean: in what situation do you need to replace one list of words with another?

moldy · 03-26-2024, 01:35 PM

I have sent a pm.
