MobileRead Forums - View Single Post - Find a set of char and avoid the same set up to the sentence beginning.

lomkiri · 02-27-2024, 03:09 AM

OK, your initial explanation was really unclear. I understood that the sentence you wanted to select was
 <p class="calibre8"> <span class="calibre3">« A long sentence », said the man. « Then a second part. »</span> </p>
and I made this regex for it:
(<p class="calibre8"> <span class="calibre3">)(?<! — )(«[^»]+»,.*?\.) (« )

BUT

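let's see if I really get what you mean. I understand that you want to target this sentence:
 <p class="calibre8"> <span class="calibre3">He came through the door. « I'm here », he said.</span> </p>

and transform it to
 <p class="calibre8"> <span class="calibre3">He came through the door.</span></p>
 <p class="calibre8"> <span class="calibre3"> — « I'm here », he said.</span> </p>

but not this one:
 <p class="calibre8"> <span class="calibre3"> — « Sentence ending with a comma, » said the man. « Then a second part. »</span> </p>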

Is that right? (As pdurrant said, you should have made your requests this way, with examples and counter-examples; it's much easier to understand.)

In that case, I'll do it a little differently, with the help of a regex-function.

The regex will be :

Code:

(<p class="calibre8"> <span class="calibre3">)(\s—\s«.+?,\s»\s)?([^.]+.) («)

Explanation:
group 1: <p class="calibre8"> <span class="calibre3">
group 2: \s—\s« bla,\s»\s
(note: I assume that this dialog must end with ", »". If it ends with " »," it won't be captured in group 2 and the paragraph will be split. If you don't want this, i.e. a mandatory comma before the closing quote, change the regex accordingly.)
group 3: the sentence before the next quote (whether or not group 2 exists)
not in any group (won't be kept if we split): <space>
group 4: «

Group 2 can be missing since it is in the form (expr)?. If it is missing, match[2] will be empty in the function, so this value is tested to decide whether the line has to be split (group 2 empty) or not.
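
To see the groups at work outside calibre, here is a minimal sketch that runs the regex above on the two sample paragraphs. It uses Python's standard re module in place of calibre's engine, which is an assumption made only for illustration; in re an unmatched optional group shows up as None. The names PATTERN, to_split and to_keep are hypothetical.

```python
import re

# Pattern copied from the post. The standard-library `re` module stands in
# for calibre's regex engine (an assumption for this sketch).
PATTERN = re.compile(
    r'(<p class="calibre8"> <span class="calibre3">)'  # group 1: paragraph opening
    r'(\s—\s«.+?,\s»\s)?'                              # group 2: optional leading dialog
    r'([^.]+.) («)'                                    # group 3: sentence, group 4: next quote
)

# Paragraph that should be split: it does not start with a dialog.
to_split = ('<p class="calibre8"> <span class="calibre3">'
            "He came through the door. « I'm here », he said.</span> </p>")

# Paragraph that should be left alone: it already starts with " — « ..., » ".
to_keep = ('<p class="calibre8"> <span class="calibre3">'
           ' — « Sentence ending with a comma, » said the man. '
           '« Then a second part. »</span> </p>')

m_split = PATTERN.search(to_split)
m_keep = PATTERN.search(to_keep)

# Group 2 is missing in the first case and present in the second.
print(repr(m_split[2]), repr(m_split[3]))
print(repr(m_keep[2]), repr(m_keep[3]))
```

Group 2 being missing or present is exactly what the function tests to decide between splitting and leaving the paragraph as it is.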

You said that you may have some &nbsp; characters, so I used \s instead of <space>; it matches all types of whitespace.
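
A small sketch of the difference, again using Python's standard re module as a stand-in for calibre's engine (an assumption). It only covers the actual non-breaking space character U+00A0; an entity written out as the text "&nbsp;" in the HTML source is ordinary text and is not matched by \s. The names NBSP and sample are hypothetical.

```python
import re

NBSP = "\u00a0"  # the character that the &nbsp; entity stands for

# Dialog opening written with non-breaking spaces around the em dash.
sample = f"{NBSP}—{NBSP}«"

# \s matches Unicode whitespace, including U+00A0, for str patterns.
matches_with_s = re.fullmatch(r"\s—\s«", sample) is not None

# A literal space in the pattern does not match a non-breaking space.
matches_with_space = re.fullmatch(r" — «", sample) is not None

# The entity spelled out as text is not whitespace at all.
entity_is_space = re.match(r"\s", "&nbsp;") is not None

print(matches_with_s, matches_with_space, entity_is_space)
```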

"Dot all" must be unchecked (in the French interface: Le point correspond à tout).

OLD CODE, DON'T USE IT: (see why in my next post)
The function, which is self-explanatory (comments begin with #), is:

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match[2]:
        # match[2] (the group 2) is " — « blabla, » "
        # If we have a group 2 (match[2] not empty) don't do anything (match[0] is the whole selection)
        return match[0]

    else:
        # We don't have a group 2, so the paragraph must be split
        # (the paragraph was selected by the regex, so we have a dialog in it)
        # match[1] is the html code for the beginning of the paragraph
        return match[1] + match[3] + '</span></p>\n\n  ' + match[1] + ' — ' + match[4]

        # or, if you want only a line break:
        # return match[1] + match[3] + '<br/>' + ' — ' + match[4]

NEW VERSION TO BE USED:

Code:

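def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    import regex
    # True when group 3 itself begins with whitespace followed by an em dash
    abort_if_starting_with_emdash = regex.match('\\s—', match[3])

    if match[2] or abort_if_starting_with_emdash:
        # Leave the text untouched (match[0] is the whole selection)
        return match[0]

    else:
        # Close the paragraph after group 3 and open a new one starting with " — «"
        return match[1] + match[3] + '</span></p>\n\n  ' + match[1] + ' — ' + match[4]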

To execute this function, select "regex-function" in the drop-down (instead of "regex"), click on Create/Edit and paste the text of the function.
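
If you want to try the logic before running it on a book, here is a minimal sketch that exercises it outside calibre. It is only an approximation: Python's standard re module replaces the regex module, and the helper run() passes placeholder values for the arguments that calibre would normally supply. Both choices, and the names PATTERN, HEAD and run, are assumptions for illustration.

```python
import re

# Same pattern as in the post, compiled with the standard `re` module
# instead of calibre's engine (an assumption for this sketch).
PATTERN = re.compile(
    r'(<p class="calibre8"> <span class="calibre3">)'
    r'(\s—\s«.+?,\s»\s)?'
    r'([^.]+.) («)'
)

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Same logic as the new version, with `re` instead of `regex`.
    abort_if_starting_with_emdash = re.match(r'\s—', match[3])
    if match[2] or abort_if_starting_with_emdash:
        return match[0]
    return match[1] + match[3] + '</span></p>\n\n  ' + match[1] + ' — ' + match[4]

def run(text):
    # Hypothetical helper: calls replace() with dummy values for the
    # arguments calibre would pass (match number, file name, and so on).
    counter = 0

    def wrapper(m):
        nonlocal counter
        counter += 1
        return replace(m, counter, "sample.html", None, {}, {}, {})

    return PATTERN.sub(wrapper, text)

HEAD = '<p class="calibre8"> <span class="calibre3">'

# 1. No dialog at the start: the paragraph is split.
split_me = HEAD + "He came through the door. « I'm here », he said.</span> </p>"
# 2. Starts with " — « ..., » ": group 2 is captured, nothing changes.
keep_me = HEAD + " — « Sentence ending with a comma, » said the man. « Then a second part. »</span> </p>"
# 3. Starts with an em dash but the comma follows the quote: group 2 is
#    missing, yet group 3 starts with " —", so nothing changes either.
keep_me_too = HEAD + " — « I'm here », he said. « Come in. »</span> </p>"

for sample in (split_me, keep_me, keep_me_too):
    print(run(sample))
    print("-----")
```

The third sample is the kind of case the em dash test in the new version guards against: the regex selects the paragraph without a group 2, but the paragraph is already a dialog line.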

You've got the idea; I guess you will be able to adapt it if your needs are slightly different.
Since you seem to need some complex substitutions, I suggest you find a tutorial on regexes; there is quite a good one in the help on the calibre site. The site whose URL I gave in my first message is a reference, not a tutorial, so it is not meant for learning the basics.
The site https://regex101.com may help you to build your regexes (select PCRE as the flavor).

Last edited by lomkiri; 02-27-2024 at 11:11 AM. Reason: Correction of the function