02-16-2024, 03:21 PM | #1 |
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2016
Location: France
Device: Kindle PaperWhite
|
Find a set of char and avoid the same set up to the sentence beginning.
Hello, I am stuck and can't find a way into this. I don't know how to write a regex (in Calibre) that finds the first occurrence of a set of four characters, but skips that set when a certain string is at the beginning of the sentence.
I have two kinds of sentence. I am looking for the four characters . « that is, dot, space, «, space.
Model#1 is:
<p class="calibre21"> <span class="calibre17"> — « This dialogue starts. » blah blah said the man. « Additional dialogue. »</span> </p>
Model#2 is:
<p class="calibre21"> <span class="calibre17"> Blah blah, few words. « Short additional dialogue starts. »</span> </p>
In model#1, how do I avoid finding the set in the second part, . « Additional dialogue (because there is "> — « at the beginning of the sentence, after <p class="calibre21"> <span class="calibre17">)?
In model#2, how do I find the . « Short additional ... (because there is NOT "> — « at the beginning of the sentence)?
I hope I am clear. Thank you for the tips. Best regards. |
02-17-2024, 04:45 AM | #2 |
The Grand Mouse 高貴的老鼠
Posts: 71,510
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Am I right in thinking that what you really want is the first «space in a paragraph and not any subsequent occurrence of that set of characters?
And what do you want to do with it once you've found it? Just find it, or do you want to do some replacement as well? |
02-17-2024, 12:18 PM | #3 |
Zealot
Posts: 136
Karma: 1000102
Join Date: Jul 2021
Device: N/A
|
You have to use a negative lookbehind (see this site, for example)
Code:
(?<!">\s—\s«\s.+?)(\.\s«\s)
Explanation:
(?<! expr) ==> negative lookbehind (the match fails if "expr" is found)
">\s—\s«\s.+? ==> if this is found, the expression that follows the group is not selected; .+? means a run of any characters (at least one) up to the next expression (not greedy)
(\.\s«\s) ==> the expression you're looking for (it is group 1)
Edit: In the search options, "Dot all" must be unchecked.
Last edited by lomkiri; 02-18-2024 at 08:31 AM. |
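For readers who want to try the idea outside Calibre, here is a minimal sketch in Python. Python's built-in `re` module rejects variable-width lookbehind, so this sketch swaps in a different technique: it finds every `. « ` and then tests the text before it on the same line for the guard string. The function name `find_unguarded` and the sample strings are assumptions made for illustration, and the sketch assumes each paragraph sits on one line.

```python
import re

# The four characters being searched for: dot, space, «, space.
TARGET = re.compile(r'\.\s«\s')
# What the lookbehind rejects: "> — « followed by at least one more character.
GUARD = re.compile(r'">\s—\s«\s.')

def find_unguarded(line):
    """Return start offsets of '. « ' that are not preceded, on this line, by '"> — « '."""
    hits = []
    for m in TARGET.finditer(line):
        # Only the text before the candidate match is checked for the guard.
        if not GUARD.search(line[:m.start()]):
            hits.append(m.start())
    return hits

model1 = ('<p class="calibre21"> <span class="calibre17"> — « This dialogue starts. » '
          'blah blah said the man. « Additional dialogue. »</span> </p>')
model2 = ('<p class="calibre21"> <span class="calibre17"> Blah blah, few words. '
          '« Short additional dialogue starts. »</span> </p>')

print(find_unguarded(model1))  # nothing to find in model#1
print(find_unguarded(model2))  # one hit in model#2
```

Model#1 yields no hit because the paragraph opens with the dash and quote; model#2 yields exactly one.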
02-25-2024, 12:13 PM | #4 |
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2016
Location: France
Device: Kindle PaperWhite
|
Hello pdurrant and lomkiri,
Thank you for your help. I appreciate your good tutorial on negative lookbehind, which works fine to find space«space:
(?<!">\s—\s«\s.+?)(\.\s«\s)
It is a great help, as the site is a bit difficult to get into.
In my sentences I have two space«space: one at the beginning of the sentence, one later, at any place. It's the dialogue of the same character in two sentences:
« Blahblah said the man. « Blahblah again.
The first occurrence has two HTML marks before it,
<p class="calibre8"> <span class="calibre3"> space«space
to be changed into
<p class="calibre8"> <span class="calibre3"> — space«space
The second space«space has no HTML mark, just a dotspace«space.
It's an opening dialogue dash for one character. I want to add an em dash ( — ) to the first occurrence, <p class="calibre8"> <span class="calibre3"> space«space. I got that with a regex in a first pass. No problem.
A typical dialogue sentence after the first regex pass is
<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man.space«spaceThen a second part of the blahblah dotspace»</span> </p>
This is the second . « (dotspace«space), which I don't want the regex to grab if there is a dash at the beginning.
How do I find:
IF <p class="calibre8"> <span class="calibre3"> NO EM DASH at the beginning of the sentence with a text of variable length, said the man AND dot NO EM DASH space«space, i.e. (dotspace«space) Second part of the blahblah.dotspace»spec</span> </p>
Then I will insert a CR (Carriage Return).
I stubbornly tried to find the solution with your tutorial, but I'm stuck; it's a kind of lark's mirror. Sorry for my delayed answer. The bold and increased size are for easy reading, no bad mood on my side. Best regards. |
02-25-2024, 12:34 PM | #5 |
The Grand Mouse 高貴的老鼠
Posts: 71,510
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
So you want to find things like
<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man. « Then a second part of the blahblah dot »</span> </p>
and turn them into
<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man.</span> </p>
<p class="calibre8"> <span class="calibre3"> « Then a second part of the blahblah dot »</span> </p>
or into
<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man.<br /> « Then a second part of the blahblah dot »</span> </p>
The search term is the same either way:
(<p class="calibre8"> <span class="calibre3"> — «[^»]+»,.*?\.)( «)
(Well, assuming that all your </p> are followed by a new line; otherwise the .*? part of the pattern could run into the next paragraph.)
The replacement term is just the first capture group, then what you want to insert, then the second capture group, e.g.
\1<br />\2
Last edited by pdurrant; 02-25-2024 at 12:37 PM. |
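The search and replacement terms in this post can be tried outside Calibre with a short Python sketch. It uses the standard `re` module, whose replacement syntax also accepts `\1` and `\2`; the helper name `add_line_break` and the sample strings are assumptions made for illustration.

```python
import re

# Search term from the post: group 1 runs up to the full stop, group 2 is the " «" that follows.
SEARCH = re.compile(
    r'(<p class="calibre8"> <span class="calibre3"> — «[^»]+»,.*?\.)( «)'
)
# Replacement term: first group, the inserted break, second group.
REPLACE = r'\1<br />\2'

def add_line_break(html):
    """Insert <br /> before the second quote of a paragraph that opens with ' — «'."""
    return SEARCH.sub(REPLACE, html)

TAGS = '<p class="calibre8"> <span class="calibre3">'
sample = (TAGS + ' — « There is a long sentence ending with a comma », said the man.'
          ' « Then a second part of the blahblah dot »</span> </p>')
other = TAGS + "He came through the door. « I'm here », he said.</span> </p>"

print(add_line_break(sample))
print(add_line_break(other))  # no leading dash, so nothing changes
```

Because `.` does not match a newline by default in `re`, the `.*?` part stays within one line, which mirrors the caveat about `</p>` being followed by a new line.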
02-25-2024, 04:33 PM | #6 | |
Zealot
Posts: 136
Karma: 1000102
Join Date: Jul 2021
Device: N/A
|
Quote:
So since the expression that prevents the replacement is a group of 2 chars (i.e. <space>—), I'm afraid we need a negative lookbehind once more. The thing is that a space at the beginning of a paragraph is useless, is not displayed, and is not recommended. If there weren't any, it would have been easier. Anyway, given the way Reinsley displayed his sentences, I would propose this:
Code:
(<p class="calibre8"> <span class="calibre3">)(?<! — )(«[^»]+»,.*?\.) (« )
Replace:
\1\2</span></p>\n\n \1\3 (if you want a new paragraph)
\1\2<br/>\3 (if you just want a line break)
If you just want a <br/>, not a new <p>, there is a lighter variant (lighter in terms of RAM) with only one capturing group:
Code:
(?:<p class="calibre8"> <span class="calibre3">)(?<! — )(?:«[^»]+»,.*?\.)\K (« )
Replace: <br/>\1
(?:<p class="calibre8"> <span class="calibre3">) --> (?:expr) is a non-capturing group
(?<! — ) --> Negative lookbehind: the regex fails if this is found
(?:«[^»]+»,.*?\.) --> Non-capturing group (the rest of the sentence)
\K --> Forget everything before \K, and start the part to replace here
<space>(« ) --> <space> and group 1 (the space is outside the capturing group, so it won't be retained)
Last edited by lomkiri; 02-25-2024 at 04:43 PM. |
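A rough way to check the three-group version is a short Python sketch. Python's standard `re` module has no `\K`, so only the capture-group form with the line-break replacement is shown; the helper name `break_second_quote` and the two sample paragraphs are assumptions made for illustration.

```python
import re

# Three-group search from the post; (?<! — ) is fixed-width, so the standard re module accepts it.
SEARCH = re.compile(
    r'(<p class="calibre8"> <span class="calibre3">)(?<! — )(«[^»]+»,.*?\.) (« )'
)
# Line-break replacement: the space between groups 2 and 3 is dropped.
LINE_BREAK = r'\1\2<br/>\3'

def break_second_quote(html):
    """Put a <br/> before the second quote when the paragraph opens directly with «."""
    return SEARCH.sub(LINE_BREAK, html)

TAGS = '<p class="calibre8"> <span class="calibre3">'
no_dash = TAGS + '« A long sentence », said the man. « Then a second part. »</span> </p>'
with_dash = TAGS + ' — « A long sentence », said the man. « Then a second part. »</span> </p>'

print(break_second_quote(no_dash))    # gets the <br/>
print(break_second_quote(with_dash))  # left alone: « does not directly follow the span tag
```

In this form the paragraph with the dash is skipped simply because « must come immediately after the span tag for the pattern to match.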
|
02-26-2024, 05:13 AM | #7 |
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2016
Location: France
Device: Kindle PaperWhite
|
Gents, thank you for your patience.
My explanation was tortuous. I am fighting against .space«space, which must be modified in some cases and not in others. It must be skipped IF an em dash comes right after the two HTML marks, AND found IF it comes after a descriptive sentence. Your negative lookbehind examples are useful; the academic theory on the help site is not easy to grasp. I'm going to mix them to fix the problem, if possible.
Two dialogue structures:
struc#1 (no detection of .space«space)
<p class="calibre8"> <span class="calibre3"> — « There is a sentence ending with a comma », said the man.space«spaceThen a second sentence of the same character blahblahspace».</span> </p>
struc#1 comments: <p class= > <span class= > emdashspace«space is the first part of the dialogue, THEN there is .space«space for the second part of the dialogue of the same character. No need to find this occurrence (I mean .space«space, because the same character is speaking and adds a few words).
struc#2 (detection of .space«space is needed)
<p class="calibre8"> <span class="calibre3"> There is a long descriptive sentence, of variable length.space«spaceThen the dialogue starts blahblah and ends withspace».</span> </p>
struc#2 comments: The new character does not speak at once; there is a long description of the scene. Then the character speaks. The em dash is missing before .space«space at the start of the dialogue. I need to grab this one to add a CR and an em dash. |
02-26-2024, 05:58 AM | #8 | |
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2016
Location: France
Device: Kindle PaperWhite
|
Quote:
I can do a simple search and replace to remove the space after the html mark. Best regards |
|
02-26-2024, 06:14 AM | #9 | |
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2016
Location: France
Device: Kindle PaperWhite
|
Quote:
If an em dash comes right after the two HTML marks <p class="> <span class=">, I skip any dotspace«space deeper in the sentence, up to the closing HTML marks. Sorry for my poor description. I hope this is clearer. Best regards |
|
02-26-2024, 10:26 AM | #10 |
Well trained by Cats
Posts: 29,809
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
CRs (carriage returns) are not proper HTML for controlling the visual layout.
You want <br /> <<< to break the line, or simply split your content into 2 paragraphs.
Code:
<h2>A way too lengthy <br />text to fit on a single line</h2> |
02-26-2024, 11:01 AM | #11 |
The Grand Mouse 高貴的老鼠
Posts: 71,510
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Perhaps it would be simpler just to give two or three actual examples of the paragraphs you want to find, and don't want to find, and what you want to turn it into.
Since (as theducks says) a simple CR/LF in the HTML is most definitely NOT what you want, as it will do nothing when displayed, you might want to show us what it should look like when displayed, without markup. |
02-26-2024, 11:08 AM | #12 |
The Grand Mouse 高貴的老鼠
Posts: 71,510
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
But I've read your messages again. I think you mean that, if a paragraph starts with
— «
then you don't want to make any changes in that paragraph. However, if it doesn't start with those characters, and later in the paragraph there is
. «
then you want to break the paragraph after the full stop, and have a new paragraph, with a — inserted at the start?
So
— « I'm speaking », said the man. « don't interrupt! »
should be left alone, but
He came through the door. « I'm here », he said.
should be changed into
He came through the door.
— « I'm here », he said.
Is that right? |
02-26-2024, 12:06 PM | #13 | |
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2016
Location: France
Device: Kindle PaperWhite
|
Quote:
Exactly, perfect. I want to highlight only the part in bold, .space«space, in your example: [...]door. « I'm here », he said. And break it as you did. To split the sentence into 2 paragraphs, I replace it with .</span> </p> <p class="calibre8"> <span class="calibre3"> . — « I'm here », he said.
At the moment I do the find/replace manually, picking each .space«space; it is not glorious, and it's about 600 changes for one book. It's a series of 12 books. The space is sometimes an nbsp, never in the same place, so I have to do a second pass of about 100 changes. A regex solution would be elegant and a good way to learn something new. Two or three regex passes would easily cover all the cases. |
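On the nbsp point, a small Python sketch may help show what `\s` covers. This is an illustration and not something taken from the thread: in Python 3 string patterns, `\s` matches the non-breaking space character U+00A0, but it does not match the literal entity text `&nbsp;`, which needs its own alternative. The names `GAP` and `TARGET` and the sample strings are assumptions.

```python
import re

# One "space" in the book source: any whitespace character, or the literal entity text.
GAP = r'(?:\s|&nbsp;)'
TARGET = re.compile(r'\.' + GAP + '«' + GAP)

plain = "He came through the door. « I'm here », he said."
nbsp_char = "He came through the door.\u00a0«\u00a0I'm here », he said."
nbsp_entity = "He came through the door.&nbsp;«&nbsp;I'm here », he said."

for text in (plain, nbsp_char, nbsp_entity):
    print(bool(TARGET.search(text)))

# A bare \s handles the U+00A0 character but not the entity text.
print(bool(re.search(r'\.\s«\s', nbsp_char)))
print(bool(re.search(r'\.\s«\s', nbsp_entity)))
```

If the book source stores the non-breaking space as an entity, a single pass with a pattern like `GAP` could replace the separate second pass.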
|
02-26-2024, 12:21 PM | #14 | |
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2016
Location: France
Device: Kindle PaperWhite
|
Quote:
Thank you for your advice. I split the sentence into two new paragraphs; it keeps a better structure. So I replace the selection with .</span> </p> <p class="calibre8"> <span class="calibre3"> . Best regards |
|
02-27-2024, 02:09 AM | #15 |
Zealot
Posts: 136
Karma: 1000102
Join Date: Jul 2021
Device: N/A
|
Ok, your initial explanation was really unclear. I understood that the sentence you wanted to select was
<p class="calibre8"> <span class="calibre3">« A long sentence », said the man. « Then a second part. »</span> </p>
and I made the regex for it:
(<p class="calibre8"> <span class="calibre3">)(?<! — )(«[^»]+»,.*?\.) (« )
BUT let's see if I really get what you mean. I understand that you want to target this sentence
<p class="calibre8"> <span class="calibre3">He came through the door. « I'm here », he said.</span> </p>
and transform it to
<p class="calibre8"> <span class="calibre3">He came through the door.</span></p>
<p class="calibre8"> <span class="calibre3"> — « I'm here », he said.</span> </p>
but not this one:
<p class="calibre8"> <span class="calibre3"> — « Sentence ending with a comma, » said the man. « Then a second part. »</span> </p>
Is that OK? (As pdurrant said, you should have made your request this way, with examples and counter-examples; it's much easier to understand.)
In that case, I'll do it a little differently, with the help of a regex-function. The regex will be:
Code:
(<p class="calibre8"> <span class="calibre3">)(\s—\s«.+?,\s»\s)?([^.]+\.) («)
group 1: <p class="calibre8"> <span class="calibre3">
group 2: \s—\s« bla,\s»\s (note: I assume that this dialog must end with ", »"; if it is " »," it won't be selected in group 2 and the paragraph will be split. If you don't want this (i.e. a mandatory comma before the quote), change the regex accordingly.)
group 3: the sentence before the next quote (whether group 2 exists or not)
not in any group (won't be kept if we split): <space>
group 4: «
Group 2 can be missing since it is in the form (expr)?. If it is missing, match[2] will be empty in the function, so this value is tested to know whether the line has to be split (group 2 empty) or not.
You said that you may have some &nbsp;, so I put \s instead of <space>; it matches all types of spaces.
"Dot all" must be unchecked (in French: Le point correspond à tout).
OLD CODE, DON'T USE IT: (see why in my next post)
The function, self-explanatory (comments begin with #), is:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match[2]:  # match[2] (the group 2) is " — « blabla, » "
        # If we have a group 2 (match[2] not empty) don't do anything (match[0] is the whole selection)
        return match[0]
    else:
        # We don't have a group 2, so the paragraph must be split
        # (the paragraph was selected by the regex, so we have a dialog in it)
        # match[1] is the html code for the beginning of the paragraph
        return match[1] + match[3] + '</span></p>\n\n ' + match[1] + ' — ' + match[4]
        # or, if you want only a line break:
        # return match[1] + match[3] + '<br/>' + ' — ' + match[4]
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    import regex
    # Also leave the paragraph alone when group 3 itself starts with an em dash
    abort_if_starting_with_emdash = regex.match('\\s—', match[3])
    if match[2] or abort_if_starting_with_emdash:
        return match[0]
    else:
        return match[1] + match[3] + '</span></p>\n\n ' + match[1] + ' — ' + match[4]
You've got the idea; I guess you will be able to adapt it if you have slightly different needs. Since you seem to need some complex substitutions, I guess you should find a tutorial on using regexes; there is quite a good one in the help on the calibre site. The site whose URL I gave in my first message is a reference, not a tutorial; it is not for learning the basics. The site https://regex101.com may help you to construct your regexes (select PCRE as the flavor).
Last edited by lomkiri; 02-27-2024 at 10:11 AM. Reason: Correction of the function |
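To try the function's logic outside Calibre, a small test harness along these lines may help. It is a sketch under stated assumptions: it uses Python's standard `re` module in place of the `regex` module, it assumes the final dot of group 3 is escaped (`\.`) so that the split only happens after a full stop, and the names `split_or_keep` and `fix_paragraphs` are hypothetical. Inside Calibre the function keeps the longer `replace(match, number, ...)` signature shown in the post.

```python
import re

# Four-group search: tags, optional " — « ..., » ", the sentence up to a full stop, then «.
PATTERN = re.compile(
    r'(<p class="calibre8"> <span class="calibre3">)(\s—\s«.+?,\s»\s)?([^.]+\.) («)'
)

def split_or_keep(match):
    """Same decision as the corrected function: keep dash paragraphs, split the others."""
    starts_with_emdash = re.match(r'\s—', match[3])
    if match[2] or starts_with_emdash:
        return match[0]  # paragraph already opens with a dialogue dash
    # Close the first paragraph, then reopen one that starts with " — «".
    return match[1] + match[3] + '</span></p>\n\n ' + match[1] + ' — ' + match[4]

def fix_paragraphs(html):
    return PATTERN.sub(split_or_keep, html)

P = '<p class="calibre8"> <span class="calibre3">'
target = P + "He came through the door. « I'm here », he said.</span> </p>"
keep_comma_inside = P + ' — « Sentence ending with a comma, » said the man. « Then a second part. »</span> </p>'
keep_comma_outside = P + ' — « A long sentence », said the man. « Then a second part. »</span> </p>'

print(fix_paragraphs(target))
print(fix_paragraphs(keep_comma_inside) == keep_comma_inside)
print(fix_paragraphs(keep_comma_outside) == keep_comma_outside)
```

The third sample is the case the em dash test was added for: group 2 does not match when the comma follows the closing quote, so without that test the paragraph would be split.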
|