Old 02-16-2024, 03:21 PM   #1
reinsley
Connoisseur
Posts: 65
Karma: 10
Join Date: Dec 2016
Location: France
Device: Kindle PaperWhite
Find a set of char and avoid the same set up to the sentence beginning.


Hello,

I am stuck and can't find a way in. I don't know how to write a regex (in Calibre) that finds the first occurrence of a set of four characters, but skips that set when a certain string is at the beginning of the sentence.
I have two kinds of sentences. I am looking for the four characters . « that is: dot, space, «, space (.space«space)

Model#1 is :
<p class="calibre21"> <span class="calibre17"> — « This dialogue starts. » blah blah said the man. « Additional dialogue. »</span> </p>

Model#2 is :
<p class="calibre21"> <span class="calibre17"> Blah blah, few words. « Short additional dialogue starts. »</span> </p>


In model#1, how do I avoid finding the set in the second part, . « Additional dialogue (because there is "> — « at the beginning of the sentence, after <p class="calibre21"> <span class="calibre17">)?
In model#2, how do I find the . « Short additional ... (because there is NOT "> — « at the beginning of the sentence)?

Hope I am clear. Thank you for the tips. Best regards.
Old 02-17-2024, 04:45 AM   #2
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,510
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Am I right in thinking that what you really want is the first «space in a paragraph and not any subsequent occurrence of that set of characters?

And what do you want to do with it once you've found it? Just find it, or do you want to do some replacement as well?
Old 02-17-2024, 12:18 PM   #3
lomkiri
Zealot
Posts: 136
Karma: 1000102
Join Date: Jul 2021
Device: N/A
You have to use a negative lookbehind (see this site, for example)

Code:
(?<!">\s—\s«\s.+?)(\.\s«\s)
Your selection is in group 1 (i.e. \1), if you want to make a substitution.

Explanation:
(?<!expr) ==> negative lookbehind (the match fails if "expr" is found just before)
">\s—\s«\s.+? ==> if this is found, the expression that follows the lookbehind is not selected;
here .+? means any characters (at least one) up to the next expression (not greedy)
(\.\s«\s) ==> the expression you're looking for (it is group 1)

Edit: In the search options, "Dot all" must be unchecked
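
For anyone who wants to try the idea outside calibre: the pattern above uses a variable-length lookbehind, which Python's built-in re module rejects. A minimal sketch of the same logic with plain re, assuming one paragraph per string. The helper name find_targets and the two sample strings (copied from the models in post #1) are only for illustration:

```python
import re

TARGET = re.compile(r'\.\s«\s')               # the ". « " we want to select
DASH_OPENER = re.compile(r'">\s—\s«\s.+\Z')   # '"> — « ' followed by at least one char

def find_targets(paragraph):
    """Return the start offsets of '. « ' that are not preceded by a '"> — « ' opener."""
    hits = []
    for m in TARGET.finditer(paragraph):
        before = paragraph[:m.start()]
        # Emulates the negative lookbehind: reject the candidate when the opener
        # plus some text ends exactly where the candidate begins.
        if not DASH_OPENER.search(before):
            hits.append(m.start())
    return hits

model1 = ('<p class="calibre21"> <span class="calibre17"> — « This dialogue starts. » '
          'blah blah said the man. « Additional dialogue. »</span> </p>')
model2 = ('<p class="calibre21"> <span class="calibre17"> Blah blah, few words. '
          '« Short additional dialogue starts. »</span> </p>')

print(find_targets(model1))   # nothing selected in model#1
print(find_targets(model2))   # one offset, pointing at ". « " in model#2
```

Anchoring the opener with \Z against the text before each candidate is what stands in for the lookbehind here; it is a workaround for plain re, not what calibre does internally.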

Last edited by lomkiri; 02-18-2024 at 08:31 AM.
Old 02-25-2024, 12:13 PM   #4
reinsley
Hello pdurrant and lomkiri,


Thank you for your help.

I appreciate your good tutorial on negative lookbehind. It works fine to find space«space: (?<!">\s—\s«\s.+?)(\.\s«\s). It is a great help, as the reference site is a bit difficult to get into.

In my sentences I have two space«space: one at the beginning of the sentence, one later at any place. It's the dialogue of the same character in two sentences: « Blahblah said the man. « Blahblah again.

The first occurrence has two HTML tags before it: <p class="calibre8"> <span class="calibre3"> space«space, to be changed into <p class="calibre8"> <span class="calibre3"> — space«space
The second space«space has no HTML tag before it, just a dot: dotspace«space

It's an opening dialogue dash for one character. I want to add an em dash (&#x2014;) to the first occurrence, <p class="calibre8"> <span class="calibre3"> space«space. I have done that with a regex in a first pass. No problem.


A typical dialogue sentence after the first regex pass is <p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man.space«spaceThen a second part of the blahblah dotspace»</span> </p>
This is the second . « (dotspace«space), which I don't want the regex to grab if there is a dash at the beginning.

How to find: IF there is NO EM DASH after <p class="calibre8"> <span class="calibre3"> at the beginning of the sentence, which has text of variable length (said the man), AND then a dot with NO EM DASH followed by space«space, i.e. (dotspace«space) Second part of the blahblah.dotspace»space</span> </p>

Then I will insert a CR (Carriage Return).

I stubbornly tried to find the solution with your tutorial but I'm stuck, it's a kind of lark's mirror. Sorry for my delayed answer.

The bold and increased size are for easy reading, no bad mood on my side.

Best regards.
Old 02-25-2024, 12:34 PM   #5
pdurrant
So you want to find things like
<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man. « Then a second part of the blahblah dot »</span> </p>

and turn them into

<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man.</span> </p>
<p class="calibre8"> <span class="calibre3"> « Then a second part of the blahblah dot »</span> </p>



or into

<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man.<br /> « Then a second part of the blahblah dot »</span> </p>

The search term is the same either way.

(<p class="calibre8"> <span class="calibre3"> — «[^»]+»,.*?\.)( «)


(Well, assuming that all your </p> are followed by a new line, otherwise the .*? part of the pattern could run into the next paragraph)

And the replacement term is just the first capture group, what you want to insert, and then the second capture group, e.g.

\1<br />\2
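
As a quick check outside calibre, the search and replacement terms above can be run through Python's built-in re module. This is only a sketch: the function name add_line_break is made up for the example, and the sample string is the paragraph quoted at the top of this post.

```python
import re

# Search term from the post, unchanged
PATTERN = re.compile(r'(<p class="calibre8"> <span class="calibre3"> — «[^»]+»,.*?\.)( «)')

def add_line_break(html):
    # Replacement term from the post: group 1, the inserted <br />, then group 2
    return PATTERN.sub(r'\1<br />\2', html)

sample = ('<p class="calibre8"> <span class="calibre3"> — « There is a long sentence '
          'ending with a comma », said the man. « Then a second part of the blahblah dot »'
          '</span> </p>')

print(add_line_break(sample))
```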

Last edited by pdurrant; 02-25-2024 at 12:37 PM.
Old 02-25-2024, 04:33 PM   #6
lomkiri
Quote:
Originally Posted by pdurrant View Post
So you want to find things like
<p class="calibre8"> <span class="calibre3"> — « […]
I think the OP wants the opposite: he wants to target the paragraph when the sentence doesn't begin with <space><em-dash> (really, it was a bit confusing, I hope I understood correctly).

So since the expression that should prevent the replacement is a group of 2 chars (i.e. <space>—), I'm afraid we need a negative lookbehind once more.

The thing is that a space at the beginning of the paragraph is useless, is not displayed, and is not recommended. If there weren't any, it would have been easier.

Anyway, given the way Reinsley displayed his sentences, I would propose this:
Code:
(<p class="calibre8"> <span class="calibre3">)(?<! — )(«[^»]+»,.*?\.) (« )
Replace:
\1\2</span></p>\n\n  \1\3 (if you want a new paragraph)
\1\2<br/>\3 (if you just want a line break)
(Note: to get rid of the space at the beginning of the 2nd line, I left it outside of all groups)

If you just want a <br/>, not a new <p>, there is a lighter variant (lighter in terms of RAM) with only one capturing group :
Code:
(?:<p class="calibre8"> <span class="calibre3">)(?<! — )(?:«[^»]+»,.*?\.)\K (« )
replace: <br/>\1
Explanation :
(?:<p class="calibre8"> <span class="calibre3">) --> (?:expr) is a non-capturing group
(?<! — ) --> Negative lookbehind: the regex fails if found
(?:«[^»]+»,.*?\.) --> Non-capturing group (the rest of the sentence)
\K --> Forget all that is before \K, and start the part to replace here
<space>(« ) --> <space> and group 1 (the space is out of the capturing group so it won't be retained)

Last edited by lomkiri; 02-25-2024 at 04:43 PM.
Old 02-26-2024, 05:13 AM   #7
reinsley
Gents, thank you for your patience.

My explanation was tortuous.
I am fighting against .space«space, which must be modified in some cases and not in others. It must be skipped IF an em dash comes right after the two HTML tags, AND found IF it comes after a descriptive sentence.
Your negative lookbehind examples are useful; the academic theory on the help site is not easy to grasp. I'm going to combine them to fix the problem, if possible.

Two dialogue structures :

struc#1 ( no detection of .space«space )
<p class="calibre8"> <span class="calibre3"> — « There is a sentence ending with a comma », said the man.space«spaceThen a second sentence of the same character blahblahspace».</span> </p>

struc#1 comments: <p class= > <span class= > emdashspace«space is the first part of the dialogue, THEN there is .space«space for the second part of the dialogue of the same character. No need to find this occurrence (I mean .space«space), because the same character is speaking and adds a few words.



struc#2 ( detection of .space«space is needed )
<p class="calibre8"> <span class="calibre3"> There is a long descriptive sentence, of variable length.space«spaceThen the dialogue starts blahblah and ends withspace».</span> </p>

struc#2 comments: The new character does not speak at once; there is a long description of the scene. Then the character speaks. The em dash is missing before .space«space at the start of the dialogue.
I need to grab this one to add a CR and an em dash.
Old 02-26-2024, 05:58 AM   #8
reinsley
Quote:
Originally Posted by lomkiri View Post
I think the OP wants the opposite: he wants to target the paragraph when the sentence doesn't begin with <space><em-dash> (really, it was a bit confusing, I hope I understood correctly).

So since the expression that should prevent the replacement is a group of 2 chars (i.e. <space>—), I'm afraid we need a negative lookbehind once more.

The thing is that a space at the beginning of the paragraph is useless, is not displayed, and is not recommended. If there weren't any, it would have been easier.
Correct: if the sentence starts with <p class> <span class><space><em-dash>, I do not want to detect the next dotspace«space in the second part of the dialogue.

I can do a simple search and replace to remove the space after the HTML tag.

Best regards
Old 02-26-2024, 06:14 AM   #9
reinsley
Quote:
Originally Posted by pdurrant View Post
So you want to find things like
<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man. « Then a second part of the blahblah dot »</span> </p>

and turn them into

<p class="calibre8"> <span class="calibre3"> — « There is a long sentence ending with a comma », said the man.</span> </p>
<p class="calibre8"> <span class="calibre3"> « Then a second part of the blahblah dot »</span> </p>


No. IF a descriptive sentence starts after the two HTML tags <p class="> <span class="> AND I find dotspace«space later in the sentence, I want to grab it to insert a CR and an em dash.

IF an em dash starts after the two HTML tags <p class="> <span class=">, I skip any dotspace«space deeper in the sentence, up to the closing HTML tags.

Sorry for my poor description. Hope it's better.
Best regards
Old 02-26-2024, 10:26 AM   #10
theducks
Well trained by Cats
Posts: 29,809
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
CRs (carriage returns) are not proper HTML: a coder cannot use them to affect the visual result.

You want <br /> <<< break the line
or
simply make your content into 2 paragraphs

Code:
<h2>A way too lengthy <br />text to fit on a single line</h2>
A way too lengthy
text to fit on a single line
Old 02-26-2024, 11:01 AM   #11
pdurrant
Perhaps it would be simpler just to give two or three actual examples of the paragraphs you want to find, and don't want to find, and what you want to turn it into.

Since (as theducks says) a simple CR/LF in the HTML is most definitely NOT what you want, as it will do nothing when displayed, you might want to show us what it should look like when displayed, without markup.
Old 02-26-2024, 11:08 AM   #12
pdurrant
But I've read your messages again. I think you mean that, if a paragraph starts with

— «

Then you don't want to make any changes in that paragraph. However, if it doesn't start with those characters, and later in the paragraph there is

. «

Then you want to break the paragraph after the full stop, and have a new paragraph, with a — inserted at the start?

so

— « I'm speaking », said the man. « don't interrupt! »

Should be left alone but

He came through the door. « I'm here », he said.

should be changed into

He came through the door.
— « I'm here », he said.


Is that right?
Old 02-26-2024, 12:06 PM   #13
reinsley
Quote:
Originally Posted by pdurrant View Post

so

— « I'm speaking », said the man. « don't interrupt! »

Should be left alone but

He came through the door. « I'm here », he said.

should be changed into

He came through the door.
— « I'm here », he said.


Is that right?

Exactly, perfect. I want to highlight only the part in bold, .space«space, in your example: [...]door. « I'm here », he said.
And break it as you did.
To split the sentence into 2 paragraphs, I replace it with
.</span> </p>
<p class="calibre8"> <span class="calibre3"> .
— «
I'm here », he said.

As I currently do a manual find/replace to choose each .space«space, it is not glorious, and it's about 600 changes for one book. It's a series of 12 books.
The space may sometimes be a nbsp, never at the same place, so I have to do a second pass of about 100 changes. A regex solution would be elegant and a good way to learn something new. Two or three regex passes would easily cover all the cases.
Old 02-26-2024, 12:21 PM   #14
reinsley
Quote:
Originally Posted by theducks View Post
CR (Carriage returns ) are not proper HTML for coder use to affect visual

You want <br /> <<< break the line
or
simply make your content into 2 Paragraphs
Hello theducks,

Thank you for your advice.

I split the sentence into two new paragraphs. It keeps a better structure.
So I replace the selection with
.</span> </p>
<p class="calibre8"> <span class="calibre3"> .

Best regards
Old 02-27-2024, 02:09 AM   #15
lomkiri
OK, your initial explanation was really unclear. I understood that the sentence you wanted to select was
<p class="calibre8"> <span class="calibre3">« A long sentence », said the man. « Then a second part. »</span> </p>
and I made the regex for it
(<p class="calibre8"> <span class="calibre3">)(?<! — )(«[^»]+»,.*?\.) (« )

BUT

Let's see if I really get what you mean: I understand that you want to target this sentence
<p class="calibre8"> <span class="calibre3">He came through the door. « I'm here », he said.</span> </p>

and transform it to
<p class="calibre8"> <span class="calibre3">He came through the door.</span></p>
<p class="calibre8"> <span class="calibre3"> — « I'm here », he said.</span> </p>


but not this one :
<p class="calibre8"> <span class="calibre3"> — « Sentence ending with a comma, » said the man. « Then a second part. »</span> </p>

Is that OK? (As pdurrant said, you should have made your request this way, with examples and counter-examples; it's much easier to understand.)

In that case, I'll do it a little differently, with the help of a regex-function.

The regex will be :

Code:
(<p class="calibre8"> <span class="calibre3">)(\s—\s«.+?,\s»\s)?([^.]+.) («)
Explanation :
group 1 : <p class="calibre8"> <span class="calibre3">
group 2 : \s—\s« bla,\s»\s
(note: I assume that this dialogue must end with ", »"; if it ends with " »," it won't be captured in group 2 and the paragraph will be split. If you don't want this (i.e. a mandatory comma before the quote), change the regex accordingly)
group 3 : The sentence before the next quote (whether group 2 exists or not)
not in any group (won't be kept if we split) : <space>
group 4 : «

Group 2 can be missing since it is in the form (expr)?. If it is missing, match[2] will be empty in the function, so this value is tested to know whether the line has to be split (group 2 empty) or not.

You said that you may have some &nbsp;, so I put \s instead of <space>; it matches all types of spaces
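
A small sketch of what \s does and does not cover, using Python's built-in re module (assuming calibre's engine treats \s the same way): it matches the non-breaking space character U+00A0, but not the literal text &nbsp; when the entity is written out in the HTML source. The second pattern, with an explicit alternative for the entity, is a suggestion and not part of the regex above:

```python
import re

TARGET = re.compile(r'\.\s«\s')
# Hypothetical variant that also accepts the written-out entity
TARGET_WITH_ENTITY = re.compile(r'\.(?:\s|&nbsp;)«(?:\s|&nbsp;)')

plain = 'the door. « I am here'
nbsp_char = 'the door.\u00a0«\u00a0I am here'      # real non-breaking space characters
nbsp_entity = 'the door.&nbsp;«&nbsp;I am here'    # entity spelled out in the source

for text in (plain, nbsp_char, nbsp_entity):
    print(bool(TARGET.search(text)), bool(TARGET_WITH_ENTITY.search(text)))
```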

"Dot all" must be unchecked (in french : Le point correspond à tout)

OLD CODE, DON'T USE IT: (see why in my next post)
The function, self-explanatory (comments begin with #), is:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match[2]:
        # match[2] (the group 2) is " — « blabla, » "
        # If we have a group 2 (match[2] not empty) don't do anything (match[0] is the whole selection)
        return match[0]

    else:
        # We don't have a group 2, so the paragraph must be split
        # (the paragraph was selected by the regex, so we have a dialog in it)
        # match[1] is the html code for the beginning of the paragraph
        return match[1] + match[3] + '</span></p>\n\n  ' + match[1] + ' — ' + match[4]

        # or, if you want only a line break:
        # return match[1] + match[3] + '<br/>' + ' — ' + match[4]
NEW VERSION TO BE USED:
Code:
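def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    import regex
    # True when group 3 itself starts with <space><em-dash>, i.e. the paragraph
    # opens with a dialogue dash that group 2 did not capture
    abort_if_starting_with_emdash = regex.match(r'\s—', match[3])

    if match[2] or abort_if_starting_with_emdash:
        # The paragraph already opens with a dialogue dash: leave it unchanged
        return match[0]

    else:
        # Split the paragraph and start the new one with an em dash
        return match[1] + match[3] + '</span></p>\n\n  ' + match[1] + ' — ' + match[4]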
To execute this function, select "regex-function" in the drop-down (instead of "regex"), click on Create/Edit and paste the text of the function.
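
To try the same logic outside calibre before running it on a book, here is a minimal sketch with Python's built-in re module. The names split_dialogue and process are made up for the example, the callback takes only the match object (calibre passes more arguments), and re.match stands in for regex.match:

```python
import re

# Same regex as in the "Code:" block of this post
PATTERN = re.compile(
    r'(<p class="calibre8"> <span class="calibre3">)(\s—\s«.+?,\s»\s)?([^.]+.) («)'
)

def split_dialogue(match):
    # Group 3 starting with <space><em-dash> means the paragraph already opens a dialogue
    starts_with_emdash = re.match(r'\s—', match[3])
    if match[2] or starts_with_emdash:
        return match[0]   # leave the paragraph untouched
    # Close the first paragraph, open a new one and add the em dash
    return match[1] + match[3] + '</span></p>\n\n  ' + match[1] + ' — ' + match[4]

def process(html):
    return PATTERN.sub(split_dialogue, html)

P = '<p class="calibre8"> <span class="calibre3">'
to_split = P + "He came through the door. « I'm here », he said.</span> </p>"
keep_1 = P + ' — « Sentence ending with a comma, » said the man. « Then a second part. »</span> </p>'
keep_2 = P + " — « I'm speaking », said the man. « don't interrupt! »</span> </p>"

print(process(to_split))
print(process(keep_1) == keep_1, process(keep_2) == keep_2)
```

The third sample is the case the corrected function was written for: the comma sits after the closing quote, so group 2 does not match and only the extra em-dash test keeps the paragraph intact.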

You've got the idea; I guess you will be able to adapt it if you have slightly different needs.
Since you seem to need some complex substitutions, I suggest you find a tutorial on regexes; there is quite a good one in the help on the calibre site. The site whose URL I gave in my first message is a reference, not a tutorial; it is not for learning the basics.
The site https://regex101.com may help you to construct your regexes (select PCRE as the flavor)

Last edited by lomkiri; 02-27-2024 at 10:11 AM. Reason: Correction of the function