find dialogues with missing closing inverted commas

SilvioTO · 04-21-2023, 07:28 AM

I would like to use regex to find the following error in my book:

«dialogue (missing » at the end of phrase).

correct string:
«dialogue»

Tried some code, but I am only able to find text between « and ».
Thanks for your help.

Turtle91 · 04-21-2023, 08:49 AM

You should be able to limit your FIND to a single line using the options drop-down. Then use a negative lookahead to limit it to lines that do NOT have the »

For example:
Find: «(.*?)(?!»)
Replace: «\1»

SilvioTO · 04-21-2023, 10:36 AM

In the options of the Find and Replace section at the center bottom of the Sigil window, I cannot find the single line option.
The «(.*?)(?!») code finds all phrases that start with the « character, with or without ».

Where am I going wrong?
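
A likely explanation for the behaviour described in this post, sketched in Python rather than Sigil: the lazy group (.*?) is allowed to match nothing at all, and the negative lookahead only checks the single character right after it. The sketch assumes Python's re module treats these two constructs the same way as the PCRE engine Sigil uses; the sample strings are made up for illustration.

```python
import re

# The pattern exactly as posted above.
pattern = re.compile(r"«(.*?)(?!»)")

samples = [
    "«dialogue with the close missing",
    "«dialogue»",
]

for text in samples:
    m = pattern.search(text)
    # The lazy group stays empty because the character after « is not »,
    # so the whole match is just the opening guillemet in both cases.
    print(repr(m.group(0)), repr(m.group(1)))
```

Both samples print '«' and '', which is consistent with the pattern hitting every opening guillemet whether or not a close follows.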

Turtle91 · 04-21-2023, 11:47 AM

Sorry, I’m away from my computer, but IIRC, you want to un-check Dot-All.

From the Users guide:

Quote:

DotAll: This regex option prepends (?s) to all regex searches and is used when you want .* to match any character, even across lines.

KevinH · 04-21-2023, 11:48 AM

Uncheck the Regex flag "Dot All". "Dot All" means that a "." will match all characters, including a newline character. By unchecking it, you limit wildcard matching to a single line.

FWIW, using that regex and Sigil's replacement table, you should be able to quickly scan all matches for the few problem cases. If not, try a different regex to limit things even further.

Check out the Sigil Users Guide for more info.

SilvioTO · 04-22-2023, 06:19 AM

Thank you both for your help, you have put me on the right track.
The code «(.*?)(?!») doesn't work even though I unchecked the "DotAll" flag, so my next step will be to study regex syntax more closely in the Sigil manual.

Have a very nice day and... thanks again.

Turtle91 · 04-22-2023, 09:21 AM

That should work.

Can you post an example of the code you are searching through so we can see where the issue might be??

Tex2002ans · 04-22-2023, 09:35 PM

Quote:

Originally Posted by SilvioTO

I would like to use regex to find the following error in my book:

«dialogue (missing » at the end of phrase).

correct string:
«dialogue»

This is the Regex I used for many years:

Search: (“[^”\r\n]*)

Instead of using LEFT/RIGHT DOUBLE QUOTES, substitute whatever quotations you need for your language.

In your case, you'll use left/right guillemets:

Search: («[^»\r\n]*)

This will catch lines 1+3:

Code:

<p>«This is a test.</p>
<p>And this is more «This is a test.» And more.</p>
<p>Testing. «This is a test.</p>
<p>«This is a test.»</p>

but skip 2+4.
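
A quick way to check the claim on the four sample lines, sketched in Python. The pattern as it is displayed in this post has nothing after the capture group; the second pattern below appends an end-of-line anchor ($), which is purely an assumption about what may have been lost from the post, not something the post confirms. Python's re stands in for Sigil's PCRE here.

```python
import re

lines = [
    "<p>«This is a test.</p>",
    "<p>And this is more «This is a test.» And more.</p>",
    "<p>Testing. «This is a test.</p>",
    "<p>«This is a test.»</p>",
]

# Pattern as displayed in the post.
as_shown = re.compile(r"(«[^»\r\n]*)")
# Same pattern with a trailing "$" (assumption, see the note above).
anchored = re.compile(r"(«[^»\r\n]*)$", re.MULTILINE)

def hits(pattern):
    """Return the 1-based numbers of the sample lines the pattern matches."""
    return [n for n, line in enumerate(lines, start=1) if pattern.search(line)]

print(hits(as_shown))  # [1, 2, 3, 4]
print(hits(anchored))  # [1, 3]
```

Only the anchored variant reproduces the "catch 1+3, skip 2+4" behaviour, because the match is then forced to run to the end of the line without meeting a ».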

- - -

Side Note: Nowadays, I use Toxaris's EPUB Tools + "Dialogue Check", which is/was the ultimate way to catch all quotation mark errors. I wrote about it in detail back in:

2016: "How to convert straight quotes to smart 'curly' typographer's quotes"

While the pure regex method works in most cases, it will not work on heavily nested (or mismatched) quotation marks.

To catch all left/right or outer/inner quotation mark errors, you definitely need something smarter that:

goes forwards/backwards.
- Catching really hard cases + paragraphs with multiple sets of dialogue.
handles false positives like apostrophes vs. SINGLE QUOTES.
- Saving tons of time.

For a little more info on that, see my post in:

2016: "Regex Function about «» and “”"

Sadly, Toxaris's EPUB Tools only works in Microsoft Word... and as of a few years ago, Toxaris stopped maintaining it + his site went down.

I did save a copy + attach it to this 2022 post though.

- - -

Quote:

Originally Posted by SilvioTO

Tried some code, but I am only able to find text between « and ».

Heh, I wrote about that in detail too:

2020: "Is there a way to edit EPUB or Mobi files such that the Quotes are in BOLD?"

If you ever needed to catch/tag all dialogue in a book for some reason, there's your answer too.

JSWolf · 04-23-2023, 05:51 AM

Quote:

Originally Posted by Turtle91

You should be able to limit your FIND to a single line using the options drop-down. Then use a negative lookahead to limit it to lines that do NOT have the »

For example:
Find: «(.*?)(?!»)
Replace: «\1»

But what if the dialog is something like «Hello» she said.

You would end up with «Hello she said.»

SilvioTO · 04-23-2023, 06:04 AM

The solution proposed by Tex2002ans works perfectly and is a great help. However, to be perfect, it should be able to detect errors when two sentences are in the same paragraph. In the following example, only the last absent closed guillemet is detected, but not the first one.

Code:

<p>«Have you been here a long time Terry?» ... «About five days» he replied.</p>

Another solution I found is less sophisticated and requires more visual attention, but, in case of errors, it highlights anomalies in the code that easily catch the eye making it useful:

Code:

«[^»]*»

Trivially, it selects all sentences enclosed by two guillemets, even if they occur within the same paragraph.

Perhaps making two passes of the entire text with the two solutions might eliminate errors altogether.
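
The "anomalies that catch the eye" can also be flagged mechanically. The following is only a sketch in Python (not a Sigil feature): it runs the closed-pair pattern and keeps the matches that swallowed a second « or a </p>, which is what happens when a closing guillemet is missing in the middle of a paragraph. The helper name suspicious and the second sample string are made up for illustration.

```python
import re

closed_pair = re.compile(r"«[^»]*»")

def suspicious(text):
    """Return closed-pair matches that contain a second « or a </p>."""
    return [
        m.group(0)
        for m in closed_pair.finditer(text)
        if "«" in m.group(0)[1:] or "</p>" in m.group(0)
    ]

ok = "<p>«Have you been here a long time Terry?» ... «About five days» he replied.</p>"
bad = "<p>«Have you been here a long time Terry? ... «About five days» he replied.</p>"

print(suspicious(ok))   # []
print(suspicious(bad))  # one over-long match starting at the first «
```

A missing close on the last quotation of the text produces no match at all with this pattern, so it cannot replace the first search, only complement it.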

Tex2002ans · 04-23-2023, 05:59 PM

Quote:

Originally Posted by SilvioTO

The solution proposed by Tex2002ans works perfectly and is a great help. However, to be perfect, it should be able to detect errors when two sentences are in the same paragraph.

Then drop whatever follows the capture group at the end of the pattern, leaving just the group:

Find: («[^»\r\n]*)

This will find each open quote and match all the way to a closing quote OR the end of the line:

«Have you been here a long time Terry?» ... «About five days» he replied.

and it will catch stuff like:

«Have you been here a long time Terry?» ... «About five days he replied.
- 2nd close missing.
«Have you been here a long time Terry? ... «About five days» he replied.
- 1st close missing.
«Have you been here a long time Terry? ... «About five days he replied.
- 1st AND 2nd close missing.
- (Although you'd only catch 2nd after correcting the 1st one.)

But now you'll have an absolute TON of false positives to look through...

Quote:

Originally Posted by SilvioTO

Another solution I found is less sophisticated and requires more visual attention, but, in case of errors, it highlights anomalies in the code that easily catch the eye making it useful:

Heh. I'd strongly recommend learning about Sigil's fantastic:

Tools > Saved Searches

This lets you create and save multiple Find/Replaces + organize them into Groups.

As described below, you can come up with quite a few sets of "try to catch missing open/close quotes" Regular Expressions.

Those regular expressions alone will carry you like 90% of the way there... but it's that final 10% that's really troublesome.

Quote:

Originally Posted by SilvioTO

Perhaps making two passes of the entire text with the two solutions might eliminate errors altogether.

No. It would require WAY more than 2 sets/passes of Regular Expressions. You have to:

Deal with outer/inner quotes separately.
Catch 2 lefts in a row.
Catch 2 rights in a row.
Catch 1 open, missing close.
Catch 1 close, missing open.

AND you have to also search forwards/backwards.

Then, you'll want to expand it:

Deal with multiple dialogues in a single paragraph.
- THIS is where Regex is pretty dumb.
- A program can be in an "ON/OFF state" based on each left/right quote + where it is in the paragraph + which stage it is in the process.
(Optional) Deal with long paragraphs of continual dialogue.
- Common in Fiction.
- Each paragraph will only start with opening quotes. There is no close quote until the long monologue is completed.
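
The "ON/OFF state" idea can be sketched in a few lines of Python. This is not how Toxaris's tool works (its internals are not described here), just a minimal illustration; the function name check_paragraph and its messages are made up, and the sketch assumes the opening and closing symbols are different characters.

```python
def check_paragraph(text, open_q="«", close_q="»"):
    """Scan one paragraph and return a list of (index, message) problems."""
    problems = []
    open_at = None  # index of the currently open quote, or None when "OFF"
    for i, ch in enumerate(text):
        if ch == open_q:
            if open_at is not None:
                # A new quote opened while the previous one was still "ON".
                problems.append((open_at, "open quote never closed"))
            open_at = i
        elif ch == close_q:
            if open_at is None:
                problems.append((i, "close quote without an open"))
            open_at = None
    if open_at is not None:
        # The paragraph ended while a quote was still "ON".
        problems.append((open_at, "open quote never closed"))
    return problems

para = "«Have you been here a long time Terry? ... «About five days he replied."
for index, message in check_paragraph(para):
    print(index, message)
```

Unlike a single regex pass, this reports both missing closes in one go. It would also flag the long-monologue paragraphs mentioned above, so those would still need to be skipped or reviewed by hand.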

AND you have to deal with (or skip) false positives. (Is this an apostrophe or a single quotation mark?)

AND then you have to deal with all the potential HTML mess in the way too. (, , , class="", [...])

AND then you have to handle inner/outer quotes of all the different languages:

« » = French
“ ” + ‘ ’ = English (US)
‘ ’ + “ ” = English (UK)
„ “ = German

will not work for:

» « = Danish
” ” + ’ ’ = Swedish
„ ” = Hungarian

(That's not a complete list of steps though, but only what I can quickly think of off the top of my head!)

I went into much more detail in the linked posts/threads...

- - -

Toxaris already solved all those issues in his "Dialogue Check" though.

What you would need is a smart program—not just pure regex—that handles outer/inner quotes, and lets you select between them all as needed.

The logic is effectively the same for all languages, just that the symbols switch.
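
As a small illustration of "same logic, different symbols", here is a Python sketch that builds the "open quote with no close before the end of the line" search from a table of quote pairs. The pairs come from the list earlier in this post; the names QUOTE_PAIRS and unclosed_pattern, the end-of-line anchor, and the German test sentences are assumptions made for the example.

```python
import re

# Outer quote pairs from the list above; the keys are only labels.
QUOTE_PAIRS = {
    "French": ("«", "»"),
    "English (US)": ("“", "”"),
    "German": ("„", "“"),
}

def unclosed_pattern(language):
    """Build the 'open quote, no close before end of line' regex."""
    open_q, close_q = QUOTE_PAIRS[language]
    return re.compile(
        re.escape(open_q) + "[^" + re.escape(close_q) + r"\r\n]*$",
        re.MULTILINE,
    )

german = unclosed_pattern("German")
print(bool(german.search("„Hallo, sagte er.")))   # True: close is missing
print(bool(german.search("„Hallo“, sagte er.")))  # False: properly closed
```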

- - -

And some, like English or Swedish, will have a ton more false positives to look through.

Luckily though, you're using guillemets (French?), which is WAY easier to handle compared to English.

In English, you have:

“ ” + ‘ ’

and that 2nd set is the worst, because ’ is used for all sorts of things.

“The name of the article I read was ‘Zeus’ Wife’s Example’.”

1 + 4 are the actual inner quotes.

2 + 3 are actually apostrophes.

If all you're doing is checking Left/Right inner quotes, a dumb algorithm will just think you have:

1 LEFT + 3 RIGHT quotes
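
That count is easy to reproduce; a tiny Python sketch using the example sentence above:

```python
sentence = "“The name of the article I read was ‘Zeus’ Wife’s Example’.”"

lefts = sentence.count("‘")   # left single quotation marks
rights = sentence.count("’")  # right single quotes AND apostrophes look identical
print(lefts, rights)  # 1 3
```

A plain count cannot tell the two apostrophes from the one real closing quote, which is why a left/right tally alone reports a mismatch here.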

An algorithm that handles a lot of those edge-cases I mentioned above—plus searching forward/backwards—would catch different errors at different steps.

PLUS it'll minimize the false positives, which is the real time-waster/time-killer.

I explained a lot of this in:

2020: "Space Between Double and Single Quotes?"

With the pure regex method, you're wasting so much time looking through thousands of correct quotation marks, only to catch that small minority of actual typos/errors/mistakes.

Take it from me... I've probably corrected more quotation mark errors than anyone else on these boards—combined! lol.

rkomar · 04-23-2023, 08:23 PM

Can you not just search for two opening quotes without a closing quote between them?

Tex2002ans · 04-23-2023, 11:12 PM

Quote:

Originally Posted by rkomar

Can you not just search for two opening quotes without a closing quote between them?

Heh, yes, this will catch some of the punctuation errors.

2 LEFTs in a row:

Find: («[^»\r\n]*)«

«Have you been here a long time Terry?» ... «About five days» he replied.
«Have you been here a long time Terry?» ... «About five days he replied.
«Have you been here a long time Terry? ... «About five days» he replied.
«Have you been here a long time Terry? ... «About five days he replied.
«Have you been here a long time Terry?« ... «About five days he replied.
- Wrongly flipped guillemet / OCR error.

So, if you want the pure regex method, you'd create a big ol' collection of "Saved Searches", going from the easy-to-catch stuff—like 2 LEFTs or 2 RIGHTs in a row—all the way down to the hardest-to-catch.
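
Outside of Sigil, such a collection could be prototyped as a small table of patterns. The sketch below is Python, not Sigil's Saved Searches format; only the "two lefts" pattern comes from this post, the other three patterns, the end-of-line and start-of-line anchors, and the names SAVED_SEARCHES and run_searches are assumptions for illustration.

```python
import re

SAVED_SEARCHES = {
    "two lefts in a row": r"«[^»\r\n]*«",
    "two rights in a row": r"»[^«\r\n]*»",
    "open, missing close": r"«[^»\r\n]*$",
    "close, missing open": r"^[^«»\r\n]*»",
}

def run_searches(text):
    """Return the names of the searches that match somewhere in text."""
    return [
        name
        for name, rx in SAVED_SEARCHES.items()
        if re.search(rx, text, re.MULTILINE)
    ]

print(run_searches("«Have you been here a long time Terry? ... «About five days» he replied."))
# ['two lefts in a row']
print(run_searches("«Have you been here a long time Terry?» ... «About five days he replied."))
# ['open, missing close']
```

Each search only sees one kind of mistake, so a paragraph with several errors has to be rechecked after every fix.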

SilvioTO · 04-24-2023, 04:42 AM

Thank you very much Tex2002ans for the valuable advice and clarity in expounding it.
As for me, I can only say that I have solved the problem.
Thanks again to everyone.

Turtle91 · 04-24-2023, 10:55 PM

Edit: This is an answer to Jon's question...not trying to repeat what Tex already answered above...

Quote:

Originally Posted by JSWolf

But what if the dialog is something like «Hello» she said.

You would end up with «Hello she said.»

Sorry it took so long to reply. I was off being a kid in Orlando...

To answer your question, Jon: no. In general, the ? in the capture group forces it to be a minimal match, so it would stop capturing at the first legal stopping location. Your example has a closing » so it would not trigger a match.

However, there are two points that I would change based on the discussion above (and my chance to test now that I'm home - which I couldn't do on my phone).

Instead of using the negative lookahead on the », I would use a positive lookahead on the </p>.
I would add the » to a 'not' in the capture group.

Code:

«(.[^»]*?)(?=</p>)

That would capture anything preceded by a « that doesn't have a » and is followed by a </p>, without capturing the </p> (i.e. just the text).

You are correct that the «\1» would place guillemets around the entire captured text, but this process was never intended to be completely automatic. There is no way for the Find/Replace to discern what is dialogue and what is not. If you use the "Replacements (Delete Unwanted Replacements)" table (see attached image) you would quickly be able to see which sections have an opening « with no closing » highlighted - with what it would look like After the replace. As KevinH mentioned above, you could quickly remove any lines that didn't match the change you wanted...apply changes...then run the find again with a different replace criteria.

i.e. the above would not find "«Hello she said. «Hello she said.»" until you ran a different Find with the « in place of the lookahead, like: «(.[^»]*?)«

That would probably be the fastest (least # of iterations) way to search through the text to find those types of errors.
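
For anyone who wants to try the two patterns from this post outside Sigil, here is a minimal Python sketch. It assumes Python's re behaves like Sigil's PCRE for these constructs; the sample paragraphs are built from the examples in this thread.

```python
import re

missing_close = re.compile(r"«(.[^»]*?)(?=</p>)")  # first Find
two_opens = re.compile(r"«(.[^»]*?)«")             # second Find

a = "<p>«Hello she said.</p>"
b = "<p>«Hello» she said.</p>"
c = "<p>«Hello she said. «Hello she said.»</p>"

print(missing_close.search(a).group(1))  # Hello she said.
print(missing_close.search(b))           # None
print(missing_close.search(c))           # None
print(repr(two_opens.search(c).group(1)))

# The replace wraps the whole captured text, as discussed above.
print(missing_close.sub(r"«\1»", a))     # <p>«Hello she said.»</p>
```

The replacement closes the quote at the end of the paragraph, so each result still needs a look to decide where the » really belongs.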
