MobileRead Forums - View Single Post - find dialogues with missing closing inverted commas

Tex2002ans · 04-23-2023, 04:59 PM

Quote:

Originally Posted by SilvioTO

The solution proposed by Tex2002ans works perfectly and is a great help. However, to be perfect, it should be able to detect errors when two sentences are in the same paragraph.

Then get rid of that final :

Find: («[^»\r\n]*)

This will find each open quote, then everything up to a closing quote OR the end of the line:

«Have you been here a long time Terry?» ... «About five days» he replied.

and it will catch stuff like:

«Have you been here a long time Terry?» ... «About five days he replied.
- 2nd close missing.
«Have you been here a long time Terry? ... «About five days» he replied.
- 1st close missing.
«Have you been here a long time Terry? ... «About five days he replied.
- 1st AND 2nd close missing.
- (Although you'd only catch the 2nd one after correcting the 1st.)

But now you'll have an absolute TON of false positives to look through...
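If you want to try that pattern outside Sigil, here is a minimal Python sketch. The regex is the one from the Find above. The function name `suspicious_spans` and the filter are my own assumptions for thinning out the false positives, not anything Sigil does: a match is treated as fine when it is immediately followed by » and did not swallow a second «.

```python
import re

# Same pattern as the Find above: an opening guillemet, then everything up to
# (but not including) the next closing guillemet or the end of the line.
OPEN_RUN = re.compile(r"(«[^»\r\n]*)")

def suspicious_spans(line):
    """Return the matches in one line that look like a missing close quote.

    Assumed filter (not part of the original regex): a match is OK when the
    next character is » and the match contains only one «.
    """
    hits = []
    for m in OPEN_RUN.finditer(line):
        span = m.group(1)
        closed = line[m.end():m.end() + 1] == "»"
        if not closed or span.count("«") > 1:
            hits.append(span)
    return hits

samples = [
    "«Have you been here a long time Terry?» ... «About five days» he replied.",
    "«Have you been here a long time Terry?» ... «About five days he replied.",
    "«Have you been here a long time Terry? ... «About five days» he replied.",
    "«Have you been here a long time Terry? ... «About five days he replied.",
]
for s in samples:
    print(len(suspicious_spans(s)), suspicious_spans(s))
```

The first sample gives no hits and the other three give one hit each, so the last line still needs a second look after the first fix.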

Quote:

Originally Posted by SilvioTO

Another solution I found is less sophisticated and requires more visual attention, but, in case of errors, it highlights anomalies in the code that easily catch the eye making it useful:

Heh. I'd strongly recommend learning about Sigil's fantastic:

Tools > Saved Searches

This lets you create and save multiple Find/Replaces + organize them into Groups.

As described below, you can come up with quite a few sets of "try to catch missing open/close quotes" Regular Expressions.

Those regular expressions alone will carry you like 90% of the way there... but it's that final 10% that's really troublesome.

Quote:

Originally Posted by SilvioTO

Perhaps making two passes of the entire text with the two solutions might eliminate errors altogether.

No. It would require WAY more than 2 sets/passes of Regular Expressions. You have to:

Deal with outer/inner quotes separately.
Catch 2 lefts in a row.
Catch 2 rights in a row.
Catch 1 open, missing close.
Catch 1 close, missing open.

AND you have to also search forwards/backwards.
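To give a rough idea of what one such set could look like, here is a sketch for a single guillemet pair, run one line (paragraph) at a time. The check names and the exact patterns are my assumptions, not a tested Saved Searches group, and this does nothing about inner quotes or HTML.

```python
import re

# Hypothetical check names; each pattern is meant for ONE line at a time.
CHECKS = {
    "two lefts in a row": re.compile(r"«[^»\r\n]*«"),
    "two rights in a row": re.compile(r"»[^«\r\n]*»"),
    "open, missing close": re.compile(r"«[^»\r\n]*$"),
    "close, missing open": re.compile(r"^[^«\r\n]*»"),
}

def run_checks(line):
    """Return the names of all checks that fire on this line."""
    return [name for name, rx in CHECKS.items() if rx.search(line)]

for line in [
    "«a» ... «b» he said.",
    "«a ... «b» he said.",
    "«a» ... «b he said.",
    "a» ... «b» he said.",
    "«a» b» c",
]:
    print(line, "->", run_checks(line))
```

Each pattern only sees one kind of mistake, which is why you end up needing a whole group of them.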

Then, you'll want to expand it:

Deal with multiple dialogues in a single paragraph.
- THIS is where Regex is pretty dumb.
- A program can keep an "ON/OFF state" based on each left/right quote + where it is in the paragraph + which stage of the process it is in.
(Optional) Deal with long paragraphs of continual dialogue.
- Common in Fiction.
- Each paragraph will only start with opening quotes. There is no close quote until the long monologue is completed.
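The "ON/OFF state" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions (one quote pair, plain text with no HTML, hypothetical function names), not how any existing tool does it. The `continues` flag is my guess at handling the long-monologue case: an unclosed open quote is forgiven when the next paragraph starts with an open quote.

```python
def check_paragraph(text, open_q="«", close_q="»", continues=False):
    """Walk one paragraph and return a list of (position, message) problems."""
    problems = []
    open_at = None  # index of the currently open quote; None means "OFF"
    for i, ch in enumerate(text):
        if ch == open_q:
            if open_at is not None:  # already ON: the previous open was never closed
                problems.append((open_at, "open quote never closed"))
            open_at = i
        elif ch == close_q:
            if open_at is None:      # OFF: nothing to close
                problems.append((i, "close quote with no open"))
            open_at = None
    if open_at is not None and not continues:
        problems.append((open_at, "open quote never closed"))
    return problems

def check_paragraphs(paragraphs, open_q="«", close_q="»"):
    """Check a list of paragraphs; returns (paragraph index, position, message)."""
    report = []
    for n, para in enumerate(paragraphs):
        nxt = paragraphs[n + 1] if n + 1 < len(paragraphs) else ""
        # Assumption: a following paragraph that starts with an open quote
        # means this one is part of a continued monologue.
        continues = nxt.lstrip().startswith(open_q)
        for pos, msg in check_paragraph(para, open_q, close_q, continues):
            report.append((n, pos, msg))
    return report

print(check_paragraph("«Have you been here a long time Terry? ... «About five days he replied."))
print(check_paragraphs(["«Part one of a long speech.", "«Part two.»"]))
```

Unlike the single regex, this reports both missing closes in that first line in one pass. The trade-off is that the monologue rule will also hide a real missing close whenever the next paragraph happens to start with dialogue.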

AND you have to deal with (or skip) false positives. (Is this an apostrophe or a single quotation mark?)

AND then you have to deal with all the potential HTML mess in the way too. (tags, class="", [...])

AND then you have to handle inner/outer quotes of all the different languages:

« » = French
“ ” + ‘ ’ = English (US)
‘ ’ + “ ” = English (UK)
„ “ = German

A search written for those symbols will not work for:

» « = Danish
” ” + ’ ’ = Swedish
„ ” = Hungarian

(That's not a complete list of steps, only what I can quickly think of off the top of my head!)

I went into much more detail in the linked posts/threads...

- - -

Toxaris already solved all those issues in his "Dialogue Check" though.

What you would need is a smart program—not just pure regex—that handles outer/inner quotes, and lets you select between them all as needed.

The logic is effectively the same for all languages, just that the symbols switch.
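Here is a rough Python sketch of "same logic, different symbols". The table is just the list from earlier in this post (where I only listed one pair for a language, the inner pair is left as `None`). The names `QUOTE_STYLES` and `check` are hypothetical, and this is in no way how Dialogue Check is implemented.

```python
# (outer pair, inner pair) per style, taken from the list in this post.
QUOTE_STYLES = {
    "French": (("«", "»"), None),
    "English (US)": (("“", "”"), ("‘", "’")),
    "English (UK)": (("‘", "’"), ("“", "”")),
    "German": (("„", "“"), None),
    "Danish": (("»", "«"), None),
    "Swedish": (("”", "”"), ("’", "’")),
    "Hungarian": (("„", "”"), None),
}

def check(text, style):
    """Return (position, message) problems for one paragraph in one style."""
    outer, inner = QUOTE_STYLES[style]
    levels = [outer] + ([inner] if inner else [])
    depth = 0  # number of quote levels currently open
    problems = []
    for i, ch in enumerate(text):
        if depth > 0 and ch == levels[depth - 1][1]:
            depth -= 1  # closes the innermost open level
        elif depth < len(levels) and ch == levels[depth][0]:
            depth += 1  # opens the next level
        elif any(ch in pair for pair in levels):
            problems.append((i, "unexpected " + ch))
    if depth:
        problems.append((len(text), "quote still open at end"))
    return problems

print(check("«Bonjour» dit-il.", "French"))  # clean
print(check("»Hej« sagde han.", "Danish"))   # clean
print(check("»Hej« sagde han.", "French"))   # wrong symbols: problems reported
```

Checking "close" before "open" is what lets styles like Swedish work, where the same mark opens and closes. It does nothing about apostrophes, so expect plenty of false positives on the single-quote styles.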

- - -

And some, like English or Swedish, will have a ton more false positives to look through.

Luckily though, you're using guillemets (French?), which is WAY easier to handle compared to English.

In English, you have:

“ ” + ‘ ’

and that 2nd set is the worst, because ’ is used for all sorts of things.

“The name of the article I read was ‘Zeus’ Wife’s Example’.”

Counting the single marks from left to right, 1 + 4 are the actual inner quotes.

2 + 3 are actually apostrophes.

If all you're doing is checking Left/Right inner quotes, a dumb algorithm will just think you have:

1 LEFT + 3 RIGHT quotes
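You can see the "dumb" count for yourself with a few lines of Python. The second half is an assumed heuristic of mine (treat ’ with a letter on both sides as an apostrophe), only there to show how far a simple rule gets you.

```python
import re

sentence = "“The name of the article I read was ‘Zeus’ Wife’s Example’.”"

# The dumb count: every ‘ is a LEFT, every ’ is a RIGHT.
lefts = sentence.count("‘")
rights = sentence.count("’")
print(lefts, rights)

# Assumed heuristic: a ’ between two letters (Wife’s) is an apostrophe,
# so swap it for a straight ' before counting again.
cleaned = re.sub(r"(?<=\w)’(?=\w)", "'", sentence)
rights_after = cleaned.count("’")
print(lefts, rights_after)
```

The heuristic takes care of Wife’s, but Zeus’ is followed by a space and looks exactly like a closing quote, so the counts still don't balance.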

An algorithm that handles a lot of those edge-cases I mentioned above—plus searching forward/backwards—would catch different errors at different steps.

PLUS it'll minimize the false positives, which is the real time-waster/time-killer.

I explained a lot of this in:

2020: "Space Between Double and Single Quotes?"

With the pure regex method, you're wasting so much time looking through thousands of correct quotation marks, only to catch that small minority of actual typos/errors/mistakes.

Take it from me... I've probably corrected more quotation mark errors than anyone else on these boards—combined! lol.