Quote:
Originally Posted by SilvioTO
The solution proposed by Tex2002ans works perfectly and is a great help. However, to be perfect, it should be able to detect errors when two sentences are in the same paragraph.
|
Then get rid of that final </p>:
Find: («[^»\r\n]*)
This will find each open quote and everything after it, all the way to a closing quote OR the end of the line:
- <p>«Have you been here a long time Terry?» ... «About five days» he replied.</p>
and it will catch stuff like:
- <p>«Have you been here a long time Terry?» ... «About five days he replied.</p>
- <p>«Have you been here a long time Terry? ... «About five days» he replied.</p>
- <p>«Have you been here a long time Terry? ... «About five days he replied.</p>
  - 1st AND 2nd close missing.
  - (Although you'd only catch the 2nd after correcting the 1st one.)
But now you'll have an absolute TON of false positives to look through...
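As a rough illustration of how that Find pattern behaves outside Sigil, here is a minimal Python sketch. The function name `suspicious_runs` and the two reason strings are made up for this example; only the pattern itself comes from the Find above. It assumes one paragraph per line.

```python
import re

# The Find pattern from above: an opening guillemet, then everything up to
# (but not including) the next closing guillemet or the end of the line.
OPEN_RUN = re.compile(r"«[^»\r\n]*")


def suspicious_runs(line):
    """Return (run, reason) pairs for runs that look like a missing close quote."""
    problems = []
    for match in OPEN_RUN.finditer(line):
        run = match.group(0)
        if "«" in run[1:]:
            # The run swallowed another open quote before reaching any close.
            problems.append((run, "second open quote before any close"))
        elif line[match.end():match.end() + 1] != "»":
            # The run stopped at the end of the line, not at a close quote.
            problems.append((run, "reached end of line without a close"))
    return problems


line = "<p>«Have you been here a long time Terry? ... «About five days he replied.</p>"
for run, reason in suspicious_runs(line):
    print(reason, "->", run)
```

On that last line it reports only one problem, because the whole paragraph is a single run. That matches the note above: the 2nd missing close only shows up after the 1st one is corrected.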
Quote:
Originally Posted by SilvioTO
Another solution I found is less sophisticated and requires more visual attention, but, in case of errors, it highlights anomalies in the code that easily catch the eye making it useful:
|
Heh. I'd strongly recommend learning about Sigil's fantastic:
This lets you create and save multiple Find/Replaces + organize them into Groups.
As described below, you can come up with quite a few sets of "try to catch missing open/close quotes" Regular Expressions.
Those regular expressions alone will carry you like 90% of the way there... but it's that final 10% that's really troublesome.
Quote:
Originally Posted by SilvioTO
Perhaps making two passes of the entire text with the two solutions might eliminate errors altogether.
|
No. It would require WAY more than 2 sets/passes of Regular Expressions. You have to:
- Deal with outer/inner quotes separately.
- Catch 2 lefts in a row.
- Catch 2 rights in a row.
- Catch 1 open, missing close.
- Catch 1 close, missing open.
AND you have to also search forwards/backwards.
Then, you'll want to expand it:
- Deal with multiple dialogues in a single paragraph.
  - THIS is where Regex is pretty dumb.
  - A program can be in an "ON/OFF state" based on each left/right quote + where it is in the paragraph + which stage of the process it is in.
- (Optional) Deal with long paragraphs of continual dialogue.
  - Common in Fiction.
  - Each paragraph will only start with opening quotes. There is no close quote until the long monologue is completed.
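To make the "ON/OFF state" idea concrete, here is a minimal Python sketch of a checker that walks one paragraph character by character. This is not how any existing tool does it; `check_paragraph` and its messages are hypothetical names for illustration, and it only looks at one pair of quotes with no HTML handling.

```python
def check_paragraph(text, open_q="«", close_q="»"):
    """Walk one paragraph and report quote problems as (position, message) pairs."""
    problems = []
    inside = False    # the ON/OFF state: are we currently inside dialogue?
    opened_at = None  # position of the most recent open quote
    for pos, ch in enumerate(text):
        if ch == open_q:
            if inside:
                problems.append((pos, "two opens in a row"))
            inside = True
            opened_at = pos
        elif ch == close_q:
            if not inside:
                problems.append((pos, "close without open"))
            inside = False
    if inside:
        # Still ON at the end of the paragraph.
        problems.append((opened_at, "open without close"))
    return problems


for pos, message in check_paragraph("«Have you been here long? ... «About five days» he replied."):
    print(pos, message)
```

A long monologue split over several paragraphs would be reported here as "open without close", so a real tool would have to look at the next paragraph before deciding whether that is an error.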
AND you have to deal with (or skip) false positives. (Is this an apostrophe or a single quotation mark?)
AND then you have to deal with all the potential HTML mess in the way too. (<span>, <i>, <em>, class="", [...])
AND then you have to handle inner/outer quotes of all the different languages:
- « » = French
- “ ” + ‘ ’ = English (US)
- ‘ ’ + “ ” = English (UK)
- „ “ = German
will not work for:
- » « = Danish
- ” ” + ’ ’ = Swedish
- „ ” = Hungarian
(That's not a complete list of steps, only what I can quickly think of off the top of my head!)
I went into much more detail in the linked posts/threads...
- - -
Toxaris already solved all those issues in his "Dialogue Check" though.
What you would need is a smart program—not just pure regex—that handles outer/inner quotes, and lets you select between them all as needed.
The logic is effectively the same for all languages, just that the symbols switch.
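A small Python sketch of "same logic, different symbols": the outer-quote pairs below are copied from the list earlier in this post, while the table name `OUTER_QUOTES` and the function `is_balanced` are assumptions made up for the example. It only answers yes/no for outer quotes in one paragraph and ignores inner quotes and HTML.

```python
# Outer quote pairs (open, close), taken from the language list above.
OUTER_QUOTES = {
    "French": ("«", "»"),
    "English (US)": ("“", "”"),
    "English (UK)": ("‘", "’"),
    "German": ("„", "“"),
    "Danish": ("»", "«"),
    "Swedish": ("”", "”"),
    "Hungarian": ("„", "”"),
}


def is_balanced(text, language):
    """Return True if the outer quotes of `language` pair up cleanly in `text`."""
    open_q, close_q = OUTER_QUOTES[language]
    if open_q == close_q:
        # Same mark opens and closes, so all we can check is an even count.
        return text.count(open_q) % 2 == 0
    depth = 0
    for ch in text:
        if ch == open_q:
            depth += 1
        elif ch == close_q:
            depth -= 1
        if depth not in (0, 1):
            # Either a close with no open, or two opens in a row.
            return False
    return depth == 0


print(is_balanced("«Oui» dit-il.", "French"))
print(is_balanced("»Ja« sagde han.", "Danish"))
```

Languages where the open and close mark are identical (Swedish here) lose the left/right information completely, which is one reason they produce more false positives.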
- - -
And some, like English or Swedish, will have a ton more false positives to look through.
Luckily though, you're using guillemets (French?), which is WAY easier to handle compared to English.
In English, you have “ ” + ‘ ’ (as in the list above), and that 2nd set is the worst, because ’ is used for all sorts of things.
- “The name of the article I read was ‘Zeus’ Wife’s Example’.”
1 + 4 are the actual inner quotes.
2 + 3 are actually apostrophes.
If all you're doing is checking Left/Right inner quotes, a dumb algorithm will just think you have:
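Counting the marks in that example sentence shows what such a naive check sees. A minimal Python sketch (the function name `count_inner_marks` is just for illustration):

```python
def count_inner_marks(text):
    """Count left and right single quotation marks, apostrophes included."""
    return text.count("‘"), text.count("’")


sentence = "“The name of the article I read was ‘Zeus’ Wife’s Example’.”"
lefts, rights = count_inner_marks(sentence)
print(lefts, "left,", rights, "right")  # prints: 1 left, 3 right
```

So the counter sees 1 left and 3 rights and flags the sentence as broken, even though it is correct, because it cannot tell the two apostrophes from the real closing quote.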
An algorithm that handles a lot of those edge-cases I mentioned above—plus searching forward/backwards—would catch different errors at different steps.
PLUS it'll minimize the false positives, which is the real time-waster/time-killer.
I explained a lot of this in:
With the pure regex method, you're wasting so much time looking through thousands of correct quotation marks, only to catch that small minority of actual typos/errors/mistakes.
Take it from me... I've probably corrected more quotation mark errors than anyone else on these boards—combined! lol.