Quote:
Originally Posted by senhal
I'd like to find paragraphs in which the number of « and » don't match. This can be useful to find some OCR typical errors regarding dialogues (« not closed or viceversa, « or » wrongly recognized...)
|
For a very simple check, I used to use a regex along these lines to search for opening quotes with no closing quotes:
Search: (“[^”]*)</p>
You could probably just substitute the guillemets in to catch an opening guillemet with no closing guillemet:
Search: («[^»]*)</p>
Note that these simple regexes won't catch other common OCR errors, such as two Left Quotation marks in a row.
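If you want to try the guillemet search outside of an editor, here is a rough Python sketch of the same pattern. The function name find_unclosed and the sample paragraphs are made up for illustration, and it assumes each paragraph ends with a plain </p> tag.

```python
import re

# Same pattern as the search above: an opening guillemet followed by
# anything except a closing guillemet, right up to the end of the paragraph.
UNCLOSED = re.compile(r"(«[^»]*)</p>")

def find_unclosed(html):
    """Return (offset, text) for each opening guillemet left unclosed before </p>."""
    return [(m.start(1), m.group(1)) for m in UNCLOSED.finditer(html)]

# Hypothetical sample: the first paragraph is fine, the second is not.
sample = (
    "<p>« Bonjour », dit-il.</p>\n"
    "<p>« Bonjour, dit-il.</p>\n"
)

for offset, text in find_unclosed(sample):
    print(offset, text)
```

Keep in mind that [^»]* will happily run across paragraph boundaries, so one match may cover several paragraphs if no closing guillemet turns up for a while.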
Quote:
Originally Posted by kovidgoyal
Then count the number of quotes in it, if the number does not match, replace the paragraph with an new paragraph that contains some prominent marker added to the text, like say "FIXME"
|
A simple count of Left/Right Double Quotes would be a pretty good approximation... but many of the quotation errors are more nuanced.
Code:
“This is “Tex’s example” error.”
In this case, you have 2 Left Double Quotes, and 2 Right Double Quotes, but the inner ones are wrong. The Outer/Inner quotations must alternate (in English and many other languages, see Wikipedia link below). The inner quotations must be Left/Right Single Quotes (in US English).
Code:
“This is ‘Tex’s fixed example’ error.”
Now you want to check the inner quotes. Whoops, now there is 1 Left Single Quote and 2 Right Single Quotes (one is an apostrophe). You need something more intelligent than a simple regex to sift out apostrophes from actual quotation marks.
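To give an idea of what "more intelligent" could look like, here is a minimal Python sketch of one possible heuristic: treat a Right Single Quote with a letter on both sides as an apostrophe and leave it out of the count. The function name count_single_quotes is hypothetical, and the heuristic is only an assumption on my part. It will still miscount apostrophes at the end or start of a word (dogs’ or ’tis).

```python
def count_single_quotes(text):
    """Count opening and closing single quotes, skipping likely apostrophes.

    Heuristic (an assumption, not a complete rule): a Right Single Quote
    with a letter directly before AND after it is an apostrophe.
    """
    opens = text.count("‘")
    closes = 0
    for i, ch in enumerate(text):
        if ch != "’":
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if before.isalpha() and after.isalpha():
            continue  # letter on both sides: treat as an apostrophe
        closes += 1
    return opens, closes

print(count_single_quotes("“This is ‘Tex’s fixed example’ error.”"))
```

With the fixed example, the ’ in Tex’s is skipped, so the inner quotes come out balanced.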
Also, as Toxaris mentioned, in many cases (novels) there may be a long speech, and there is an opening quotation mark with no closing quotation mark:
Code:
“This is a sample of a very long speech.
“It continues on for a few paragraphs.
“And then it finally finishes here.”
With the simple count method, you would get WAY too many false positives.
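One way to cut down on those false positives is to let a paragraph off when it has exactly one extra opening quote and the next paragraph starts with an opening quote. Below is a rough Python sketch of that rule. The function name flag_paragraphs is made up, and the continuation rule is my assumption about typical English novels, not a universal convention.

```python
def flag_paragraphs(paragraphs):
    """Return indexes of paragraphs whose double quotes look unbalanced.

    A paragraph with one unclosed opening quote is accepted when the
    next paragraph also starts with an opening quote (continued speech).
    """
    flagged = []
    for i, para in enumerate(paragraphs):
        opens = para.count("“")
        closes = para.count("”")
        if opens == closes:
            continue
        nxt = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        if opens == closes + 1 and nxt.startswith("“"):
            continue  # looks like a speech continuing into the next paragraph
        flagged.append(i)
    return flagged

speech = [
    "“This is a sample of a very long speech.",
    "“It continues on for a few paragraphs.",
    "“And then it finally finishes here.”",
]
print(flag_paragraphs(speech))
```

The downside is that a genuinely missing closing quote goes unnoticed whenever the next paragraph happens to open with dialogue, so this trades false positives for a few misses.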
Note: As Toxaris mentioned, handling missing quotes due to OCR errors is deceptively complicated. There are variations of quotations in every language:
https://en.wikipedia.org/wiki/Quotation_mark
Some languages open with right quotes + close with left quotes. Some open with low quotes, some close with high quotes. Others have opening/closing quotes facing the same direction... It is really a giant mess. Pretty much every combination imaginable is used in some language.
Quote:
Originally Posted by senhal
Imho, when you're working on an OCR, it's better to verify also false positives 
|
You definitely need manual verification and pointing out EXACTLY where each (potential) error occurs. Having it point out a mismatch at the paragraph-level is just not enough. I don't know what sort of paragraphs you guys work with, but the paragraphs I typically work on are HUGE. :P
Note: I did determine the logic for mismatching parentheses was exactly the same as mismatching quotation marks. I pushed this idea onto Toxaris and he did implement it in his plugin.
I run every single book I work on through his Dialogue Check to double-check quotations/parentheses, and it has caught tens or even hundreds of errors that my previous series of Regex methods missed.
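To illustrate why parentheses and quotation marks can share the same logic, here is a minimal stack-based sketch in Python that reports the exact offset of each unmatched character. To be clear, this is NOT how Toxaris's plugin works (I am not describing its code). The names PAIRS and find_mismatches are hypothetical, and it assumes the opener and closer are different characters.

```python
# Opener -> closer. Assumed set; adjust it for the language you work in.
PAIRS = {"“": "”", "(": ")", "«": "»"}
CLOSERS = {closer: opener for opener, closer in PAIRS.items()}

def find_mismatches(text):
    """Return sorted (offset, char) for openers never closed and closers never opened."""
    stack = []      # openers still waiting for their closer
    problems = []   # closers that did not match the most recent opener
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((i, ch))
        elif ch in CLOSERS:
            if stack and stack[-1][1] == CLOSERS[ch]:
                stack.pop()
            else:
                problems.append((i, ch))
    problems.extend(stack)  # whatever is left was never closed
    return sorted(problems)

print(find_mismatches("He said (“yes”."))
```

Run per paragraph, this points at the exact character instead of just flagging the whole paragraph. It does not know about speeches that continue across paragraphs, so those would still need a special case.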