03-31-2016, 12:13 AM   #7
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by senhal
I'd like to find paragraphs in which the number of « and » don't match. This can be useful to find some OCR typical errors regarding dialogues (« not closed or viceversa, « or » wrongly recognized...)
For a very simple check, I used to use a Regex along these lines to search for opening quotes with no closing quotes:

Search: (“[^”]*)</p>

You could probably just substitute this in to catch some opening guillemet with no closing guillemet:

Search: («[^»]*)</p>

These simple regexes won't catch other common OCR errors, though, such as two Left Quotation marks in a row.
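
For anyone who wants to run this kind of check outside the editor, here is a rough Python sketch of the same idea. It is only an illustration: the sample markup, the function name find_suspects, and the extra "double-open" pattern are assumptions of mine, not something Sigil or calibre provides. It splits the text into paragraphs first, so a match cannot spill over into the next paragraph.

```python
import re

# Hypothetical sample markup; real books will be messier.
SAMPLE = (
    "<p>«Bonjour, dit-il.</p>\n"
    "<p>«Bonsoir», répondit-elle.</p>\n"
    "<p>«Un «deux» trois.</p>\n"
)

# Grab the contents of each <p>...</p> (assumes paragraphs are not nested).
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL)

# An opening guillemet with no closing guillemet before the paragraph ends.
UNCLOSED = re.compile(r"«[^»]*\Z")

# Two opening guillemets with no closing guillemet between them.
DOUBLE_OPEN = re.compile(r"«[^»]*«")

def find_suspects(html):
    """Return (paragraph index, label) pairs for paragraphs worth a manual look."""
    suspects = []
    for index, match in enumerate(PARAGRAPH.finditer(html)):
        text = match.group(1)
        if UNCLOSED.search(text):
            suspects.append((index, "unclosed"))
        if DOUBLE_OPEN.search(text):
            suspects.append((index, "double-open"))
    return suspects

for index, label in find_suspects(SAMPLE):
    print(index, label)
```

On the sample, the third paragraph slips past the "unclosed" pattern and is only picked up by the "double-open" one.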

Quote:
Originally Posted by kovidgoyal
Then count the number of quotes in it, if the number does not match, replace the paragraph with an new paragraph that contains some prominent marker added to the text, like say "FIXME"
A simple count of Left/Right Double Quotes would be a pretty good approximation... but many of the quotation errors are more nuanced.

Code:
“This is “Tex’s example” error.”
In this case, you have 2 Left Double Quotes, and 2 Right Double Quotes, but the inner ones are wrong. The Outer/Inner quotations must alternate (in English and many other languages, see Wikipedia link below). The inner quotations must be Left/Right Single Quotes (in US English).

Code:
“This is ‘Tex’s fixed example’ error.”
Now you want to check the inner quotes. Whoops, now there is 1 Left Single Quote and 2 Right Single Quotes (one is an apostrophe). You have to have some sort of way that is more intelligent than simple Regex to sift out apostrophes from actual quotation marks.
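
To show how far a naive heuristic gets, here is a small Python sketch. The rule it uses (a Right Single Quote with a letter on both sides is an apostrophe) is my own assumption for illustration, and it is easy to break: possessives like "dogs’" or elisions like "’Tis" still get counted as closing quotes.

```python
import re

# Assumed heuristic: ’ sandwiched between two word characters is an apostrophe.
APOSTROPHE = re.compile(r"(?<=\w)’(?=\w)")

def single_quote_counts(text):
    """Count Left/Right Single Quotes after dropping in-word apostrophes."""
    stripped = APOSTROPHE.sub("", text)
    return stripped.count("‘"), stripped.count("’")

print(single_quote_counts("“This is ‘Tex’s fixed example’ error.”"))
print(single_quote_counts("“He filled the dogs’ bowls.”"))
```

The first sentence comes out balanced, while the second is flagged even though nothing is wrong with it.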

Also, as Toxaris mentioned, in many cases (novels) there may be a long speech, and there is an opening quotation mark with no closing quotation mark:

Code:
“This is a sample of a very long speech.

“It continues on for a few paragraphs.

“And then it finally finishes here.”
With the simple count method, you would get WAY too many false positives.
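
A quick Python sketch of the simple count method, run on the examples from this post, shows both failure modes. The function name double_quote_mismatch is just something I made up for the illustration.

```python
def double_quote_mismatch(paragraph):
    """True when Left and Right Double Quote counts differ in the paragraph."""
    return paragraph.count("“") != paragraph.count("”")

paragraphs = [
    "“This is “Tex’s example” error.”",          # wrong nesting, equal counts
    "“This is a sample of a very long speech.",  # valid multi-paragraph speech
    "“It continues on for a few paragraphs.",
    "“And then it finally finishes here.”",
]

flags = [double_quote_mismatch(p) for p in paragraphs]
print(flags)
```

The real error in the first paragraph is missed, and the two perfectly valid continuation paragraphs are flagged.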

Note: As Toxaris mentioned, handling missing quotes due to OCR errors is deceptively complicated. There are variations of quotations in every language:

https://en.wikipedia.org/wiki/Quotation_mark

Some languages open with right quotes + close with left quotes. Some open with low quotes, some close with high quotes. Others have opening/closing quotes facing the same direction... It is really a giant mess. Pretty much every combination imaginable is used in some language.

Quote:
Originally Posted by senhal
Imho, when you're working on an OCR, it's better to verify also false positives
You definitely need manual verification and pointing out EXACTLY where each (potential) error occurs. Having it point out a mismatch at the paragraph-level is just not enough. I don't know what sort of paragraphs you guys work with, but the paragraphs I typically work on are HUGE. :P

Note: I did determine the logic for mismatching parentheses was exactly the same as mismatching quotation marks. I pushed this idea onto Toxaris and he did implement it in his plugin.
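
To illustrate why the two problems share the same logic, here is a minimal stack-based sketch in Python that reports the exact character offset of each suspect mark. This is not how Toxaris's plugin is implemented (I am not describing its internals); the names PAIRS and find_mismatches are hypothetical, and the sketch only handles parentheses and US-style double quotes.

```python
# Openers mapped to the closers they expect (assumed set, extend as needed).
PAIRS = {"(": ")", "“": "”"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}

def find_mismatches(text):
    """Return sorted (offset, character, reason) tuples for unpaired marks."""
    stack = []     # openers still waiting for their closer: (char, offset)
    problems = []
    for offset, char in enumerate(text):
        if char in PAIRS:
            stack.append((char, offset))
        elif char in CLOSERS:
            if stack and stack[-1][0] == CLOSERS[char]:
                stack.pop()
            else:
                problems.append((offset, char, "unexpected closer"))
    # Anything left on the stack was opened but never closed.
    problems.extend((offset, char, "never closed") for char, offset in stack)
    return sorted(problems)

print(find_mismatches("He said (quietly."))
print(find_mismatches("(a “b) c”"))
```

Run per paragraph, this still flags the multi-paragraph speech case and does not notice double quotes wrongly nested inside double quotes, so it is a starting point for manual review rather than a full check.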

I run every single book I work on through his Dialogue Check to double-check quotations/parentheses, and it has caught tens or hundreds of errors that my previous series of Regex methods missed.

Last edited by Tex2002ans; 03-31-2016 at 12:20 AM.