MobileRead Forums


senhal 03-26-2016 05:54 AM

Regex Function about «» and “”
 
I haven't yet studied how Regex Functions work, so I don't know if my idea can be translated into a function.
I'd like to find paragraphs in which the numbers of « and » don't match. This can be useful for finding some typical OCR errors in dialogue (« not closed or vice versa, « or » wrongly recognized...)
The same for “ and ”.
Is this possible?
Could someone write something like this?

:thanks:

kovidgoyal 03-26-2016 07:56 AM

Sure, create a regex that matches from the opening <p to the closing </p>. Then count the number of quotes in it; if the numbers do not match, replace the paragraph with a new paragraph that contains some prominent marker added to the text, say "FIXME".

Then just search the book for all occurrences of FIXME.
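
A minimal sketch of that approach in Python, written in the shape of a calibre function-mode callback. The `replace` signature is assumed from the calibre manual's description of function mode; `PARAGRAPH`, `QUOTE_PAIRS`, `is_unbalanced` and `mark_paragraphs` are hypothetical names, and `mark_paragraphs` exists only so the callback can be tried outside calibre.

```python
import re

# Matches one whole paragraph, from the opening <p ...> to its closing </p>.
PARAGRAPH = re.compile(r'<p\b[^>]*>.*?</p>', re.DOTALL)

# Each pair is (opening mark, closing mark).
QUOTE_PAIRS = (('«', '»'), ('“', '”'))


def is_unbalanced(paragraph):
    """Return True if any pair has different opening and closing counts."""
    return any(paragraph.count(o) != paragraph.count(c) for o, c in QUOTE_PAIRS)


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    """Callback in the assumed calibre function-mode shape; only `match` is used."""
    paragraph = match.group()
    if not is_unbalanced(paragraph):
        return paragraph
    # Insert the marker right after the opening <p ...> tag.
    tag_end = paragraph.index('>') + 1
    return paragraph[:tag_end] + 'FIXME ' + paragraph[tag_end:]


def mark_paragraphs(html):
    """Stand-alone driver (hypothetical) for trying the callback outside calibre."""
    return PARAGRAPH.sub(lambda m: replace(m, 0, '', None, None, None, None), html)
```

Inside calibre, the search expression would presumably be the paragraph pattern itself, with (?s) in front so that . also matches line breaks, and the counting logic would live in the function body.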

Toxaris 03-26-2016 09:57 AM

It might be that it can be done via RegEx, but I think that a plugin would be better suited.
Kovid's suggestion will not handle dialogue that spans more than one paragraph. Moreover, depending on the language's conventions, each following paragraph of a multi-paragraph speech may start with another opening/closing quotation mark without the previous paragraph having been closed. That follows the style guide, yet it will produce false hits. There is more to this than it seems...

I have a function like this in my Word add-in, and that one does not rely on RegEx at all. The add-in probably doesn't help you here, since it requires Word and does not work in Calibre.

kovidgoyal 03-26-2016 10:06 AM

The OP did say he specifically wanted to count the number of quotes *in a paragraph*.

senhal 03-26-2016 12:46 PM

Quote:

Originally Posted by Toxaris (Post 3287651)
Kovid's suggestion will not handle dialogue that spans more than one paragraph. Moreover, depending on the language's conventions, each following paragraph of a multi-paragraph speech may start with another opening/closing quotation mark without the previous paragraph having been closed. That follows the style guide, yet it will produce false hits. There is more to this than it seems...

Imho, when you're working on OCR output, it's better to check the false positives too :)
I'm not able to write the function myself; I hope someone is interested in it. Otherwise, I'll try...
Thanks!

Toxaris 03-26-2016 03:02 PM

Quote:

Originally Posted by senhal (Post 3287714)
Imho, when you're working on OCR output, it's better to check the false positives too :)
I'm not able to write the function myself; I hope someone is interested in it. Otherwise, I'll try...
Thanks!

I agree. In my add-in's procedure you always have to verify the cases where an issue arises. The most probable option is offered for quick selection, but manual correction is possible. Anyway, like I said, it will not help you here since it is not for Calibre. I just replied to let you know it can be quite complex.

Tex2002ans 03-31-2016 01:13 AM

Quote:

Originally Posted by senhal (Post 3287591)
I'd like to find paragraphs in which the numbers of « and » don't match. This can be useful for finding some typical OCR errors in dialogue (« not closed or vice versa, « or » wrongly recognized...)

For a very simple check, I used to use a Regex along these lines to search for opening quotes with no closing quotes:

Search: (“[^”]*)</p>

You could probably just substitute guillemets to catch an opening guillemet with no closing guillemet:

Search: («[^»]*)</p>

These simple regexes won't catch other common OCR errors, though, such as two Left Quotation marks in a row.
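
A rough Python sketch of these two searches, plus one extra pattern for the "two opening marks in a row" case. The extra `DOUBLE_OPEN` pattern and the `find_suspects` helper are hypothetical additions, not something from the regexes above. Note that the negated character classes also match across paragraph boundaries, so a hit can span more than one paragraph.

```python
import re

# The two searches above: an opening mark with no closing mark before </p>.
UNCLOSED_CURLY = re.compile(r'(“[^”]*)</p>')
UNCLOSED_GUILLEMET = re.compile(r'(«[^»]*)</p>')

# Hypothetical extra pattern: two opening marks with no closing mark between them.
DOUBLE_OPEN = re.compile(r'“[^”]*“|«[^»]*«')


def find_suspects(html):
    """Return (label, matched text) pairs for every hit, grouped by pattern."""
    checks = (
        ('unclosed “', UNCLOSED_CURLY),
        ('unclosed «', UNCLOSED_GUILLEMET),
        ('double open', DOUBLE_OPEN),
    )
    return [(label, m.group(0)) for label, rx in checks for m in rx.finditer(html)]


for label, text in find_suspects('<p>“Hello, said Dan.</p>'):
    print(label, '->', text)
```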

Quote:

Originally Posted by kovidgoyal (Post 3287622)
Then count the number of quotes in it; if the numbers do not match, replace the paragraph with a new paragraph that contains some prominent marker added to the text, say "FIXME".

A simple count of Left/Right Double Quotes would be a pretty good approximation... but many quotation errors are more nuanced.

Code:

“This is “Tex’s example” error.”

In this case, you have 2 Left Double Quotes, and 2 Right Double Quotes, but the inner ones are wrong. The Outer/Inner quotations must alternate (in English and many other languages, see Wikipedia link below). The inner quotations must be Left/Right Single Quotes (in US English).

Code:

“This is ‘Tex’s fixed example’ error.”

Now you want to check the inner quotes. Whoops, now there is 1 Left Single Quote and 2 Right Single Quotes (one is an apostrophe). You need something more intelligent than a simple Regex to separate apostrophes from actual quotation marks.
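
One crude way to separate the two, sketched in Python under a stated assumption: a ’ with a word character on both sides is treated as an apostrophe and ignored. `APOSTROPHE` and `single_quote_counts` are hypothetical names. The heuristic is far from complete; it still miscounts plural possessives (dogs’) and leading elisions (’tis).

```python
import re

# Assumed heuristic: a ’ between two word characters is an apostrophe, not a quote.
APOSTROPHE = re.compile(r'(?<=\w)’(?=\w)')


def single_quote_counts(text):
    """Return (left, right) single-quote counts after dropping likely apostrophes."""
    stripped = APOSTROPHE.sub('', text)
    return stripped.count('‘'), stripped.count('’')


print(single_quote_counts('“This is ‘Tex’s fixed example’ error.”'))
```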

Also, as Toxaris mentioned, in many cases (novels) there may be a long speech in which each paragraph has an opening quotation mark with no closing quotation mark:

Code:

“This is a sample of a very long speech.

“It continues on for a few paragraphs.

“And then it finally finishes here.”

With the simple count method, you would get WAY too many false positives.
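
A sketch of how the count method could make room for that convention, assuming the paragraphs have already been extracted as plain-text strings. `flag_paragraphs` is a hypothetical helper: it skips a paragraph that has exactly one surplus opening mark when the next paragraph also begins with an opening mark. It only covers the English-style convention shown above.

```python
def flag_paragraphs(paragraphs, opener='“', closer='”'):
    """Return indexes of paragraphs whose quote counts look wrong."""
    flagged = []
    for i, text in enumerate(paragraphs):
        surplus = text.count(opener) - text.count(closer)
        if surplus == 0:
            continue
        # Text of the following paragraph, or '' for the last paragraph.
        nxt = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ''
        if surplus == 1 and nxt.startswith(opener):
            continue  # assumed to be a speech that continues in the next paragraph
        flagged.append(i)
    return flagged


speech = [
    '“This is a sample of a very long speech.',
    '“It continues on for a few paragraphs.',
    '“And then it finally finishes here.”',
]
print(flag_paragraphs(speech))
```

The trade-off is that a genuinely unclosed quote directly followed by a new line of dialogue is no longer flagged.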

Note: As Toxaris mentioned, handling missing quotes due to OCR errors is deceptively complicated. There are variations of quotations in every language:

https://en.wikipedia.org/wiki/Quotation_mark

Some languages open with right quotes + close with left quotes. Some open with low quotes, some close with high quotes. Others have opening/closing quotes facing the same direction... It is really a giant mess. Pretty much every combination imaginable is used in some language. :D

Quote:

Originally Posted by senhal (Post 3287714)
Imho, when you're working on OCR output, it's better to check the false positives too :)

You definitely need manual verification, and the check has to point out EXACTLY where each (potential) error occurs. Having it point out a mismatch at the paragraph level is just not enough. I don't know what sort of paragraphs you guys work with, but the paragraphs I typically work on are HUGE. :P
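
To illustrate what "exactly where" could look like, here is a small Python sketch that walks the text character by character and reports the offset of each suspicious mark. `locate_errors` is a hypothetical function and is not how any existing plugin works; it handles a single, non-nesting pair of marks only.

```python
def locate_errors(text, opener='“', closer='”'):
    """Return (offset, reason) pairs for marks that break open/close alternation."""
    errors = []
    open_at = None  # offset of the currently open quotation, if any
    for pos, ch in enumerate(text):
        if ch == opener:
            if open_at is not None:
                errors.append((pos, 'opening mark inside an open quotation'))
            else:
                open_at = pos
        elif ch == closer:
            if open_at is None:
                errors.append((pos, 'closing mark with nothing open'))
            else:
                open_at = None
    if open_at is not None:
        errors.append((open_at, 'opening mark never closed'))
    return errors


for offset, reason in locate_errors('“This is “Tex’s example” error.”'):
    print(offset, reason)
```

On the balanced-but-wrong example from earlier, this reports the inner opening mark and the final closing mark, which a plain count would miss.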

Note: I did determine that the logic for mismatched parentheses is exactly the same as for mismatched quotation marks. I pushed this idea onto Toxaris and he did implement it in his plugin.

I run every single book I work on through his Dialogue Check to double-check quotations/parentheses, and it has caught tens/hundreds of errors that my previous series of Regex methods missed.

BetterRed 03-31-2016 02:36 AM

:ditto: Tex2002ans on use of Tox' Dialogue Checker and :hatsoff: to him for telling me about it. Is there anything out there, short of Deep Blue or Alpha Go, that will detect this sort of error?

"Golly, it's a beautiful day. I think I'll go down to the beach for the rest of the day, said Dan as he leapt up from the couch. Mary looked at Dan aghast. You been in VR too long baby, it's 20 below outside and we're in Brownlee, Nebraska."

BR

senhal 04-06-2016 03:12 AM

Quote:

Originally Posted by Tex2002ans (Post 3290012)
For a very simple check, I used to use a Regex along these lines to search for opening quotes with no closing quotes

Thanks :)
I'm still using this regex for opening quotes with no closing ones:
Code:

(?<=«[^»]*)</p>\s*<p[^>]*>(?!«)

And vice versa:
Code:

(<p[^>]*>[^«]*([»])(.*)?<[^p>]*/p>)

