03-26-2016, 04:54 AM   #1
senhal
Connoisseur
Posts: 80
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
Regex Function about «» and “”

I haven't yet studied how Regex Functions work, so I don't know whether my idea can be turned into a function.
I'd like to find paragraphs in which the numbers of « and » don't match. This can be useful for finding some typical OCR errors in dialogue (« not closed or vice versa, « or » wrongly recognized...).
The same for “ and ”.
Is this possible?
Can someone write something like this?

03-26-2016, 06:56 AM   #2
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Sure, create a regex that matches from the opening <p to the closing </p>. Then count the number of quotes in it; if the numbers do not match, replace the paragraph with a new paragraph that has some prominent marker added to the text, say "FIXME".

Then just search the book for all occurrences of FIXME.
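
A minimal sketch of what such a function could look like, in case it helps. The `replace(...)` signature is the one calibre's editor uses for function mode; everything else (the `PAIRS` and `MARKER` names, the helper functions, and the idea of putting the marker right after the opening tag) is an assumption for illustration. It would be paired with a search expression along the lines of `<p\b[^>]*>.*?</p>` with "dot matches all" enabled, which is also just a guess at the markup.

```python
import re

# Quote pairs to compare (guillemets and curly double quotes, as asked in post #1).
PAIRS = (("«", "»"), ("“", "”"))
MARKER = "FIXME"  # hypothetical marker text, as suggested in the post above


def is_balanced(paragraph):
    """True if each opening mark occurs as often as its closing mark."""
    return all(paragraph.count(o) == paragraph.count(c) for o, c in PAIRS)


def flag_paragraph(paragraph):
    """Return the paragraph unchanged, or with MARKER after its opening <p> tag."""
    if is_balanced(paragraph):
        return paragraph
    return re.sub(r"(<p\b[^>]*>)", r"\g<1>" + MARKER + " ", paragraph, count=1)


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # calibre calls this once for every match of the search expression.
    return flag_paragraph(match.group())
```

Running the replace a second time would add a second marker to paragraphs that are still unbalanced, so the markers are best removed once each paragraph has been checked.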
03-26-2016, 08:57 AM   #3
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
It might be possible via RegEx, but I think that a plugin would be better suited.
The suggestion from Kovid will not handle dialogue that spans more than one paragraph. Not only that, but depending on the language's conventions, each following paragraph of a multi-paragraph quotation may start with another opening/closing quotation mark without the previous one having been closed first. That is correct according to the style guide, yet it will result in false hits. There is more to this than it seems...

I have a function like this in my Word add-in, and that one does not rely on RegEx at all. The add-in probably doesn't help you here, since it requires Word and does not work in Calibre.
03-26-2016, 09:06 AM   #4
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The OP did say he specifically wanted to count the number of quotes *in a paragraph*.
03-26-2016, 11:46 AM   #5
senhal
Connoisseur
Posts: 80
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
Quote:
Originally Posted by Toxaris View Post
The suggestion from Kovid will not handle dialogue that spans more than one paragraph. Not only that, but depending on the language's conventions, each following paragraph of a multi-paragraph quotation may start with another opening/closing quotation mark without the previous one having been closed first. That is correct according to the style guide, yet it will result in false hits. There is more to this than it seems...
IMHO, when you're working on OCR output, it's better to check the false positives as well.
I'm not able to write the function myself; I hope someone is interested in it. Otherwise, I'll try...
Thanks!
03-26-2016, 02:02 PM   #6
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
Quote:
Originally Posted by senhal View Post
IMHO, when you're working on OCR output, it's better to check the false positives as well.
I'm not able to write the function myself; I hope someone is interested in it. Otherwise, I'll try...
Thanks!
I agree. In the procedure you always have to verify the cases where an issue arises. The most probable option is offered for quick selection, but manual correction is possible. Anyway, like I said, it will not help you here since it is not for Calibre. I just replied to let you know it can be quite complex.
03-31-2016, 12:13 AM   #7
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by senhal View Post
I'd like to find paragraphs in which the numbers of « and » don't match. This can be useful for finding some typical OCR errors in dialogue (« not closed or vice versa, « or » wrongly recognized...).
For a very simple check, I used to use a Regex along these lines to search for opening quotes with no closing quotes:

Search: (“[^”]*)</p>

You could probably just substitute this in to catch some opening guillemet with no closing guillemet:

Search: («[^»]*)</p>

Although these simple regexes won't catch other common OCR errors such as two Left Quotation marks in a row.
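
To make the behaviour of the two searches concrete, here is a small sketch that runs them outside the editor with Python's standard `re` module. The sample paragraphs and the `unclosed_quote` helper are made up for illustration; only the two patterns come from the post.

```python
import re

# The two search expressions from above, unchanged.
UNCLOSED_DOUBLE = re.compile(r"(“[^”]*)</p>")
UNCLOSED_GUILLEMET = re.compile(r"(«[^»]*)</p>")


def unclosed_quote(pattern, html):
    """Return the text from the unclosed opening mark onward, or None."""
    m = pattern.search(html)
    return m.group(1) if m else None


# An opening quote that never closes before </p> is reported.
print(unclosed_quote(UNCLOSED_DOUBLE, "<p>“Hello, he said.</p>"))
# A properly closed quote is not.
print(unclosed_quote(UNCLOSED_DOUBLE, "<p>“Hello,” he said.</p>"))
# Two left quotes in a row slip through, because a closing quote still follows.
print(unclosed_quote(UNCLOSED_DOUBLE, "<p>““Hello,” he said.</p>"))
```

Note that `[^”]*` happily runs across paragraph boundaries, so a match tells you where an unclosed quote starts, not that the rest of the file is clean.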

Quote:
Originally Posted by kovidgoyal View Post
Then count the number of quotes in it; if the numbers do not match, replace the paragraph with a new paragraph that has some prominent marker added to the text, say "FIXME"
A simple count of Left/Right Double Quotes would be a pretty good approximation... but many of the quotation errors are more nuanced.

Code:
“This is “Tex’s example” error.”
In this case, you have 2 Left Double Quotes, and 2 Right Double Quotes, but the inner ones are wrong. The Outer/Inner quotations must alternate (in English and many other languages, see Wikipedia link below). The inner quotations must be Left/Right Single Quotes (in US English).

Code:
“This is ‘Tex’s fixed example’ error.”
Now you want to check the inner quotes. Whoops, now there is 1 Left Single Quote and 2 Right Single Quotes (one is an apostrophe). You need something more intelligent than a simple Regex to sift out apostrophes from actual quotation marks.
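
A rough sketch of the two checks described above, assuming US English conventions. The function names are hypothetical, and the apostrophe rule (a Right Single Quote with a letter on both sides is treated as an apostrophe) is only a heuristic: it will misjudge trailing apostrophes such as in "the dogs’ bowls".

```python
import re


def nested_double_quotes(text):
    """Positions of Left Double Quotes opened while another is still open."""
    positions = []
    depth = 0
    for i, ch in enumerate(text):
        if ch == "“":
            if depth > 0:
                positions.append(i)  # same-level nesting: should be a single quote
            depth += 1
        elif ch == "”" and depth > 0:
            depth -= 1
    return positions


# Heuristic: ’ between two word characters is an apostrophe, not a closing quote.
APOSTROPHE = re.compile(r"(?<=\w)’(?=\w)")


def single_quote_counts(text):
    """Count Left/Right Single Quotes after dropping likely apostrophes."""
    stripped = APOSTROPHE.sub("", text)
    return stripped.count("‘"), stripped.count("’")


print(nested_double_quotes("“This is “Tex’s example” error.”"))
print(single_quote_counts("“This is ‘Tex’s fixed example’ error.”"))
```

The first call reports the position of the wrongly nested inner quote, and the second shows the single quotes balancing once the apostrophe in "Tex’s" is discounted.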

Also, as Toxaris mentioned, in many cases (novels) there may be a long speech, and there is an opening quotation mark with no closing quotation mark:

Code:
“This is a sample of a very long speech.

“It continues on for a few paragraphs.

“And then it finally finishes here.”
With the simple count method, you would get WAY too many false positives.
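
One way to cut down those false positives, sketched under the assumption that the paragraphs have already been extracted as plain-text strings: treat a paragraph with exactly one extra opening quote as fine when the next paragraph itself starts with an opening quote. The `suspicious_paragraphs` name and its parameters are made up for this example, and the rule only fits languages that use this continuation convention.

```python
def suspicious_paragraphs(paragraphs, open_q="“", close_q="”"):
    """Indexes of paragraphs whose quote counts differ, ignoring continued speech."""
    flagged = []
    for i, para in enumerate(paragraphs):
        opens, closes = para.count(open_q), para.count(close_q)
        if opens == closes:
            continue
        following = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        # One unclosed quote followed by a paragraph that reopens the quote
        # is the normal way to typeset a speech spanning several paragraphs.
        if opens == closes + 1 and following.startswith(open_q):
            continue
        flagged.append(i)
    return flagged


speech = [
    "“This is a sample of a very long speech.",
    "“It continues on for a few paragraphs.",
    "“And then it finally finishes here.”",
]
print(suspicious_paragraphs(speech))
print(suspicious_paragraphs(["“Unclosed quote.", "Plain narration."]))
```

The long speech is left alone, while the unclosed quote followed by plain narration is still reported. It still only points at a paragraph, not at the exact character.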

Note: As Toxaris mentioned, handling missing quotes due to OCR errors is deceptively complicated. There are variations of quotations in every language:

https://en.wikipedia.org/wiki/Quotation_mark

Some languages open with right quotes + close with left quotes. Some open with low quotes, some close with high quotes. Others have opening/closing quotes facing the same direction... It is really a giant mess. Pretty much every combination imaginable is used in some language.

Quote:
Originally Posted by senhal View Post
IMHO, when you're working on OCR output, it's better to check the false positives as well.
You definitely need manual verification, and a tool that points out EXACTLY where each (potential) error occurs. Flagging a mismatch at the paragraph level is just not enough. I don't know what sort of paragraphs you guys work with, but the paragraphs I typically work on are HUGE. :P

Note: I did determine the logic for mismatching parentheses was exactly the same as mismatching quotation marks. I pushed this idea onto Toxaris and he did implement it in his plugin.

I run every single book I work on through his Dialogue Check to double-check quotations/parentheses, and it has caught tens/hundreds that my previous series of Regex methods missed.

Last edited by Tex2002ans; 03-31-2016 at 12:20 AM.
03-31-2016, 01:36 AM   #8
BetterRed
null operator (he/him)
Posts: 20,572
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
+1 to Tex2002ans on the use of Tox's Dialogue Checker, and thanks to him for telling me about it. Is there anything out there, short of Deep Blue or AlphaGo, that will detect this sort of error?

"Golly, it's a beautiful day. I think I'll go down to the beach for the rest of the day, said Dan as he leapt up from the couch. Mary looked at Dan aghast. You been in VR too long baby, it's 20 below outside and we're in Brownlee, Nebraska."

BR
04-06-2016, 02:12 AM   #9
senhal
Connoisseur
Posts: 80
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
Quote:
Originally Posted by Tex2002ans View Post
For a very simple check, I used to use a Regex along these lines to search for opening quotes with no closing quotes
Thanks.
I'm still using this regex for opening quotes with no closing ones:
Code:
(?<=«[^»]*)</p>\s*<p[^>]*>(?!«)
And vice versa:
Code:
(<p[^>]*>[^«]*([»])(.*)?<[^p>]*/p>)
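
For anyone who wants to try these outside the editor, a hedged sketch in Python. As far as I know, calibre's editor uses the third-party `regex` module, which accepts the variable-length lookbehind in the first expression; Python's standard `re` module rejects it. So the first pattern below is a rewrite, not the original: the lookbehind is turned into an ordinary prefix, which means the match now also includes the text from « onward. The second pattern is verbatim. The sample paragraphs and the constant names are invented.

```python
import re

# Rewrite of the first regex for the stdlib `re` module (lookbehind -> plain prefix).
UNCLOSED_OPEN = re.compile(r"«[^»]*</p>\s*<p[^>]*>(?!«)")
# Second regex, exactly as posted above.
STRAY_CLOSE = re.compile(r"(<p[^>]*>[^«]*([»])(.*)?<[^p>]*/p>)")

# « never closed, and the next paragraph does not reopen it: reported.
print(bool(UNCLOSED_OPEN.search("<p>«Ciao, disse.</p>\n<p>Poi uscì.</p>")))
# Next paragraph starts with «, so it is treated as continued dialogue: skipped.
print(bool(UNCLOSED_OPEN.search("<p>«Ciao, disse.</p>\n<p>«Poi uscì».</p>")))
# » with no « before it in the paragraph: reported.
print(bool(STRAY_CLOSE.search("<p>Ciao» disse.</p>")))
# Balanced pair: not reported.
print(bool(STRAY_CLOSE.search("<p>«Ciao» disse.</p>")))
```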
All times are GMT -4.