03-26-2016, 04:54 AM | #1 |
Connoisseur
Posts: 80
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
|
Regex Function about «» and “”
I haven't yet studied how Regex Functions work, so I don't know if my idea can be translated into a function.
I'd like to find paragraphs in which the numbers of « and » don't match. This can be useful for finding some typical OCR errors in dialogue (« not closed or vice versa, « or » wrongly recognized...). The same goes for “ and ”. Is this possible? Can someone write something like this? |
03-26-2016, 06:56 AM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Sure, create a regex that matches from the opening <p to the closing </p>. Then count the number of quotes in it; if the numbers do not match, replace the paragraph with a new paragraph that contains some prominent marker added to the text, say "FIXME".
Then just search the book for all occurrences of FIXME. |
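A minimal sketch of that idea, for illustration only. The `replace` signature is the one calibre's editor uses for function-mode search and replace; the `PAIRS` tuple, the paragraph regex, and the `mark_unbalanced` helper are assumptions added here so the function can be tried outside calibre. Inside the editor you would search for something like `<p\b[^>]*>.*?</p>` with "dot all" enabled and use only the `replace` function.

```python
import re

# Opening/closing pairs to compare (assumed; extend for other conventions).
PAIRS = (("«", "»"), ("“", "”"))

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    """Return the paragraph unchanged if its quotes balance, else add FIXME."""
    para = match.group()
    if all(para.count(o) == para.count(c) for o, c in PAIRS):
        return para
    # Put the marker right after the opening <p ...> tag.
    return re.sub(r"(<p\b[^>]*>)", r"\1FIXME ", para, count=1)

# Hypothetical helper for testing outside calibre: apply the function to
# every paragraph found by a simple non-greedy paragraph regex.
PARAGRAPH = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL)

def mark_unbalanced(html):
    return PARAGRAPH.sub(lambda m: replace(m, 0, "", None, None, {}, None), html)
```

This only compares counts per paragraph, so a paragraph with the right number of marks in the wrong order still passes.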
03-26-2016, 08:57 AM | #3 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
It might be that it can be done via RegEx, but I think that a plugin would be better suited.
The suggestion from Kovid will not handle dialogue that spans more than one paragraph. Moreover, depending on the language's conventions, each following paragraph of a multi-paragraph speech may start with another opening/closing quotation mark without the previous one having been closed. That follows the style guide and will result in false hits. There is more to this than it seems... I have a function like this in my Word add-in, and that one does not rely on RegEx at all. The add-in probably doesn't help you here, since it requires Word and does not work in Calibre. |
03-26-2016, 09:06 AM | #4 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The OP did say he specifically wanted to count the number of quotes *in a paragraph*.
|
03-26-2016, 11:46 AM | #5 | |
Connoisseur
Posts: 80
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
|
Quote:
I'm not able to write the function myself; I hope someone is interested in it. Otherwise, I'll try... Thanks! |
|
03-26-2016, 02:02 PM | #6 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
I agree. In the procedure you always have to verify the cases where an issue arises. The most probable option is given for quick selection, but manual correction is possible. Anyway, like I said, it will not help you here since it is not for Calibre. I just reacted to let you know it can be quite complex.
|
03-31-2016, 12:13 AM | #7 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Search: (“[^”]*)</p>
You could probably just substitute this in to catch an opening guillemet with no closing guillemet:
Search: («[^»]*)</p>
These simple regexes won't catch other common OCR errors, though, such as two left quotation marks in a row.
Quote:
Code:
“This is “Tex’s example” error.”
Code:
“This is ‘Tex’s fixed example’ error.”
Also, as Toxaris mentioned, in many cases (novels) there may be a long speech in which a paragraph has an opening quotation mark with no closing quotation mark:
Code:
“This is a sample of a very long speech.
“It continues on for a few paragraphs.
“And then it finally finishes here.”
Note: As Toxaris mentioned, handling missing quotes due to OCR errors is deceptively complicated. There are variations of quotations in every language: https://en.wikipedia.org/wiki/Quotation_mark
Some languages open with right quotes and close with left quotes. Some open with low quotes, some close with high quotes. Others have opening/closing quotes facing the same direction... It is really a giant mess. Pretty much every combination imaginable is used in some language.
Quote:
Note: I did determine the logic for mismatching parentheses was exactly the same as mismatching quotation marks. I pushed this idea onto Toxaris and he did implement it in his plugin. I run every single book I work on through his Dialogue Check to double-check quotations/parentheses, and it has caught tens/hundreds that my previous series of Regex methods missed. Last edited by Tex2002ans; 03-31-2016 at 12:20 AM. |
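For illustration, here is a rough sketch of the "same logic as mismatched parentheses" idea: walk through the text and track whether a quote is currently open. This is not the code from Toxaris' plugin; the function names and messages are made up for the example, and the only style rule it knows is the long-speech convention shown above (an unclosed paragraph is accepted when the next paragraph opens with a quote again).

```python
def check_paragraph(text, open_q="“", close_q="”"):
    """Return a list of problem descriptions for one paragraph's text."""
    problems = []
    is_open = False
    for ch in text:
        if ch == open_q:
            if is_open:
                problems.append("opening quote inside an open quote")
            is_open = True
        elif ch == close_q:
            if not is_open:
                problems.append("closing quote without an opening quote")
            is_open = False
    if is_open:
        problems.append("opening quote never closed")
    return problems

def check_paragraphs(paragraphs, open_q="“", close_q="”"):
    """Return (paragraph index, problem) pairs, skipping continued speeches."""
    report = []
    for n, text in enumerate(paragraphs):
        problems = check_paragraph(text, open_q, close_q)
        nxt = paragraphs[n + 1].lstrip() if n + 1 < len(paragraphs) else ""
        # A long speech: this paragraph is left open and the next one
        # starts with an opening quote, so do not report it.
        if problems == ["opening quote never closed"] and nxt.startswith(open_q):
            continue
        report.extend((n, p) for p in problems)
    return report
```

Run on the examples above, the “Tex’s example” line is reported twice (the second opening quote and the final unmatched closing quote), the fixed version passes, and the three-paragraph speech passes. Passing `"«"` and `"»"` checks guillemets the same way.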
|||
03-31-2016, 01:36 AM | #8 |
null operator (he/him)
Posts: 20,572
Karma: 26954694
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Tex2002ans on use of Tox' Dialogue Checker and to him for telling me about it. Is there anything out there, short of Deep Blue or Alpha Go, that will detect this sort of error?
"Golly, it's a beautiful day. I think I'll go down to the beach for the rest of the day, said Dan as he leapt up from the couch. Mary looked at Dan aghast. You been in VR too long baby, it's 20 below outside and we're in Brownlee, Nebraska." BR |
04-06-2016, 02:12 AM | #9 | |
Connoisseur
Posts: 80
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
|
Quote:
I'm still using this regex for opening quotes with no closing ones: Code:
(?<=«[^»]*)</p>\s*<p[^>]*>(?!«)
Code:
(<p[^>]*>[^«]*([»])(.*)?<[^p>]*/p>) |
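A note for anyone trying the first pattern outside the editor: it uses a variable-width lookbehind, which Python's built-in `re` module rejects (as far as I know, calibre's editor accepts it because it relies on the third-party `regex` module). The sketch below is an assumed stdlib-friendly variant, not the pattern from the post: the lookbehind becomes ordinary matching, so the match starts at the « rather than at `</p>`. The helper name is made up for the example.

```python
import re

# Variant of the first pattern without lookbehind (assumption: it is enough
# to know that a match exists, not exactly where </p> sits in it).
OPEN_NOT_CLOSED = re.compile(r"«[^»]*</p>\s*<p[^>]*>(?!«)")

def has_unclosed_guillemet(html):
    """True if a « is still open at a paragraph break and the next
    paragraph does not start with another «."""
    return OPEN_NOT_CLOSED.search(html) is not None
```

A paragraph that stays open but is followed by one starting with « is treated as a continued speech and is not flagged.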
|
|