Old 04-21-2023, 06:28 AM   #1
SilvioTO
Junior Member
SilvioTO began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Nov 2010
Device: none
find dialogues with missing closing inverted commas

I would like to use regex to find the following error in my book:

«dialogue (missing » at the end of phrase).

correct string:
«dialogue»

I tried some patterns, but I have only been able to find text between « and ».
Thanks for your help.
Old 04-21-2023, 07:49 AM   #2
Turtle91
A Hairy Wizard
Turtle91 ought to be getting tired of karma fortunes by now.
Posts: 3,336
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
You should be able to limit your FIND to a single line using the options drop-down. Then use a negative lookahead to limit it to lines that do NOT have the ».

For example:
Find: «(.*?)(?!»)
Replace: «\1»
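
A quick way to see what this pattern matches is to try it outside Sigil. The sketch below uses Python's re module as a stand-in for Sigil's regex engine (an assumption; the two should agree on lazy quantifiers and lookaheads). Because .*? is lazy, it first tries to match nothing, and the lookahead then only checks that the character right after « is not ». The pattern therefore appears to match almost every opening guillemet, closed or not.

```python
import re

# Pattern from the post above, tried in Python's re module rather than in Sigil.
PATTERN = re.compile(r'«(.*?)(?!»)')

closed = '«Hello» she said.'
unclosed = '«Hello she said.'

# The lazy group matches nothing; the lookahead only inspects the next character.
closed_match = PATTERN.search(closed)
unclosed_match = PATTERN.search(unclosed)

print(repr(closed_match.group(0)), repr(closed_match.group(1)))
print(repr(unclosed_match.group(0)), repr(unclosed_match.group(1)))
```

In both cases the match is just the « itself with an empty capture group, so this pattern alone cannot separate closed dialogue from unclosed dialogue.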

Last edited by Turtle91; 04-21-2023 at 07:57 AM.
Old 04-21-2023, 09:36 AM   #3
SilvioTO
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2010
Device: none
In the options of the Find and Replace panel at the bottom center of the Sigil window, I cannot find a single-line option.
The pattern «(.*?)(?!») finds all phrases that start with the « character, with or without a closing ».

Where am I going wrong?
Old 04-21-2023, 10:47 AM   #4
Turtle91
A Hairy Wizard
Posts: 3,336
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Sorry, I’m away from my computer, but IIRC, you want to un-check Dot-All.

From the Users guide:
Quote:
DotAll: This regex option prepends (?s) to all regex searches and is used when you want .* to match any character, even across lines.
Old 04-21-2023, 10:48 AM   #5
KevinH
Sigil Developer
KevinH ought to be getting tired of karma fortunes by now.
 
Posts: 8,652
Karma: 5703586
Join Date: Nov 2009
Device: many
Uncheck the Regex flag "Dot All". "Dot All" means that a "." will match all characters, including a newline character. By unchecking it, you limit wildcard matching to a single line.

FWIW, using that regex and Sigil's replacement table, you should be able to quickly scan all matches for the few problem cases. If not, try a different regex to limit things even further.

Check out the Sigil Users Guide for more info.
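
The effect of the Dot All flag can be reproduced outside Sigil. This is a minimal sketch in Python's re module (an assumption: Sigil's own engine is not used here, but the (?s) prefix mentioned in the Users Guide quote means the same thing in both). The sample text is made up for illustration.

```python
import re

text = '<p>«This dialogue starts here\nand ends on the next line.»</p>'

# Without Dot All, "." stops at the line break, so the match cannot reach the ».
single_line = re.search(r'«.*»', text)

# With (?s) prepended, "." also matches the newline and the match spans both lines.
dot_all = re.search(r'(?s)«.*»', text)

print(single_line)
print(dot_all.group(0))
```

Note that a negated character class such as [^»] matches line breaks whether or not Dot All is set, which is why later patterns in this thread exclude \r and \n explicitly.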
Old 04-22-2023, 05:19 AM   #6
SilvioTO
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2010
Device: none
Thank you both for your help, you have put me on the right track.
The pattern «(.*?)(?!») doesn't work even though I unchecked the "DotAll" flag, so my next step will be to study regex syntax more closely in the Sigil manual.

Have a very nice day and... thanks again.
Old 04-22-2023, 08:21 AM   #7
Turtle91
A Hairy Wizard
Posts: 3,336
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
That should work.

Can you post an example of the code you are searching through so we can see where the issue might be?
Old 04-22-2023, 08:35 PM   #8
Tex2002ans
Wizard
Tex2002ans ought to be getting tired of karma fortunes by now.
 
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by SilvioTO View Post
I would like to use regex to find the following error in my book:

«dialogue (missing » at the end of phrase).

correct string:
«dialogue»
This is the Regex I used for many years:

Search: (“[^”\r\n]*)</p>

Instead of using LEFT/RIGHT DOUBLE QUOTES, substitute whatever quotations you need for your language.

In your case, you'll use left/right guillemets:

Search: («[^»\r\n]*)</p>

This will catch lines 1+3:

Code:
<p>«This is a test.</p>
<p>And this is more «This is a test.» And more.</p>
<p>Testing. «This is a test.</p>
<p>«This is a test.»</p>
but skip 2+4.
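
The four-line test above can be checked mechanically. A minimal sketch, assuming Python's re module behaves like Sigil's engine for this pattern (true for plain character classes, but not verified inside Sigil):

```python
import re

# Opening guillemet, then anything except » or a line break, up to </p>.
MISSING_CLOSE = re.compile(r'(«[^»\r\n]*)</p>')

sample = [
    '<p>«This is a test.</p>',
    '<p>And this is more «This is a test.» And more.</p>',
    '<p>Testing. «This is a test.</p>',
    '<p>«This is a test.»</p>',
]

# Collect the 1-based numbers of the lines the pattern flags.
flagged = [n for n, line in enumerate(sample, start=1) if MISSING_CLOSE.search(line)]
print(flagged)
```

Lines 1 and 3 are flagged because nothing but ordinary text sits between the « and the </p>; in lines 2 and 4 the » stops the character class before </p> is reached.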

- - -

Side Note: Nowadays, I use Toxaris's EPUB Tools + "Dialogue Check", which is/was the ultimate way to catch all quotation mark errors. I wrote about it in detail back in:

While the pure regex method works in most cases, it will not work on heavily nested (or mismatched) quotation marks.

To catch all left/right or outer/inner quotation mark errors, you definitely need something smarter that:
  • goes forwards/backwards.
    • Catching really hard cases + paragraphs with multiple sets of dialogue.
  • handles false positives like apostrophes vs. SINGLE QUOTES.
    • Saving tons of time.

For a little more info on that, see my post in:

Sadly, Toxaris's EPUB Tools only works in Microsoft Word... and as of a few years ago, Toxaris stopped maintaining it + his site went down.

I did save a copy + attach it to this 2022 post though.

- - -

Quote:
Originally Posted by SilvioTO View Post
Tried some code, but I only be able to find a text between « and ».
Heh, I wrote about that in detail too:

If you ever needed to catch/tag all dialogue in a book for some reason, there's your answer too.

Last edited by Tex2002ans; 04-22-2023 at 09:00 PM.
Old 04-23-2023, 04:51 AM   #9
JSWolf
Resident Curmudgeon
JSWolf ought to be getting tired of karma fortunes by now.
Posts: 79,532
Karma: 145863177
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by Turtle91 View Post
You should be able to limit your FIND to a single line using the options drop-down. Then use a negative lookahead to limit it to lines that do NOT have the »

For example:
Find: «(.*?)(?!»)
Replace: «\1»
But what if the dialog is something like «Hello» she said.

You would end up with «Hello she said.»
Old 04-23-2023, 05:04 AM   #10
SilvioTO
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2010
Device: none
The solution proposed by Tex2002ans works perfectly and is a great help. However, to be perfect, it should also detect errors when two pieces of dialogue are in the same paragraph. In the following example, only the last missing closing guillemet is detected, not the first one.

Code:
<p>«Have you been here a long time Terry?» ... «About five days» he replied.</p>
Another solution I found is less sophisticated and requires more visual attention, but, in case of errors, it highlights anomalies in the code that easily catch the eye making it useful:

Code:
«[^»]*»
It simply selects every sentence enclosed by a pair of guillemets, even if several occur within the same paragraph.

Perhaps making two passes of the entire text with the two solutions might eliminate errors altogether.
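
The "anomaly that catches the eye" with the «[^»]*» pattern is a match that swallows a second « before it reaches a ». That visual check can be sketched in code. This is an illustration in Python, not something Sigil does; the helper name suspicious_pairs is hypothetical.

```python
import re

# Same pattern as above: « followed by anything except », then ».
PAIRED = re.compile(r'«[^»]*»')

def suspicious_pairs(text):
    """Return matches that contain a second « before the closing »."""
    return [m.group(0) for m in PAIRED.finditer(text) if '«' in m.group(0)[1:]]

good = '<p>«Have you been here a long time Terry?» ... «About five days» he replied.</p>'
first_missing = '<p>«Have you been here a long time Terry? ... «About five days» he replied.</p>'

print(suspicious_pairs(good))
print(suspicious_pairs(first_missing))
```

An opening « with no » anywhere after it is simply never matched by this pattern, so a missing final close shows up only as text that is not highlighted. That gap is what the </p>-anchored search is meant to cover.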
Old 04-23-2023, 04:59 PM   #11
Tex2002ans
Wizard
 
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by SilvioTO View Post
The solution proposed by Tex2002ans works perfectly and is a great help. However, to be perfect, it should be able to detect errors when two sentences are in the same paragraph.
Then get rid of that final </p>:

Find: («[^»\r\n]*)

This will match from each opening quote all the way to a closing quote OR the end of the line:
  • <p>«Have you been here a long time Terry?» ... «About five days» he replied.</p>

and it will catch stuff like:
  • <p>«Have you been here a long time Terry?» ... «About five days he replied.</p>
    • 2nd close missing.
  • <p>«Have you been here a long time Terry? ... «About five days» he replied.</p>
    • 1st close missing.
  • <p>«Have you been here a long time Terry? ... «About five days he replied.</p>
    • 1st AND 2nd close missing.
    • (Although you'd only catch 2nd after correcting the 1st one.)

But now you'll have an absolute TON of false positives to look through...

Quote:
Originally Posted by SilvioTO View Post
Another solution I found is less sophisticated and requires more visual attention, but, in case of errors, it highlights anomalies in the code that easily catch the eye making it useful:
Heh. I'd strongly recommend learning about Sigil's fantastic:
  • Tools > Saved Searches

This lets you create and save multiple Find/Replaces + organize them into Groups.

As described below, you can come up with quite a few sets of "try to catch missing open/close quotes" Regular Expressions.

Those regular expressions alone will carry you like 90% of the way there... but it's that final 10% that's really troublesome.

Quote:
Originally Posted by SilvioTO View Post
Perhaps making two passes of the entire text with the two solutions might eliminate errors altogether.
No. It would require WAY more than 2 sets/passes of Regular Expressions. You have to:
  • Deal with outer/inner quotes separately.
  • Catch 2 lefts in a row.
  • Catch 2 rights in a row.
  • Catch 1 open, missing close.
  • Catch 1 close, missing open.

AND you have to also search forwards/backwards.

Then, you'll want to expand it:
  • Deal with multiple dialogues in a single paragraph.
    • THIS is where Regex is pretty dumb.
    • A program can be in an "ON/OFF state" based on each left/right quote + where it is in the paragraph + which stage of the process it is in.
  • (Optional) Deal with long paragraphs of continual dialogue.
    • Common in Fiction.
    • Each paragraph will only start with opening quotes. There is no close quote until the long monologue is completed.

AND you have to deal with (or skip) false positives. (Is this an apostrophe or a single quotation mark?)

AND then you have to deal with all the potential HTML mess in the way too. (<span>, <i>, <em>, class="", [...])

AND then you have to handle inner/outer quotes of all the different languages:
  • « » = French
  • “ ” + ‘ ’ = English (US)
  • ‘ ’ + “ ” = English (UK)
  • „ “ = German

The same simple left/right logic will not work for:
  • » « = Danish
  • ” ” + ’ ’ = Swedish
  • „ ” = Hungarian

(That's not a complete list of steps though, but only what I can quickly think of off the top of my head!)

I went into much more detail in the linked posts/threads...

- - -

Toxaris already solved all those issues in his "Dialogue Check" though.

What you would need is a smart program—not just pure regex—that handles outer/inner quotes, and lets you select between them all as needed.

The logic is effectively the same for all languages, just that the symbols switch.
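
To make the "ON/OFF state" idea concrete, here is a minimal sketch of a single-pass checker with selectable symbols. It is not Toxaris's "Dialogue Check" algorithm, which this thread does not describe in detail; the function name check_quotes and its messages are hypothetical, and it deliberately ignores nesting, HTML tags, and multi-paragraph monologues.

```python
def check_quotes(paragraph, open_q="«", close_q="»"):
    """Walk one paragraph and report unbalanced quotes as (index, problem) pairs."""
    problems = []
    open_at = None  # index of the currently open quote, or None when "OFF"
    for i, ch in enumerate(paragraph):
        if ch == open_q:
            if open_at is not None:
                # A new open arrived while the previous one was still open.
                problems.append((open_at, "opened but never closed"))
            open_at = i
        elif ch == close_q:
            if open_at is None:
                problems.append((i, "closed but never opened"))
            open_at = None
    if open_at is not None:
        problems.append((open_at, "opened but never closed"))
    return problems

print(check_quotes('«a» b «c» d'))
print(check_quotes('«a b «c d'))
print(check_quotes('“Hi,” she said.', open_q='“', close_q='”'))
```

Unlike a single regex pass, this reports both missing closes in the second example at once. Languages that use the same character on both sides (such as the Swedish ” ”) would need different logic, because this sketch relies on the open and close symbols being distinct.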

- - -

And some, like English or Swedish, will have a ton more false positives to look through.

Luckily though, you're using guillemets (French?), which is WAY easier to handle compared to English.

In English, you have:
  • “ ” + ‘ ’

and that 2nd set is the worst, because ’ is used for all sorts of things.
  • “The name of the article I read was ‘Zeus’ Wife’s Example’.”

1 + 4 are the actual inner quotes.

2 + 3 are actually apostrophes.

If all you're doing is checking Left/Right inner quotes, a dumb algorithm will just think you have:
  • 1 LEFT + 3 RIGHT quotes
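
The count is easy to reproduce. The short Python sketch below tallies the single quotes in the example sentence and then tries one hypothetical heuristic (a ’ with a letter on both sides is probably an apostrophe), which is an assumption for illustration and not a rule from this thread.

```python
import re

sentence = '“The name of the article I read was ‘Zeus’ Wife’s Example’.”'

# Naive tally of left and right single quotation marks.
left_count = sentence.count('‘')
right_count = sentence.count('’')

# Hypothetical heuristic: ’ between two word characters is likely an apostrophe.
likely_apostrophes = re.findall(r'(?<=\w)’(?=\w)', sentence)

print(left_count, right_count, len(likely_apostrophes))
```

The heuristic only recognises the apostrophe in "Wife’s". The possessive after "Zeus" is followed by a space, so it still looks like a closing quote, which is the kind of false positive described here.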

An algorithm that handles a lot of those edge-cases I mentioned above—plus searching forward/backwards—would catch different errors at different steps.

PLUS it'll minimize the false positives, which is the real time-waster/time-killer.

I explained a lot of this in:

With the pure regex method, you're wasting so much time looking through thousands of correct quotation marks, only to catch that small minority of actual typos/errors/mistakes.

Take it from me... I've probably corrected more quotation mark errors than anyone else on these boards—combined! lol.

Last edited by Tex2002ans; 04-23-2023 at 05:28 PM.
Old 04-23-2023, 07:23 PM   #12
rkomar
Wizard
rkomar ought to be getting tired of karma fortunes by now.
 
Posts: 3,049
Karma: 18821071
Join Date: Oct 2010
Location: Sudbury, ON, Canada
Device: PRS-505, PB 902, PRS-T1, PB 623, PB 840, PB 633
Can you not just search for two opening quotes without a closing quote between them?
Old 04-23-2023, 10:12 PM   #13
Tex2002ans
Wizard
 
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by rkomar View Post
Can you not just search for two opening quotes without a closing quote between them?
Heh, yes, this will catch some of the punctuation errors.

2 LEFTs in a row:

Find: («[^»\r\n]*)«
  • <p>«Have you been here a long time Terry?» ... «About five days» he replied.</p>
  • <p>«Have you been here a long time Terry?» ... «About five days he replied.</p>
  • <p>«Have you been here a long time Terry? ... «About five days» he replied.</p>
  • <p>«Have you been here a long time Terry? ... «About five days he replied.</p>
  • <p>«Have you been here a long time Terry?« ... «About five days he replied.</p>
    • Wrongly flipped guillemet / OCR error.
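
Which of the five sample paragraphs this search actually hits can be checked with a short script. A minimal sketch, assuming Python's re module is a fair stand-in for Sigil's engine on this pattern:

```python
import re

# An opening guillemet followed by another one with no » in between.
TWO_LEFTS = re.compile(r'(«[^»\r\n]*)«')

samples = [
    '<p>«Have you been here a long time Terry?» ... «About five days» he replied.</p>',
    '<p>«Have you been here a long time Terry?» ... «About five days he replied.</p>',
    '<p>«Have you been here a long time Terry? ... «About five days» he replied.</p>',
    '<p>«Have you been here a long time Terry? ... «About five days he replied.</p>',
    '<p>«Have you been here a long time Terry?« ... «About five days he replied.</p>',
]

# 1-based numbers of the paragraphs where two lefts occur in a row.
flagged = [n for n, line in enumerate(samples, start=1) if TWO_LEFTS.search(line)]
print(flagged)
```

Paragraphs 3, 4, and 5 are flagged. Paragraph 2, where only the final close is missing, is not, so this search complements the </p>-anchored one and does not replace it.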

So, if you want the pure regex method, you'd create a big ol' collection of "Saved Searches", going from the easy-to-catch stuff—like 2 LEFTs or 2 RIGHTs in a row—all the way down to the hardest-to-catch.
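
A collection like that can be prototyped as an ordered list of named patterns before it is entered into Sigil's Saved Searches. This is only a sketch: the names are made up, and the "two rights in a row" pattern is my mirror image of the two-lefts pattern, not one posted in this thread.

```python
import re

# Ordered from easy-to-catch to harder; each entry is (label, compiled pattern).
SAVED_SEARCHES = [
    ("two lefts in a row", re.compile(r'«[^»\r\n]*«')),
    ("two rights in a row", re.compile(r'»[^«\r\n]*»')),   # assumed mirror pattern
    ("open, missing close", re.compile(r'«[^»\r\n]*</p>')),
]

def run_searches(line):
    """Return the labels of every saved search that matches the line."""
    return [name for name, rx in SAVED_SEARCHES if rx.search(line)]

print(run_searches('<p>«A» b «C» d.</p>'))
print(run_searches('<p>«A b «C» d.</p>'))
print(run_searches('<p>«A» b C» d.</p>'))
print(run_searches('<p>«A» b «C d.</p>'))
```

Each hit still needs a human look, since none of these patterns can tell where the dialogue was supposed to end.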

Last edited by Tex2002ans; 04-23-2023 at 10:22 PM.
Old 04-24-2023, 03:42 AM   #14
SilvioTO
Junior Member
Posts: 7
Karma: 10
Join Date: Nov 2010
Device: none
[SOLVED]

Thank you very much Tex2002ans for the valuable advice and clarity in expounding it.
As for me, I can only say that I have solved the problem.
Thanks again to everyone.
Old 04-24-2023, 09:55 PM   #15
Turtle91
A Hairy Wizard
Posts: 3,336
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Edit: This is an answer to Jon's question...not trying to repeat what Tex already answered above...


Quote:
Originally Posted by JSWolf View Post
But what if the dialog is something like «Hello» she said.

You would end up with «Hello she said.»
Sorry it took so long to reply. I was off being a kid in Orlando...

To answer your question, Jon: no. In general, the ? in the capture group forces it to be a minimal match so it would stop capturing at the first legal stopping location. Your example has a closing » so it would not trigger a match.

However, there are two points that I would change based on the discussion above (and my chance to test now that I'm home - which I couldn't do on my phone).
  • Instead of using the negative lookahead on the » I would use a positive lookahead on the </p>
  • I would add the » to a 'not' in the capture group

Code:
«(.[^»]*?)(?=</p>)
That would capture any text preceded by a « that contains no » before the closing </p>, without capturing the </p> itself (i.e., just the text).
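
Here is how that Find and the «\1» Replace behave on JSWolf's example, sketched in Python's re module (an assumption: Sigil is not involved, though \1 backreferences work the same way in re.sub). The helper name add_missing_close is hypothetical.

```python
import re

# Revised pattern: «, one character, then a lazy run without », up to (not including) </p>.
PATTERN = re.compile(r'«(.[^»]*?)(?=</p>)')

def add_missing_close(line):
    """Wrap the captured text in guillemets, as the «\\1» replacement would."""
    return PATTERN.sub(r'«\1»', line)

print(add_missing_close('<p>«Hello» she said.</p>'))
print(add_missing_close('<p>«Hello she said.</p>'))
```

The correctly closed paragraph is left alone. The unclosed one becomes «Hello she said.», with the narration pulled inside the guillemets, which is why every proposed replacement still needs to be reviewed by hand.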

You are correct that the «\1» would place guillemets around the entire captured text, but this process was never intended to be completely automatic. There is no way for the Find/Replace to discern what is dialogue and what is not. If you use the "Replacements (Delete Unwanted Replacements)" table (see attached image), you would quickly be able to see which sections have an opening « with no closing » highlighted, along with what each would look like after the replace. As KevinH mentioned above, you could quickly remove any lines that didn't match the change you wanted, apply the changes, then run the find again with different replace criteria.

I.e., the above would not find "<p>«Hello she said. «Hello she said.»</p>" until you ran a different Find with the « in place of the lookahead </p>, like: «(.[^»</p>]*?)«
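
One caveat about that second pattern: inside square brackets, </p> is not treated as a tag. [^»</p>] excludes the individual characters », <, /, p, and >, so any letter p in the dialogue stops the match. The sketch below shows this in Python's re module, with one possible alternative borrowed from the [^»\r\n] class used earlier in the thread (assuming each paragraph sits on its own line).

```python
import re

AS_POSTED = re.compile(r'«(.[^»</p>]*?)«')
ALTERNATIVE = re.compile(r'«([^»\r\n]*?)«')  # assumes one paragraph per line

no_p = '<p>«Hello she said. «Hello she said.»</p>'
with_p = '<p>«Stop it. «Hello she said.»</p>'

# The posted pattern works only while the text between the two « has no "p".
print(bool(AS_POSTED.search(no_p)), bool(AS_POSTED.search(with_p)))
print(bool(ALTERNATIVE.search(no_p)), bool(ALTERNATIVE.search(with_p)))
```

The posted pattern finds the first paragraph but misses the second because of the "p" in "Stop"; the alternative finds both.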

That would probably be the fastest (least # of iterations) way to search through the text to find those types of errors.
Attached image: Screenshot 2023-04-24 213934.png

Last edited by Turtle91; 04-24-2023 at 10:08 PM.
All times are GMT -4.