Old 11-15-2023, 05:03 PM   #1
jwes
Enthusiast
Posts: 39
Karma: 10
Join Date: Jul 2023
Device: none
Problems with regex and text flag

I was using this regex: "(.*?)" (the quotes are part of the regex) on this paragraph:

<p class="msonormal3">“Ah,” Pen Rel said again, and inclined his head. "Mostly, it is a matter of temperature control.<br class="calibre13"/>How much simpler, after all, to let the wandering air take the heat away than to condition the dock entire.”</p>

and the found text was:

"Mostly, it is a matter of temperature control.<br class="

Also ^ and $ don't work how I would expect. They seem to match only after and before newlines in text rather than the start and end of paragraph text.

I also find it very difficult to distinguish between “, ", and ” in the find and replace boxes.

Last edited by jwes; 11-15-2023 at 05:20 PM.
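
A minimal sketch of the anchor behaviour described in this post, written with Python's `re` module rather than Sigil's PCRE engine (the assumption is that both treat `^` and `$` the same way in multiline mode). The anchors are line anchors and know nothing about `<p>` elements, so one option for anchoring to the start of paragraph text is to match the opening tag itself. The sample markup is hypothetical.

```python
import re

# Hypothetical XHTML fragment; the paragraph text does not start a line.
html = '<body>\n  <p class="a">First words of the paragraph.</p>\n</body>'

# In multiline mode ^ matches at the start of the string and after each newline,
# regardless of where any element begins.
line_starts = [m.start() for m in re.finditer(r'^', html, re.MULTILINE)]

# Anchoring to the paragraph instead: capture the text that follows the <p ...> tag.
para_start = re.search(r'<p\b[^>]*>([^<]*)', html)

print(line_starts)          # offsets of line starts, not paragraph starts
print(para_start.group(1))  # text immediately after the opening <p> tag
```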
Old 11-15-2023, 06:01 PM   #2
DNSB
Bibliophagist
Posts: 35,464
Karma: 145525534
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
One issue is that you have a mix of curly quotes (around “Ah,” for example) and straight quotes (around "Mostly, it is a matter of temperature control.<br class=" for instance). If I modify the quotes to curly quotes for the content:

Code:
<p class="msonormal3">“Ah,” Pen Rel said again, and inclined his head. “Mostly, it is a matter of temperature control.<br class="calibre13"/>How much simpler, after all, to let the wandering air take the heat away than to condition the dock entire.”</p>
and then run the regex as “.*?”, it finds two items, “Ah,” and “Mostly, it is a matter of temperature control.<br class="calibre13"/>How much simpler, after all, to let the wandering air take the heat away than to condition the dock entire.” which is more or less correct.

You might want to try smartening the punctuation.

BTW, using a <br> to break in the middle of a paragraph is a bad idea. Let the renderer break the lines where it needs to.
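
For anyone who wants to reproduce the two matches outside Sigil, here is a small sketch using Python's `re` module (an assumption: Sigil itself uses PCRE, but lazy matching behaves the same way for this pattern). The variable names are made up for the example.

```python
import re

# The paragraph from this post, with all dialogue quotes made curly.
fixed = (
    '<p class="msonormal3">“Ah,” Pen Rel said again, and inclined his head. '
    '“Mostly, it is a matter of temperature control.<br class="calibre13"/>'
    'How much simpler, after all, to let the wandering air take the heat away '
    'than to condition the dock entire.”</p>'
)

# Lazy match from each opening curly quote to the nearest closing curly quote.
matches = re.findall(r'“.*?”', fixed)

for m in matches:
    print(m)
```

Because curly quotes never appear inside the tags, the straight quotes around attribute values cannot end a match early.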
Old 11-15-2023, 06:31 PM   #3
jwes
Quote:
Originally Posted by DNSB
One issue is that you have a mix of curly quotes (around “Ah,” for example) and straight quotes (around "Mostly, it is a matter of temperature control.<br class=" for instance). If I modify the quotes to curly quotes for the content:

Code:
<p class="msonormal3">“Ah,” Pen Rel said again, and inclined his head. “Mostly, it is a matter of temperature control.<br class="calibre13"/>How much simpler, after all, to let the wandering air take the heat away than to condition the dock entire.”</p>
and then run the regex as “.*?”, it finds two items, “Ah,” and “Mostly, it is a matter of temperature control.<br class="calibre13"/>How much simpler, after all, to let the wandering air take the heat away than to condition the dock entire.” which is more or less correct.

You might want to try smartening the punctuation.
That's what I was doing. They were all straight quotes when I started.

Quote:
BTW, using a <br> to break in the middle of a paragraph is a bad idea. Let the renderer break the lines where it needs to.
This is a Calibre conversion, and Calibre does things like that.
Old 11-15-2023, 09:49 PM   #4
DNSB
Quote:
Originally Posted by DNSB
You might want to try smartening the punctuation.

BTW, using a <br> to break in the middle of a paragraph is a bad idea. Let the renderer break the lines where it needs to.
Quote:
Originally Posted by jwes
That's what I was doing. They were all straight quotes when I started.

This is a Calibre conversion, and Calibre does things like that.
I'm not sure that is a calibre artifact; IMHO, it's a lot more likely that it was in the original document. Since you are editing the file, it's not that big a deal to remove those embedded <br>s.

For the most part, my first pass at smartening punctuation is using the Modify Epub calibre plugin.

Last edited by DNSB; 11-15-2023 at 09:52 PM.
Old 11-16-2023, 06:13 AM   #5
Turtle91
A Hairy Wizard
Posts: 3,095
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Or, since you are using Sigil, the Smarten Punctuation plug-in.
Old 11-16-2023, 03:59 PM   #6
jwes
Quote:
Originally Posted by Turtle91
Or, since you are using Sigil, the Smarten Punctuation plug-in.
Thanks, that is a good suggestion, but I'm mostly concerned that a regex replace with the text flag set can break the html.
Old 11-16-2023, 04:46 PM   #7
KevinH
Sigil Developer
Posts: 7,647
Karma: 5433388
Join Date: Nov 2009
Device: many
The text flag is nothing more than the automatic prepending of the following regex:

static const QString REGEX_OPTION_TEXT_ONLY = "<[^<>]*>(*SKIP)(*F)|";

which can be overruled by later regex you add.

It should only match text outside of < > chars unless overruled by later regex.

Try turning off text and prepending it yourself to try to see what is interfering.
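
To see how the user's part of the pattern can still run into a tag, here is a rough emulation in Python. Python's `re` module has no `(*SKIP)(*F)` verbs, so this sketch swaps in a different technique: tags are consumed by a first alternative and only matches of the user's pattern (group 1) are reported. It is not how Sigil implements the flag, and `text_only_matches` is a hypothetical helper name.

```python
import re

# The paragraph from post #1: straight opening quote, curly closing quote.
paragraph = (
    '<p class="msonormal3">“Ah,” Pen Rel said again, and inclined his head. '
    '"Mostly, it is a matter of temperature control.<br class="calibre13"/>'
    'How much simpler, after all, to let the wandering air take the heat away '
    'than to condition the dock entire.”</p>'
)

def text_only_matches(user_pattern, text):
    """Emulate a text-only search: whole tags are consumed by the first
    alternative, and only matches of the user's pattern are returned."""
    combined = re.compile(r'<[^<>]*>|(' + user_pattern + r')')
    return [m.group(1) for m in combined.finditer(text) if m.group(1) is not None]

found = text_only_matches(r'"(.*?)"', paragraph)
print(found)
```

Tags are only skipped when a match attempt starts on them. Once the user's pattern has started at the straight quote before "Mostly", nothing stops `.*?` from running on to the next straight quote, which is the one inside the `<br>` tag.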
Old 11-16-2023, 06:13 PM   #8
KevinH
Ah! Your use of ? to indicate Minimal Match (non-greedy) will actually *invert* the Greediness of the Minimal Match Regex Option. The Text flag needs its part to default to greedy.

Try replacing the smart quote at the end of "entire." with a normal quote and then remove the ? that toggles the initial Text only regex to be nongreedy.

ie. use "(.*)"

and then make sure the Minimal Match and DotAll are both set in the Regex options and make sure the Text box is checked.

That seems to work.

But using a real parser or the Smarten plugin is probably your best bet here as corner cases will be found.

If you do decide to use Search and Replace with regex, you should first do a Dry Run by holding the Shift key while clicking the Count (#) button, or better yet hold Shift while clicking the Replace All button to see a complete table of the potential replacements, which lets you remove any corner cases (filter those changes out) before proceeding.

Both Dry Run Replace All and Filtered Replace All are newer Find and Replace tools that really help in situations where unspecified corner cases may exist.

Give them a try.

Additionally, making a Checkpoint would not hurt either.

Last edited by KevinH; 11-16-2023 at 06:39 PM.
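
As an illustration of one such corner case, the sketch below shows an alternative pattern (not the one tested in this post) whose quoted body may contain only plain text or whole tags, so a straight quote inside a tag can never close the match. It is written in Python's `re` rather than Sigil's PCRE, and `QUOTED`, `COMBINED`, `smarten` and the sample strings are all hypothetical.

```python
import re

# A straight-quoted span whose body is plain text or whole tags, so the
# closing quote can never be one that sits inside a tag.
QUOTED = r'"((?:[^"<>]|<[^<>]*>)*)"'
# Tags outside quotes are consumed by the first alternative and left alone.
COMBINED = re.compile(r'<[^<>]*>|' + QUOTED)

def smarten(text):
    """Replace straight double quotes around text with curly quotes."""
    def repl(m):
        if m.group(1) is None:   # a bare tag: return it unchanged
            return m.group(0)
        return '“' + m.group(1) + '”'
    return COMBINED.sub(repl, text)

# Mixed quotes: no straight closing quote exists in the text, so nothing changes.
mixed = 'He nodded. "Mostly heat.<br class="c"/>Simple.”'
# Matching straight quotes: the pair is converted and the tag is untouched.
straight = 'He nodded. "Mostly heat.<br class="c"/>Simple."'

print(smarten(mixed))
print(smarten(straight))
```

A Dry Run in Sigil is still the safer check, since a sketch like this ignores entities, comments, and nested quotes.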