MobileRead Forums

Help creating possible Regex-Function (https://www.mobileread.com/forums/showthread.php?t=327750)

MerlinMama 03-01-2020 08:32 AM

Help creating possible Regex-Function
 
**If this shouldn't be here, please move or delete, and you have my apologies**

:help:

I have been trying to understand Python to create my own regex-functions, but even after a year, I'm clueless. I hope that someone can help me create one...or tell me if what I want is even possible.

I have a very long, created text, where the author included a lot of sections where each paragraph is wrapped in tags for italics. That's fine. But they also wrapped all the sections in a tag which automatically makes those paragraphs italic.

I have been trying regular search and replace expressions, but I can't get anything that works whether there is one paragraph or eight paragraphs to remove the italics tags from. It's either too greedy, or not greedy enough.

I would like help to do the following:
  1. select and mark all text between the two tags <form></form>
  2. delete the italics tags from the beginnings and endings of each paragraph
  3. if possible (probably not, but...), change the italics tags INSIDE each paragraph to bold tags

In any case, I'd appreciate being directed to an online tutorial type of place that would be easy enough for me to understand (maybe easier than "Python for Dummies" at the rate I'm going :rofl:) so I can eventually learn to do it myself.

If someone would prefer to help off-forum - messages - I don't mind that either.

stumped 03-01-2020 09:10 AM

do it in two passes.
1 remove italic tags from paragraphs
then
2 remove the section tags

if you post a sample section it will be easier to help

e.g. to remove the form tags
find <form>(.*)</form>
replace with \1

MerlinMama 03-01-2020 09:43 AM

Oh, I want to keep the section tags, just remove the italics from within the section tags (I have over 100 different sections in the text, and they can have either just one paragraph or many paragraphs).

Here's an example:
Spoiler:

Start with:

<form>
<p><em>Blah, blah, blah.</em></p>
<p><em>"Blah, blah-blah, blaaah."</em></p>
</form>

End with:

<form>
<p>Blah, blah, blah.</p>
<p>"Blah, blah-blah, blaaah."</p>
</form>


And more difficult, but if possible:
Spoiler:

Start with:

<form>
<p><em>Blah, blah, blah.</em></p>
<p><em>"Blah, </em>blah-blah<em>, blaaah."</em></p>
</form>

End with:

<form>
<p>Blah, blah, blah.</p>
<p>"Blah, <strong>blah-blah</strong>, blaaah."</p>
</form>

stumped 03-01-2020 10:04 AM

ok for the 1st one
find
<p><em>(.*)</em></p>
replace <p>\1</p>
for the 2nd one, use 2 passes - first remove the not-needed inner bits
which start with a close em tag followed by an open em tag:

find </em>(.*)<em>
replace \1
then use a 2nd pass to change em to strong.
The trick is to use several simple expressions, not one very complicated one, and to review the results after each stage.
Make a backup before risking a Replace All.
If you have the patience, step through with Find/Replace one match at a time; that way you can skip past any you want to leave unchanged.

NB: I do all this in Sigil; the syntax may differ in other tools.
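
A minimal Python sketch of the two-pass idea above, run against the "more difficult" sample from earlier in the thread. Two details are assumptions of this sketch rather than the exact recipe as posted: the inner pass wraps the fragment in <strong> instead of replacing it with a bare \1 (so the emphasis is not lost), and it uses [^<]+ instead of (.*) so a match cannot run across other tags. The function name two_pass is made up for illustration, and nothing here is limited to <form> sections yet.

```python
import re

def two_pass(line):
    """Hypothetical helper: apply both passes to one line of HTML."""
    # Pass 1: '</em>text<em>' becomes '<strong>text</strong>'.
    # [^<]+ only matches plain text, so the match cannot swallow other tags.
    line = re.sub(r"</em>([^<]+)<em>", r"<strong>\1</strong>", line)
    # Pass 2: unwrap a paragraph that is wrapped in <em>.
    # Non-greedy (.*?) stops at the first '</em></p>' it can reach.
    line = re.sub(r"<p><em>(.*?)</em></p>", r"<p>\1</p>", line)
    return line

sample = '<p><em>"Blah, </em>blah-blah<em>, blaaah."</em></p>'
print(two_pass(sample))
# prints: <p>"Blah, <strong>blah-blah</strong>, blaaah."</p>
```

Pass 2 is still naive: a paragraph that merely begins and ends with two separate em runs would be unwrapped incorrectly, so results need checking after each stage.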

theducks 03-01-2020 10:45 AM

If you were doing ALL italics, I would suggest the Edit Spans and Divs plugin.
Don't believe the name: it handles many more tag types, ONE type at a time. (It is Diaps toolbag in Sigil.)

Thumbs up for the advice to stop trying to do it all in ONE PASS. All you do is give Murphy a leg up :rolleyes:

(.*?) reduces the greediness of (.*)

<p><i>What!</i> will happen if this <i>code appears?</i></p> :eek:
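
To make the greedy versus non-greedy point concrete, here is a small Python sketch. The pattern names and the unwrap helper are invented for illustration. It also shows that on the tricky paragraph above, (.*?) alone does not help, because the only place "</i></p>" occurs is at the very end; the third pattern is one possible guard (an assumption of this sketch, not something posted in the thread) that refuses to match when the captured text contains another <i> or </i>.

```python
import re

GREEDY = r"<p><i>(.*)</i></p>"
LAZY = r"<p><i>(.*?)</i></p>"
# Guarded: the captured text may not contain another <i> or </i>.
GUARDED = r"<p><i>((?:(?!</?i>).)*)</i></p>"

def unwrap(pattern, html):
    """Hypothetical helper: strip the <i> wrapper wherever pattern matches."""
    return re.sub(pattern, r"<p>\1</p>", html)

two_paras = "<p><i>one</i></p><p><i>two</i></p>"
tricky = "<p><i>What!</i> will happen if this <i>code appears?</i></p>"

print(unwrap(GREEDY, two_paras))  # <p>one</i></p><p><i>two</p>  (tags now unbalanced)
print(unwrap(LAZY, two_paras))    # <p>one</p><p>two</p>
print(unwrap(LAZY, tricky))       # <p>What!</i> will happen if this <i>code appears?</p>
print(unwrap(GUARDED, tricky))    # unchanged: the guard refuses to match
```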

MerlinMama 03-01-2020 12:13 PM

I think I am misunderstanding you. From what I understand, your suggestions would remove the <em></em> tags regardless of whether they are found between the <form></form> tags, which is NOT what I want. Anything outside the <form></form> tags should stay as is. The <form> tag carries formatting that already includes italics, so the additional <em> tags are unnecessary.

I'll insert another, maybe better, example for you to comment on so I can understand. Although I'm starting to think it can't be done, or I'm just missing something obvious.

I'll mark tags I want to keep in blue, tags I don't want to keep in red (only those I'm asking about; <p> tags I won't touch.)

Spoiler:

<p>What if he was <em>right outside</em> the door? What if he came back into the room?</p>

<p class="centered">oOOo</p>

<form>
<p><em>The creature crept around the room, either missing or ignoring the small boy huddled beneath the covers on the bed. His regular blankets were placed normally, while Kevin was wrapped in the ratty black blanket he found in the treehouse in the woods.</em></p>
<p><em>He watched as the large form ambled to the door and slipped out, allowing the door to ease shut behind it. Kevin had </em>never<em> felt so relieved in his short life.</em></p>
</form>

<p class="centered">oOOo</p>

<p><em>He felt a tear slip from his eye as he came out of the memory. </em></p>

<p class="centered">oOOo</p>

<form>
<p><em>"You'd best go right to sleep, Kevin," Nana scolded. "If you don't, the Great Wolf will come in and eat us all up."</em></p>
<p><em>Poppy frowned. "Don't scare the boy, you old crone." He ushered her out and shut the door behind them.</em>
<p><em>Nana was so silly, there was no such thing as a Great Wolf that ate people, Daddy said so, Kevin thought.</em>
</form>

<p class="centered">oOOo</p>

I don't mind doing multiple passes, but as it is, I haven't been able to do anything except check each one almost individually. That's why I thought that creating a Regex-Function was the way to go.

Ideally, I had hoped to be able to have something that says: "change <p><em></em></p> to just <p></p> when between <form></form> tags". I wouldn't even mind if it was "remove all <em></em> tags when between <form></form> tags".
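
For what it's worth, the "remove all <em></em> tags when between <form></form> tags" wish maps fairly directly onto a regex-function. The sketch below assumes the calibre editor's function mode, where the search expression finds one whole form block and the function returns its replacement; the replace signature is the one I understand the calibre manual to document, so please check it against your version. The search expression (?s)<form>.*?</form> and the names FORM_BLOCK and strip_em_in_forms are assumptions added only so the function can be tried outside the editor.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # match.group() is one whole <form>...</form> block found by the search.
    # Drop every <em> and </em> inside it; text outside the block is never passed in.
    return re.sub(r"</?em>", "", match.group())

# Stand-alone check outside the editor (assumed search expression):
# (?s) lets '.' cross line breaks, and .*? stops at the first </form>.
FORM_BLOCK = re.compile(r"(?s)<form>.*?</form>")

def strip_em_in_forms(html):
    """Hypothetical driver that mimics the editor calling replace() per match."""
    return FORM_BLOCK.sub(lambda m: replace(m, 0, "", None, None, None, None), html)

sample = (
    "<p>What if he was <em>right outside</em> the door?</p>\n"
    "<form>\n"
    "<p><em>Kevin had </em>never<em> felt so relieved.</em></p>\n"
    "</form>\n"
    "<p><em>He felt a tear slip from his eye.</em></p>"
)
print(strip_em_in_forms(sample))
```

Only the paragraph inside the form loses its em tags; the paragraphs before and after keep theirs. In the editor itself, only the replace function would be pasted in, with the search expression typed into the Find box.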

stumped 03-01-2020 12:33 PM

Well, that is doable. Just extend the previous suggestion so it matches only inside a form: an opening form tag, followed by a non-greedy .*, followed by the previous expression, followed by another .*, followed by the closing form tag. I am typing on a tablet and lack the characters to show sample code.

stumped 03-01-2020 12:46 PM

PS: you put brackets around the text fragments you want to keep, so you can refer to them as \1 \2 \3 in the replace formula.
So (back on a PC keyboard now) find
<form>(.*)<em>(.*)</em>(.*)</form>

That finds text in em tags that sit within form tags, and gives you 3 text fragments which will be preserved.
Now assemble how you want it to look without the em tags, so replace with
<form>\1\2\3</form>
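
A rough Python sketch of how this expression behaves when repeated. Each match removes only one em pair per form, so the replace has to be run until nothing changes. As an assumption of this sketch (not part of the expression as posted), each (.*) is swapped for a tempered wildcard that cannot cross a </form>, which stops a match from wandering out of one section and into the next. The names INSIDE, PATTERN and remove_em_by_passes are invented for illustration.

```python
import re

# Tempered wildcard: any run of characters that never crosses a </form>.
INSIDE = r"(?:(?!</form>).)*?"
PATTERN = re.compile(
    rf"<form>({INSIDE})<em>({INSIDE})</em>({INSIDE})</form>", re.DOTALL
)

def remove_em_by_passes(html):
    """Repeat the find/replace until no form contains an em pair.
    Returns the cleaned text and the number of passes that changed something."""
    passes = 0
    while True:
        html, changed = PATTERN.subn(r"<form>\1\2\3</form>", html)
        if changed == 0:
            return html, passes
        passes += 1

sample = (
    "<form>\n<p><em>a</em></p>\n<p><em>b</em></p>\n</form>\n"
    "<p><em>outside</em></p>"
)
cleaned, passes = remove_em_by_passes(sample)
print(passes)   # 2: one em pair per form is removed on each pass
print(cleaned)
```

The paragraph outside the form keeps its em tags. In an editor this corresponds to pressing Replace All repeatedly until the match count drops to zero.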

MerlinMama 03-01-2020 01:44 PM

THANK YOU
 
:thanks:
I knew I was missing something simple. Leave it to me to complicate everything. I was using something similar to that (<form>(.*?)<em>|</em>), but as you can see, I would need multiple passes, and it would also jump to different sections on occasion. I got very annoyed and frustrated.

Using your expression, even if I still have to check, it will be immensely easier. I can also tweak it to only remove at the beginning and end of paragraphs, and then go through and change the other <em> tags to <strong> tags.

I'm babbling. Sorry.

Thanks again!:thanks:

stumped 03-01-2020 01:58 PM

Instead of \1\2\3 for the replace you can optionally put new tags either side of the \2

E.g. \1<strong>\2</strong>\3

stumped 03-01-2020 02:03 PM

Ps I only know a small subset of what can be done with regex, mostly learned by asking here!
You got lucky in that you wanted something similar to what I had done in another book tweak.

I use the Sigil editor rather than the calibre one, and I think there is a thread of regex examples in the Sigil forum.

JSWolf 03-02-2020 04:47 PM

However, the change from <em> to <strong> can be done with Diaps Editing Toolbag editor plugin. You do have to configure it to add in strong to what em can be replaced with. And once done, you won't need regex.

Brett Merkey 03-02-2020 07:40 PM

One thing to consider in similar situations such as presented by the OP is to use CSS contextual selectors rather than subject the text to regex.

As I understand, there was a problem with italicized text within forms. A style rule could deal with that instantly:

form em {font-style: normal}

This would un-italicize anything within em tags which are in a form—while ignoring all other em tag content.

stumped 03-03-2020 03:04 AM

That's useful to know, and interesting.
I don't think I have ever seen a <form> tag when tweaking novels though.
What is the proper/normal use of <form> in book CSS?

Google tells me that <form> in HTML is used for user input forms, which I guessed would be the case, but that makes no sense in an EPUB?

Brett Merkey 03-03-2020 06:53 AM

Quote:

don't think I have ever seen a <form> tag when tweaking novels though
LOL. I deliberately bit my tongue and did not venture into that issue. However, we can encounter all sorts of beasts in the HTML jungle. I once corrected a book that was done entirely in classed and nested <blockquote> tags. That was a fun learning experience, since blockquote tags can be nested but <p> tags cannot...


All times are GMT -4.