MobileRead Forums

meme 02-05-2012 01:56 PM

Regex examples
 
I'd like to see if I can collect Regular Expressions (PCRE format as introduced in Sigil 0.5.0) used for common or difficult issues, and maybe add them to the FAQ, etc. Partly so I can have a list to refer to when needed, but also to collect a lot of what's probably already been mentioned in this forum. And maybe to find out if there isn't a way to do a replacement that's needed.

For instance, is there a regex to do other types of replacement but only inside body tags?

Is there one only for the actual text - words not part of a tag name or attribute? Or one for words that are only part of a tag name or attribute?

If you have any suggestions for the above cases, or any other useful regular expressions, please post them.

Timur 02-05-2012 08:15 PM

Matches regex inside body element and inside character data only.
(The first negative lookahead, which enforces the character-data requirement, works even though the specification does not require the greater-than sign to be escaped in character data. But you have to save the epub at least once in Sigil and then reload it so that all greater-than signs are escaped to &gt; , or else you might miss some matches.)
(The second negative lookahead will not work if your document has more than one body element. Sigil allows this, but the W3C validator reports an error for such documents. I do not know the strict specification for multiple body elements.)
Code:

(?s)regex(?![^<>]*>)(?!.*<body[^>]*>)
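
To see how the two negative lookaheads behave, here is a minimal sketch in Python. Python's `re` module is only a stand-in for Sigil's PCRE engine here, and the sample markup and the search word `old` (in place of the `regex` placeholder) are made up for the demonstration.

```python
import re

# Hypothetical sample document; "old" stands in for the "regex" placeholder.
doc = ('<html><head><title>old</title></head>'
       '<body class="old"><p>old text</p></body></html>')

# (?![^<>]*>)        -> not inside a tag: no '>' can be reached without
#                       first crossing another '<' or '>'
# (?!.*<body[^>]*>)  -> no opening body tag later in the file, so the
#                       match position is already inside body
pattern = re.compile(r'(?s)old(?![^<>]*>)(?!.*<body[^>]*>)')

result = pattern.sub('new', doc)
print(result)
```

In this sample only the `old` in the paragraph text is replaced; the one in the title (before body) and the one in the class attribute (inside a tag) are left alone.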
Matches regex only inside attribute values.
(If your document uses single quotes (apostrophes) anywhere as attribute value delimiters instead of double quotes, again, save and reload to change them all to double quotes, so that this regex works reliably. Saving and reloading also escapes all quotes inside attribute values to &quot; , so that your elements stay well-formed. Reloading also escapes all greater-than signs; otherwise you risk matching something inside character data.)
Code:

regex(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)
Edit: Typo.
Edit 2: Added clarification in bold.
Edit 3: Slight simplification in the second code.
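
A small Python sketch of the attribute-value expression, again using `re` only as a stand-in for Sigil's PCRE engine. The sample markup is hypothetical, and `regex` is treated as a literal search word.

```python
import re

# Hypothetical sample; "regex" appears once in an attribute value
# and once in the link text.
doc = '<p><a href="regex.html" title="x">regex</a></p>'

# (?=[^<]*>)  -> a '>' follows before any '<', so we are inside a tag
# (?!...)     -> the rest of the tag is NOT a run of complete "..." pairs,
#                which means a quote is still open: we are inside a value
pattern = re.compile(r'regex(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)')

positions = [m.start() for m in pattern.finditer(doc)]
print(positions)
```

Only the occurrence inside the href value is reported. One thing this sketch suggests checking in your own files: because the negative lookahead needs at least one pair of quotes, a search word that equals the name of a tag with no attributes would presumably still match.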

JeremyR 02-05-2012 08:32 PM

This one uses the old Sigil regex format, but I find it very useful. I use it on a document that started out as plain text, so it is not broken up into chapters even though the chapters are labeled. It finds the chapter labels, marks them up as headings, and adds the chapter break markers.

For books with Chapter I, Chapter II, and so on (Roman Numerals) or with Chapter 1, Chapter 2. (And of course, use it in code view)

Search for

CHAPTER [0-9XVI]+

And replace with

<hr class="sigilChapterBreak" /><h3>\0</h3>

On occasion it will find a phrase like "chapter in" in the text, but that's pretty rare (and just check the TOC before having it split)
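
The same search and replace can be tried outside Sigil with a short Python sketch. The sample text is invented, and Python writes the whole-match backreference as `\g<0>` rather than `\0`. The case-insensitive flag and the trailing `\b` are assumptions added here to reproduce and then avoid the "chapter in" false positive; they are not part of the post above.

```python
import re

# Hypothetical text converted from a plain-text book.
text = ("CHAPTER I\nIt began.\n"
        "CHAPTER 12\nIn this chapter in which nothing happens.")

# \g<0> is Python's spelling of "the whole match" (the post uses \0).
replacement = r'<hr class="sigilChapterBreak" /><h3>\g<0></h3>'
marked = re.sub(r'CHAPTER [0-9XVI]+', replacement, text)
print(marked)

# Assumption: the false positive shows up when matching ignores case.
loose = re.findall(r'CHAPTER [0-9XVI]+', text, flags=re.IGNORECASE)
# A word boundary after the numeral (an addition, not from the post)
# rejects "chapter in" because the "i" is followed by another letter.
strict = re.findall(r'CHAPTER [0-9XVI]+\b', text, flags=re.IGNORECASE)
print(loose, strict)
```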

WS64 02-06-2012 07:26 AM

Quote:

Originally Posted by meme (Post 1954109)
Is there one only for the actual text - words not part of a tag name or attribute?

F: (>[^<]*)old
R: \1new
(If the text contains > or < this will go wrong, but Sigil cleans the code up so it should work.)
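
A quick Python sketch of this find/replace pair, with made-up markup. It also shows a side effect worth knowing about: a single pass changes only the last `old` in each run of text, so the replacement may need to be repeated. The loop is an assumption about how one might handle that, not something from the post.

```python
import re

# Hypothetical markup: "old" occurs in an attribute and twice in the text.
html = '<p class="old">old and old</p>'

pattern = re.compile(r'(>[^<]*)old')

# One pass: the greedy [^<]* swallows the earlier "old" into group 1,
# so only the last one in the text run is replaced.
once = pattern.sub(r'\1new', html)

# Repeat until nothing changes. This assumes the replacement text does
# not itself contain the search word, otherwise the loop would not end.
result, count = html, 1
while count:
    result, count = pattern.subn(r'\1new', result)

print(once)
print(result)
```

The attribute value is never touched, because the match has to start at a `>` and may not cross a `<`.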

WS64 02-06-2012 07:29 AM

I'm collecting the regular expressions I often use here; I guess at least some of them are interesting for others too:
http://ws64.com/regex/ (this page is live, so it may change at any time, and of course, use at your own risk!)

DiapDealer 02-20-2012 06:10 PM

I can usually bang my head against something long enough to figure it out, but I'm giving up and looking for help...

I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics".

So I'm looking for an expression that will find:
Code:

<span class="italics">This is three words</span>
or:
Code:

<span class="italics">This is a great big bunch of words, maybe even a whole dang paragraph</span>
but not:
Code:

<span class="italics">one</span>
I know I'm probably missing something obvious and am going to feel like a complete idiot when I see the answer. I don't even really care what, exactly, gets highlighted, as I won't be using it to replace anything... only to manually eyeball the occurrence.

theducks 02-20-2012 06:23 PM

Quote:

Originally Posted by DiapDealer (Post 1973410)
I can usually bang my head against something long enough to figure it out, but I'm giving up and looking for help...

I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics".

So I'm looking for an expression that will find:
Code:

<span class="italics">This is three words</span>
or:
Code:

<span class="italics">This is a great big bunch of words, maybe even a whole dang paragraph</span>
but not:
Code:

<span class="italics">one</span>
I know I'm probably missing something obvious and am going to feel like a complete idiot when I see the answer. I don't even really care what, exactly, gets highlighted, as I won't be using it to replace anything... only to manually eyeball the occurrence.

Keyword Quantifier :D
Code:

<span class="italics">(\w+){2,}</span>
2 or more

DiapDealer 02-20-2012 07:48 PM

Quote:

Originally Posted by theducks (Post 1973428)
Keyword Quantifier :D
Code:

<span class="italics">(\w+){2,}</span>
2 or more

That seems to be finding all occurrences of <span class="italics"></span> that enclose 2 or more word characters. And it's still returning one-word instances, while skipping things like:
Code:

<span class="italics">Three weeks later</span></p>
And definitely skipping multiple word instances that contain punctuation and/or quotes:
Code:

<span class="italics">Well, dammit, it’s been two days.</span>
What else ya got? :D
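
The behaviour described above can be reproduced with a short Python sketch (Python's `re` standing in for Sigil's PCRE engine; the samples are the ones quoted in this thread).

```python
import re

pattern = re.compile(r'<span class="italics">(\w+){2,}</span>')

samples = [
    '<span class="italics">one</span>',
    '<span class="italics">Three weeks later</span></p>',
    '<span class="italics">Well, dammit, it’s been two days.</span>',
]

# {2,} repeats the group (\w+), and each repetition may be as short as one
# character, so "one" matches as e.g. "on" + "e". \w never matches a space
# or punctuation, so the multi-word spans cannot match at all.
hits = [bool(pattern.search(s)) for s in samples]
print(hits)
```

So the quantifier is counting chunks of word characters, not words separated by spaces.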

tilia 02-20-2012 09:06 PM

What about:

Code:

<span class="italics">"?\w+,?\.?\s

theducks 02-20-2012 09:23 PM

Code:

(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)

DiapDealer 02-20-2012 11:14 PM

Sorry theducks. That's some nice lookaheads/lookbehinds you have going on there, but I need it to include multi-word phrases that may have punctuation (“”‘’.,?!:;-—…).

Tilia's takes me to a lot more of what I'm looking for, but seems to skip multi-word phrases that have an apostrophe in the very first word.

Basically, I've accidentally blown up a lot of the italic spans in a document of mine. I'd like to be able to reliably find every instance of <span class="italics">(.*?)</span> EXCEPT instances where there's only one word in the span. That would narrow what I need to check from 700 instances down to... well... I'm not exactly sure of the number (obviously), but it would be a heck of a lot less than 700, anyway. ;)

theducks 02-21-2012 12:17 AM

Quote:

Originally Posted by DiapDealer (Post 1973697)
Sorry theducks. That's some nice lookaheads/lookbehinds you have going on there, but I need it to include multi-word phrases that may have punctuation (“”‘’.,?!:;-—…).

Tilia's takes me to a lot more of what I'm looking for, but seems to skip multi-word phrases that have an apostrophe in the very first word.

Basically, I've accidentally blown up a lot of the italic spans in a document of mine. I'd like to be able to reliably find every instance of <span class="italics">(.*?)</span> EXCEPT instances where there's only one word in the span. That would narrow what I need to check from 700 instances down to... well... I'm not exactly sure of the number (obviously), but it would be a heck of a lot less than 700, anyway. ;)

The trick I found:
there are 1 to n cases of a word followed by a space, AND then a single word with no space. I don't know if [:punct:] will find the em dash and ellipsis.
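
As an illustration of that "word plus space, repeated, then a final word" idea, here is the lookaround expression from the earlier post in a small Python sketch. The helper name `enclosed` is hypothetical, and Python's `re` is only a stand-in for Sigil's engine.

```python
import re

# (\w+ ){1,}  one or more "word followed by a space"
# (\w+)       then a final word with no trailing space
# The lookbehind/lookahead keep the span tags out of the match itself.
pattern = re.compile(r'(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)')

def enclosed(html):
    """Return the matched span text, or None if the span does not qualify."""
    m = pattern.search(html)
    return m.group(0) if m else None

print(enclosed('<span class="italics">This is three words</span>'))
print(enclosed('<span class="italics">one</span>'))
print(enclosed('<span class="italics">Well, dammit, it’s been two days.</span>'))
```

The first call returns the enclosed text; the other two return None, the last one because `\w` does not cover the comma, apostrophe, or full stop.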

davidfor 02-21-2012 12:35 AM

How about:

Code:

<span class="italics">\w+\s+.*</span>
That seems to work in my tests. There is an issue with greediness as I happened to have a paragraph with two multiword italic sections in my test book. The search worked but it selected the two italic sections and everything between them. But it didn't find any of the single word italics.

Timur 02-21-2012 01:45 AM

@davidfor: Add (?U) in front of your regexp for lazy matching.
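
The greediness issue is easy to reproduce in a Python sketch. Python's `re` module does not offer PCRE's `(?U)` ungreedy switch, so the lazy quantifier `.*?` is used here instead; the sample paragraph is made up.

```python
import re

# Hypothetical paragraph with two multi-word italic spans.
html = ('<p><span class="italics">first one</span> and '
        '<span class="italics">second one</span></p>')

# Greedy .* runs on to the LAST </span> in the paragraph.
greedy = re.findall(r'<span class="italics">\w+\s+.*</span>', html)

# Lazy .*? stops at the first </span> it can reach.
lazy = re.findall(r'<span class="italics">\w+\s+.*?</span>', html)

print(len(greedy), len(lazy))
```

The greedy form returns one long match covering both spans and the text between them; the lazy form returns each span separately.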

DiapDealer 02-21-2012 08:13 AM

Correct me if I'm wrong, davidfor, but won't that still exclude instances where the very first word contains an apostrophe, or any other non-word character? It gets me very close, certainly, but I just don't think what I'm looking for is going to be based on "\w". The potential for too many non-word characters being present in the words (including the first one) is just too great.

I'm not so much concerned with the greediness of the expression (as I'm not blindly replacing anything with it) as I am with seeing every single non-one-word occurrence... regardless if that occurrence contains non-word characters or not.
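
One possible direction, offered only as a sketch and not something proposed in the posts above: instead of "\w", treat a word as any run of characters that is neither whitespace nor a tag, and require at least one whitespace character inside the span. The pattern and samples below are assumptions for illustration, and they assume the span holds plain text with no nested tags.

```python
import re

# [^<\s]+  first "word": anything except whitespace and '<'
#          (so apostrophes, quotes, dashes and so on are fine)
# \s       at least one whitespace character, so a second word follows
# [^<]*    the rest of the span's text, never crossing into another tag
pattern = re.compile(r'<span class="italics">[^<\s]+\s[^<]*</span>')

samples = [
    '<span class="italics">one</span>',
    '<span class="italics">it’s</span>',
    '<span class="italics">’Twas a dark night</span>',
    '<span class="italics">Well, dammit, it’s been two days.</span>',
]

multiword = [s for s in samples if pattern.search(s)]
print(multiword)
```

Only the last two samples are reported. A single word followed by a stray trailing space would also be reported, which may or may not matter for a manual check.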


All times are GMT -4. The time now is 07:52 PM.
