Old 02-05-2012, 12:56 PM   #1
meme
Sigil developer
Regex examples

I'd like to collect regular expressions (in the PCRE format introduced in Sigil 0.5.0) for common or difficult issues, and maybe add them to the FAQ. Partly this is so I have a list to refer to when needed, but also to gather what has probably already been mentioned in this forum, and maybe to find out whether there is a way to do a replacement that's needed.

For instance, is there a regex to do other types of replacement but only inside body tags?

Is there one that matches only the actual text, that is, words that are not part of a tag name or attribute? Or one that matches only words that are part of a tag name or attribute?

If you have any suggestions for the above cases, or any other useful regular expressions, please post them.
Old 02-05-2012, 07:15 PM   #2
Timur
Connoisseur
This matches regex only inside the body element and only inside character data.
(The first negative lookahead, the character-data requirement, works even though the specification does not require the greater-than sign to be escaped in character data. However, you have to save the epub at least once in Sigil and then reload it so that all greater-than signs are escaped to &gt;, or else you might miss some matches.)
(The second negative lookahead will not work if your document has more than one body element. Sigil allows this, but the W3C validator reports an error for such documents. I do not know the strict specification for multiple body elements.)
Code:
(?s)regex(?![^<>]*>)(?!.*<body[^>]*>)
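To see how the two negative lookaheads behave outside Sigil, here is a minimal Python sketch. It assumes Python's `re` module treats this particular pattern the same way Sigil's PCRE engine does, and it substitutes the literal word `old` for `regex`. The sample markup and the function name `replace_in_body_text` are made up for illustration.

```python
import re

# "old" stands in for the "regex" placeholder in the pattern above.
# (?![^<>]*>)       -> the next angle bracket is not ">", so we are not inside a tag
# (?!.*<body[^>]*>) -> no opening <body> tag follows, so we are already inside body
BODY_TEXT = re.compile(r'(?s)old(?![^<>]*>)(?!.*<body[^>]*>)')

def replace_in_body_text(html, new):
    """Replace "old" only where it appears as character data inside <body>."""
    return BODY_TEXT.sub(new, html)

sample = ('<html><head><title>old</title></head>'
          '<body class="old"><p>old text</p></body></html>')
print(replace_in_body_text(sample, 'new'))
```

In this sample only the word in the paragraph changes; the title (before the body tag) and the class attribute are left alone.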
This matches regex only inside attribute values.
(If your document uses single quotes (apostrophes) anywhere as attribute value delimiters instead of double quotes, again, save and reload to change them all to double quotes so that this regex works reliably. Saving and reloading also escapes all quotes inside attribute values to &quot;, so that your elements stay well-formed, and escapes all greater-than signs; otherwise you risk matching something inside character data.)
Code:
regex(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)
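A small Python sketch of the attribute-value pattern, again with `old` standing in for `regex` and with made-up sample markup. It assumes Python's `re` module evaluates the pattern like Sigil's PCRE engine. One caveat seen in this sketch: the pattern also matches a tag name when the tag carries no quoted attributes (for example `</old>`), so treat it as a starting point rather than a guarantee.

```python
import re

# (?=[^<]*>)  -> a ">" comes before the next "<", so we are inside a tag
# (?!(?:[^<"]*"[^<"]*")+\s*/?>) -> the rest of the tag is not a run of complete
#                                  quote pairs, so we are inside a quoted value
ATTR_VALUE = re.compile(r'old(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)')

sample = '<p class="old" title="an old one">old</p>'
result = ATTR_VALUE.sub('new', sample)
print(result)

# Caveat: a tag name in a tag without quoted attributes also matches.
caveat = ATTR_VALUE.search('</old>')
```

The two attribute values are changed and the paragraph text is not.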
Edit: Typo.
Edit 2: Added clarification in bold.
Edit 3: Slight simplification in the second code.

Last edited by Timur; 02-05-2012 at 07:35 PM.
Old 02-05-2012, 07:32 PM   #3
JeremyR
Guru
This one uses the old format of Sigil regex, but I find it very useful. I use it on a document that came from plain text, and so is not broken up into chapters but has them labeled, to find the chapter labels, highlight them, and add the break marks.

It works for books with Chapter I, Chapter II, and so on (Roman numerals) or with Chapter 1, Chapter 2. (And of course, use it in Code View.)

Search for

CHAPTER [0-9XVI]+

And replace with

<hr class="sigilChapterBreak" /><h3>\0</h3>

On occasion it will find a phrase like "chapter in" in the text, but that's pretty rare (just check the TOC before having it split).
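For anyone who wants to try the same search and replace outside Sigil, here is a rough Python sketch. It assumes the search runs case-insensitively (which the "chapter in" false positive suggests), and it adds a trailing `\b` that is not part of the original search to reject that false positive. Python writes the whole match as `\g<0>` rather than `\0`. The function name `add_chapter_breaks` is made up.

```python
import re

# The trailing \b is an addition to the search above: it refuses a match
# that stops in the middle of a word, as in "chapter in".
CHAPTER = re.compile(r'CHAPTER [0-9XVI]+\b', re.IGNORECASE)

def add_chapter_breaks(text):
    """Wrap each chapter label in <h3> and put a Sigil chapter break before it."""
    return CHAPTER.sub(r'<hr class="sigilChapterBreak" /><h3>\g<0></h3>', text)

print(add_chapter_breaks('CHAPTER IV\nIt was the next chapter in her life.'))
```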

Last edited by JeremyR; 02-05-2012 at 07:36 PM.
Old 02-06-2012, 06:26 AM   #4
WS64
Quote:
Originally Posted by meme
Is there one only for the actual text - words not part of a tag name or attribute?
F: (>[^<]*)old
R: \1new
(If the text itself contains > or < this will go wrong, but Sigil cleans the code up, so it should work.)
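One thing to watch with this pair: `[^<]*` is greedy, so a single substitution pass changes only the last "old" in each run of text. The Python sketch below repeats the substitution until nothing changes. It has not been checked against Sigil's own Replace All, and the function name `replace_in_text` is made up.

```python
import re

# F: (>[^<]*)old    R: \1new
FIND = re.compile(r'(>[^<]*)old')

def replace_in_text(html):
    """Apply the find/replace pair repeatedly until no match is left."""
    while True:
        html, count = FIND.subn(r'\1new', html)
        if count == 0:
            return html

sample = '<p class="old">old and old</p>'
single_pass = FIND.sub(r'\1new', sample)
all_passes = replace_in_text(sample)
print(single_pass)
print(all_passes)
```

The loop terminates here because the replacement text does not itself contain "old".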
Old 02-06-2012, 06:29 AM   #5
WS64
I'm collecting the regular expressions I often use here; I guess at least some of them are interesting for others too:
http://ws64.com/regex/ (this is a living page, so it may change at any time, and of course, use at your own risk!)
Old 02-20-2012, 05:10 PM   #6
DiapDealer
Grand Sorcerer
I can usually bang my head against something long enough to figure it out, but I'm giving up and looking for help...

I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics".

So I'm looking for an expression that will find:
Code:
<span class="italics">This is three words</span>
or:
Code:
<span class="italics">This is a great big bunch of words, maybe even a whole dang paragraph</span>
but not:
Code:
<span class="italics">one</span>
I know I'm probably missing something obvious and am going to feel like a complete idiot when I see the answer. I don't even really care what, exactly, gets highlighted, as I won't be using it to replace anything... only to manually eyeball the occurrence.
Old 02-20-2012, 05:23 PM   #7
theducks
Well trained by Cats
Quote:
Originally Posted by DiapDealer
I can usually bang my head against something long enough to figure it out, but I'm giving up and looking for help...

I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics".

So I'm looking for an expression that will find:
Code:
<span class="italics">This is three words</span>
or:
Code:
<span class="italics">This is a great big bunch of words, maybe even a whole dang paragraph</span>
but not:
Code:
<span class="italics">one</span>
I know I'm probably missing something obvious and am going to feel like a complete idiot when I see the answer. I don't even really care what, exactly, gets highlighted, as I won't be using it to replace anything... only to manually eyeball the occurrence.
Keyword Quantifier
Code:
<span class="italics">(\w+){2,}</span>
2 or more
Old 02-20-2012, 06:48 PM   #8
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by theducks
Keyword Quantifier
Code:
<span class="italics">(\w+){2,}</span>
2 or more
That seems to be finding all occurrences of <span class="italics"></span> that enclose 2 or more word characters. And it's still returning one-word instances, while skipping things like:
Code:
<span class="italics">Three weeks later</span></p>
And definitely skipping multiple word instances that contain punctuation and/or quotes:
Code:
<span class="italics">Well, dammit, it’s been two days.</span>
What else ya got?
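The behaviour described above can be reproduced with a short Python sketch, assuming Python's `re` module handles this pattern like Sigil's engine. The sample strings are taken from this thread.

```python
import re

QUANTIFIED = re.compile(r'<span class="italics">(\w+){2,}</span>')

one_word = '<span class="italics">one</span>'
three_words = '<span class="italics">Three weeks later</span></p>'

# \w+ can match "on" and then "e", so {2,} is satisfied by a single word.
matches_one_word = QUANTIFIED.search(one_word) is not None
# \w never matches a space, so the multi-word span is not found.
matches_three_words = QUANTIFIED.search(three_words) is not None

print(matches_one_word, matches_three_words)
```

In other words, `{2,}` here counts repetitions of the group, and each repetition can be as small as one word character, so it does not count words.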
Old 02-20-2012, 08:06 PM   #9
tilia
Evangelist
What about:

Code:
<span class="italics">"?\w+,?\.?\s

Last edited by tilia; 02-20-2012 at 08:22 PM. Reason: Typo
Old 02-20-2012, 08:23 PM   #10
theducks
Well trained by Cats
Code:
(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)
Old 02-20-2012, 10:14 PM   #11
DiapDealer
Grand Sorcerer
Sorry theducks. That's some nice lookaheads/lookbehinds you have going on there, but I need it to include multi-word phrases that may have punctuation (“”‘’.,?!:;-—…).

Tilia's takes me to a lot more of what I'm looking for, but seems to skip multi-word phrases that have an apostrophe in the very first word.

Basically, I've accidentally blown up a lot of the italic spans in a document of mine. I'd like to be able to reliably find every instance of <span class="italics">(.*?)</span> EXCEPT instances where there's only one word in the span. That would narrow what I need to check from 700 instances down to... well... I'm not exactly sure of the number (obviously), but it would be a heck of a lot less than 700, anyway.
Old 02-20-2012, 11:17 PM   #12
theducks
Well trained by Cats
Quote:
Originally Posted by DiapDealer
Sorry theducks. That's some nice lookaheads/lookbehinds you have going on there, but I need it to include multi-word phrases that may have punctuation (“”‘’.,?!:;-—…).

Tilia's takes me to a lot more of what I'm looking for, but seems to skip multi-word phrases that have an apostrophe in the very first word.

Basically, I've accidentally blown up a lot of the italic spans in a document of mine. I'd like to be able to reliably find every instance of <span class="italics">(.*?)</span> EXCEPT instances where there's only one word in the span. That would narrow what I need to check from 700 instances down to... well... I'm not exactly sure of the number (obviously), but it would be a heck of a lot less than 700, anyway. ;)
The trick I found:
there are 1 to n cases of a word followed by a space, AND then a single word with no space. I don't know if [:punct:] will find the em dash and the ellipsis.
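The "word plus space, repeated, then a final word" idea from the earlier lookaround pattern can be tried in Python as follows. This is a sketch that assumes Python's `re` module agrees with Sigil's engine for this pattern; the sample strings come from this thread.

```python
import re

# One or more "word + space" groups, then a last word with no trailing space,
# all sitting between the opening and closing span tags.
WORDS_ONLY = re.compile(r'(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)')

def matched_text(html):
    """Return the matched phrase, or None when the span does not qualify."""
    found = WORDS_ONLY.search(html)
    return found.group(0) if found else None

plain = matched_text('<span class="italics">This is three words</span>')
single = matched_text('<span class="italics">one</span>')
punctuated = matched_text('<span class="italics">Well, dammit, it’s been two days.</span>')
print(plain, single, punctuated)
```

Single words are skipped as intended, but so is any phrase containing punctuation, which is the limitation raised earlier.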
Old 02-20-2012, 11:35 PM   #13
davidfor
Grand Sorcerer
How about:

Code:
<span class="italics">\w+\s+.*</span>
That seems to work in my tests. There is an issue with greediness as I happened to have a paragraph with two multiword italic sections in my test book. The search worked but it selected the two italic sections and everything between them. But it didn't find any of the single word italics.
Old 02-21-2012, 12:45 AM   #14
Timur
Connoisseur
@davidfor: Add (?U) in front of your regexp for lazy matching.
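The greedy versus lazy difference can be illustrated in Python. Note that Python's `re` module has no `(?U)` flag, so this sketch writes the lazy quantifier `.*?` directly instead; the sample paragraph is made up.

```python
import re

GREEDY = re.compile(r'<span class="italics">\w+\s+.*</span>')
LAZY = re.compile(r'<span class="italics">\w+\s+.*?</span>')

sample = ('<p><span class="italics">first phrase</span> and '
          '<span class="italics">second phrase</span></p>')

# Greedy .* runs to the last </span> on the line: one long match.
greedy_hits = GREEDY.findall(sample)
# Lazy .*? stops at the first </span>: one match per span.
lazy_hits = LAZY.findall(sample)

print(len(greedy_hits), len(lazy_hits))
```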
Old 02-21-2012, 07:13 AM   #15
DiapDealer
Grand Sorcerer
Correct me if I'm wrong, davidfor, but won't that still exclude instances where the very first word contains an apostrophe, or any other non-word character? It gets me very close, certainly, but I just don't think what I'm looking for is going to be based on "\w". The potential for too many non-word characters being present in the words (including the first one) is just too great.

I'm not so much concerned with the greediness of the expression (as I'm not blindly replacing anything with it) as I am with seeing every single non-one-word occurrence... regardless if that occurrence contains non-word characters or not.
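One possible direction that avoids "\w" altogether, offered only as a sketch and not something posted in this thread: treat a "word" as any run of characters that is neither whitespace nor "<", and require whitespace followed by a second such run. It assumes the span contains no nested tags, and it is written for Python's `re` module rather than tested in Sigil.

```python
import re

# [^<\s]+   first "word": anything except whitespace and "<" (so ’ and , are fine)
# \s+       at least one whitespace character
# [^<\s]    the start of a second "word"
# [^<]*     the rest of the span text, stopping at the next tag
MULTI_WORD = re.compile(r'<span class="italics">[^<\s]+\s+[^<\s][^<]*</span>')

samples = [
    '<span class="italics">one</span>',
    '<span class="italics">don’t</span>',
    '<span class="italics">it’s late</span>',
    '<span class="italics">Well, dammit, it’s been two days.</span>',
]
hits = [s for s in samples if MULTI_WORD.search(s)]
print(hits)
```

Because `[^<]*` cannot cross a tag, each match stays inside its own span, so greediness is not an issue here.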