Regex examples - Page 34

Toxaris · 07-27-2016, 11:01 AM

Quote:

Originally Posted by ReaderRabbit

OK, here is a simple question for ya. In Sigil (0.7.4), I have a book where there is no separation between sentences. I am using this to find them: ([a-z])([\.\,\?\!])([A-Z])
which works perfectly. But what do I use in replace to move the new sentence over one space? There are over 3500 matches and I don't want to insert a space manually for that many errors. Any suggestions?

How about: \1\2 \3
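For anyone who wants to try this outside Sigil first, here is a minimal sketch in Python. The sample sentence is made up, and Python's `re` module stands in for Sigil's regex engine, so treat it as an illustration of the three capture groups rather than of Sigil itself.

```python
import re

# Group 1: last lowercase letter of the sentence
# Group 2: the punctuation mark
# Group 3: first capital letter of the next sentence
pattern = re.compile(r'([a-z])([.,?!])([A-Z])')

# Made-up sample text with the missing spaces
text = 'It was late.She left,However he stayed.Why?Nobody knew.'

# Put the three groups back, with one space before the third
fixed = pattern.sub(r'\1\2 \3', text)
print(fixed)
```

The replacement only adds the space; all three captured characters are written back unchanged.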

theducks · 07-27-2016, 11:28 AM

that might miss those that start or end with quotes

Code:

 ([a-z])([\.\,\?\!]["]*)(["]*[A-Z])

(I only show straight quotes. 0 or 1)
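A quick way to see what the quote-aware version does is to run it on two made-up samples in Python (an assumption for illustration, Sigil's engine is not Python's `re`). The replacement is still `\1\2 \3`.

```python
import re

# theducks' pattern: an optional straight quote may follow the punctuation
# (group 2) or precede the capital letter (group 3)
pattern = re.compile(r'([a-z])([.,?!]["]*)(["]*[A-Z])')

closing = pattern.sub(r'\1\2 \3', '"I am done."Then he left.')
opening = pattern.sub(r'\1\2 \3', 'He asked,"Why not?"')

print(closing)  # "I am done." Then he left.
print(opening)  # He asked," Why not?"
```

Because the first `["]*` is greedy, a quote that sits between the punctuation and the capital always ends up in group 2. That is right for a closing quote but puts the space on the wrong side of an opening quote, so those cases may still need a manual check.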

eschwartz · 07-27-2016, 11:40 AM

Quote:

Originally Posted by Psymon

Hope it's okay for a veritable Regex newbie to post a query in this thread -- I'm only just beginning to learn about this stuff, but with any luck it'll eventually start sinking in.

Ah, that is exactly what this thread is for.

Certainly we don't expect the people who already know the answer to ask questions...

Quote:

I seem to have developed an affinity for doing up electronic versions of "ye olde bookes" -- for example, right now I'm doing up several Shakespeare plays in the original Elizabethan English, endeavouring to give them somewhat of the "look and feel" of early typographic styles, complete with use of the long-ess (i.e. "ſ", the character that looks like an "f" but without the crossbar, and is actually an "s"). As with the unusual use of the "u" and "v" characters in early typography, where an "ſ" is used instead of an "s" has to do with placement within a word, rather than the "sound" of the character or anything else like that.

Very often when I find digital transcriptions of these early texts, they've kept the "u" and "v" oddities, but for some reason have changed all the long-esses to just "s" instead -- and so I have to change them back. The rule for when this is supposed to occur is actually fairly simple (although not all early printers/typographers followed this, but the vast majority did): virtually every instance of "s" should be changed to "ſ" unless it falls at the end of the word, then it remains as "s."

So to fix my texts up, I've been searching for every instance of "s" and then changing it to "ſ" -- which right away causes all my HTML code to need to be fixed up first, because things like "css," "class," "span," etc. get screwed up in the process -- and then I do another series of searches, looking for instances of "ſ" (long-ess) plus a "." or "," or ":" or ";" or "?" or "!" or ")" or "[space]" or "[apostrophe -- curly or otherwise]", plus "<" should there be a closing tag or something, i.e. wherever it might occur at the end of a word, and then changing it back to "s" again.

It's not that big a deal, actually, I can "correct" the long-esses in a whole book in, like, 5 or 10 minutes or so, but it would be totally cool to just whiz it off with one, single regex search, of course.

Oh, and it would have to be case-sensitive, of course -- all instances of upper-case "S" remain as "S."

[snip -- same thing with other character substitutions]

Case-sensitivity is a setting in the S&R box.

Using the power of lookaround zero-length assertions and word boundary zero-length assertions, the following regex will find a character-that-is-not-at-the-end-of-a-word (in this case "s") that is not inside HTML tags:

Find:

Code:

(?<=>[^<]*)s\B(?=[^>]*<)

Replace: (you guessed this one already, right?)

Code:

ſ

Explanation:

Just check for a tag closing character ">", followed by zero or more characters-that-aren't-a-tag-opener-"<"... wrapped in a lookbehind, so you don't clutter up the actual match.
Followed by a random character -- whatever you are looking for, in this case "s" -- followed by a negated word boundary zero-length assertion "\B".
Followed by zero or more characters-that-aren't-a-tag-closer-">" followed by a tag opener "<"... again wrapped in a lookahead, so you don't clutter up the actual match.
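The same idea can be sketched in Python for testing on a sample. Python's `re` module only accepts fixed-width lookbehinds, so this sketch swaps the lookarounds for a different technique: split the markup on tags and only touch the text between them. The function name `long_s` and the sample markup are made up for illustration.

```python
import re

# Anything from "<" to the next ">" is treated as a tag
TAG = re.compile(r'(<[^>]*>)')

def long_s(html: str) -> str:
    """Replace lowercase "s" with "ſ" unless it ends a word, outside tags only."""
    # Splitting on a capturing group keeps the tags: they land at odd indices
    parts = TAG.split(html)
    return ''.join(
        part if i % 2 else re.sub(r's\B', 'ſ', part)
        for i, part in enumerate(parts)
    )

print(long_s('<p class="verse">She possesses roses</p>'))
# <p class="verse">She poſſeſſes roſes</p>
```

The `s\B` part is unchanged from the Find expression above: `\B` only succeeds when the next character is another word character, so an "s" before a space, punctuation, an apostrophe or a tag is left alone, and the match stays case-sensitive so "S" is never touched.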

ReaderRabbit · 07-27-2016, 11:43 AM

Quote:

Originally Posted by Toxaris

How about: \1\2 \3

Wonderful! Worked perfectly

Leonatus · 08-20-2016, 11:57 AM

Hi all Regex cracks!
Is there a way to remove by one expression all anchor tags in an epub with the following syntax:

Code:

<a name="pagexx" title="yy" id="pagexx"></a>

where xx stands for the page number, and yy for diverse abbreviations of former issuers.

Maybe it's not even so difficult, but it's too much for my poor old brains.

Thanks in advance!

theducks · 08-20-2016, 12:27 PM

Quote:

Originally Posted by Leonatus

Hi all Regex cracks!
Is there a way to remove by one expression all anchor tags in an epub with the following syntax:

Code:

<a name="pagexx" title="yy" id="pagexx"></a>

where xx stands for the page number, and yy for diverse abbreviations of former issuers.

Maybe it's not even so difficult, but it's too much for my poor old brains.

Thanks in advance!

Yes
Select an example, ctrl-F (This also puts the selection in Find)
Right click in the find box: Tokenize

Replace should be: either blank or a space
you could also do it the other way:
replace each SET of the numbers with a \d+ (one or more digits, an Integer)

Turtle91 · 08-20-2016, 12:37 PM

^^^ What theducks said.
eg:
find: <a name="page\d+" title="\d+" id="page\d+"></a>
replace: blank
if what you are replacing is JUST numbers

-or-

find: <a name="page(.*?)" title="(.*?)" id="page(.*?)"></a>
replace: blank
if what you are replacing can include letters or symbols.

Leonatus · 08-20-2016, 12:37 PM

Thank you! It works as long as the issuer names are identical. But in fact, there are several names. It should be possible to catch them all. (Some names are separated by slashes, b.t.w.)
This refers to theducks' answer.

Turtle91 · 08-20-2016, 12:41 PM

Ooops - ninjad you Leonatus!

Doitsu · 08-20-2016, 01:22 PM

@Leonatus:

Use the following quick and dirty regex:

Code:

<a name=".*?" title=".*?" id=".*?"></a>

Leonatus · 08-20-2016, 01:28 PM

Thank you all! Works like a charm! Fantastic!

(But I made the mistake of leaving the "dot all" and "minimal match" boxes checked. The result was the loss of big parts of the text. So whoever wants to make use of this, take care!)
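The text loss is easy to reproduce on a small sample. The sketch below uses Python rather than Sigil, and Python has no equivalent of the "minimal match" box, so as an assumption a greedy `.*` combined with `re.DOTALL` stands in for whatever those two settings produced together. The sample markup is made up. It also shows a stricter variant using `[^"]*`, which cannot run past the closing quote of an attribute whatever the options are.

```python
import re

# Made-up sample: two page anchors with different issuer names
html = (
    '<p><a name="page12" title="Smith/Jones" id="page12"></a>First paragraph.</p>\n'
    '<p>Second paragraph.</p>\n'
    '<p><a name="page13" title="Brown" id="page13"></a>Third paragraph.</p>\n'
)

# The quick and dirty expression from above, default options
quick = re.compile(r'<a name=".*?" title=".*?" id=".*?"></a>')

# Stand-in for the accident: greedy quantifiers, and "." also matches newlines
greedy_dotall = re.compile(r'<a name=".*" title=".*" id=".*"></a>', re.DOTALL)

# Stricter variant: attribute values may not contain a double quote
strict = re.compile(r'<a name="page\d+" title="[^"]*" id="page\d+"></a>')

cleaned_quick = quick.sub('', html)
cleaned_greedy = greedy_dotall.sub('', html)
cleaned_strict = strict.sub('', html)

print(cleaned_greedy)  # only the third paragraph survives
```

With the greedy dot-all stand-in, a single match runs from the first anchor to the end of the last one, so everything in between is deleted along with the tags. The quick and the strict versions both remove just the two anchors.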

Psymon · 09-16-2016, 03:26 PM

Hey, folks -- I am trying to learn/do this regex stuff on my own (however slowly)! I'm stumped on something that I would think should be fairly easy, though.

In my book, I've got almost 300 paragraphs that start off with a dropcap, with this being an example of how those paragraphs begin...

Code:

<span class="initial">H</span>onourable

What I want to do is make that first word in smallcaps, and so the code in this latter example would then be...

Code:

<span class="initial">H</span><span class="smallcaps">ONOURABLE</span>

So basically what I want to do is convert the case of that first word to uppercase and then wrap that smallcaps span around the relevant part of the word.

For my regex search I initially came up with this...

<span class=\"initial\">(.+?)</span>([^>]*)\s

...and for replace this...

<span class="initial">\1</span><span class="smallcaps">\U\2\E</span>

...(and in this latter there's an invisible space there that I suppose you won't "see" in this post -- but it would be there in my S&R, of course).

For the life of me, though, that \s won't stop at the first space, that is, after the first word -- it selects the entire paragraph up to the last space in the paragraph! -- and it's also possible that there might actually be not a space, but a comma (or other punctuation) instead, and I'd like that closing span (for my smallcaps) to come before that.

I've searched around the 'net trying to find the solution to this, but just can't seem to find it -- every "answer" that I find on other sites and try just doesn't seem to work.

Thanks in advance, if anyone can help!

(PS. I'm not sure if my "replace" code is correct either, actually -- although I never got that far with figuring this out!)

DiapDealer · 09-16-2016, 04:58 PM

Quote:

Originally Posted by Psymon

For my regex search I initially came up with this...

<span class=\"initial\">(.+?)</span>([^>]*)\s

Try this instead:

Code:

<span class="initial">(.+?)</span>(\w*)

You don't need to escape the quotation marks in your search criteria. The \w will match only word characters, which means it will stop before any punctuation that might occur (an apostrophe in the word you want to smallcap will trip this up).

If any unicode characters can be expected, you may want to make the \w unicode-aware with the (*UCP) command.

Code:

(*UCP)<span class="initial">(.+?)</span>(\w*)

The (.+?) part can be a bit greedy. If one letter is all that's ever expected, I'd probably use (\w) instead.

Code:

(*UCP)<span class="initial">(\w)</span>(\w*)

If an opening quote may be in the raised|dropped cap as well, then explicitly include it (making it optional of course):

Code:

(*UCP)<span class="initial">(“?\w)</span>(\w*)

Probably gonna play hell on one-letter drop|raised-cap words, too ("I" and "A"), though.

Quote:

Originally Posted by Psymon

...and for replace this...

<span class="initial">\1</span><span class="smallcaps">\U\2\E</span>

The replace should work fine as is.

To eliminate the issue of one-letter word drop/smallcaps, I'd probably do something like this.
FIND:

Code:

(*UCP)<span class="initial">“?\w</span>\K(\w*)

REPLACE:

Code:

<span class="smallcaps">\U\1\E</span>

EDIT: None of the optional regex search options should be checked (other than maybe the "wrap" option) for any of my examples, by the way.
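For testing the idea outside Sigil, here is a rough Python equivalent. Python's `re` module has no `\K`, no `(*UCP)` and no `\U...\E` in replacement strings, so this sketch captures the drop-cap span as a group and uppercases the rest of the word in a replacement function instead (Python 3 string patterns are already Unicode-aware). It also uses `\w+` rather than `\w*`, which is a choice made here so that one-letter words such as "I" and "A" are skipped entirely. The function name `smallcap_first_word` is made up for illustration.

```python
import re

# Group 1: the drop-cap span, with an optional opening curly quote
# Group 2: the rest of the first word (at least one word character)
DROPCAP = re.compile(r'(<span class="initial">“?\w</span>)(\w+)')

def smallcap_first_word(html: str) -> str:
    """Wrap the remainder of each drop-capped word in an uppercased smallcaps span."""
    def repl(m: re.Match) -> str:
        return f'{m.group(1)}<span class="smallcaps">{m.group(2).upper()}</span>'
    return DROPCAP.sub(repl, html)

print(smallcap_first_word('<p><span class="initial">H</span>onourable sir,</p>'))
# <p><span class="initial">H</span><span class="smallcaps">ONOURABLE</span> sir,</p>
```

As with the Sigil version, `\w` stops at punctuation, so an apostrophe inside the first word will cut the smallcaps span short.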

Psymon · 09-16-2016, 05:39 PM

Quote:

Originally Posted by DiapDealer

Try this instead:

<big snip>

Thank you so, so much, DiapDealer! That did indeed seem to do the trick! I know I did have at least one (maybe more) one-letter opening words, but I'll find out eventually if anything went funny there -- once my book is done, I'll be going through the entire thing page-by-page (several times, in different orientations, etc.) to look for any weirdness going on anywhere.

In the meantime, though, that does seem to have done the trick! And thank you so much, too, for your detailed explanation of everything -- I'll study that more closely as well, and do my best to learn from it!

DiapDealer · 09-16-2016, 06:23 PM

Glad to help. Good luck!
