Regex examples - Page 11

DiapDealer · 09-27-2012, 07:35 PM

Quote:

The HTML code looks like:

Code:

<p class="calibre"><span>bad policy to answer a</span></p>

<p class="calibre"><span>direct question. He kept shaking his head like a china figure.

Ugh. Those empty spans surrounding literally everything are always a pain in the ass. You'll almost surely need to get rid of them first. The problem is ... there can be nested spans (italics/bolds/etc) within them. And that makes it quite painful to regex them away (without funkifying your "real" formatting spans).

If I have the original text to proof against, I sometimes find it easier (and less frustrating) just to blast ALL the spans away. Every single one. And then redo any italic and/or other special formatting using the physical copy as a guide. It's drastic, yes, but sometimes it's less drastic than fixing the havoc that a regex run on nested spans can wreak.

In one fell swoop, all span tags (opening and closing) ... gone (when you replace it with nothing of course):

Code:

</?span[^>]*?>

It all depends on the complexity of the book's formatting, of course. I may not always opt for the "nuclear" span removal approach, but I've done it quite a few times.

Use with an appropriate level of trepidation, of course...
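For anyone who would rather test the "nuclear" option outside the editor first, here is a minimal sketch in Python. The pattern is the one from the post; the function name and sample string are only illustrative.

```python
import re

# Pattern from the post: any opening or closing span tag, with or without attributes.
SPAN_TAG = re.compile(r"</?span[^>]*?>")

def strip_all_spans(html):
    """Delete every <span ...> and </span> tag, keeping the text between them."""
    return SPAN_TAG.sub("", html)

sample = '<p class="calibre"><span>bad policy to answer a</span></p>'
print(strip_all_spans(sample))
```

Because opening and closing tags are removed independently, nesting does not matter here, but any formatting carried by a span is lost as well.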

JMikeD · 09-28-2012, 12:15 AM

Quote:

Originally Posted by DiapDealer

Use with an appropriate level of trepidation, of course...

It's probably just as easy to export the entire thing to RTF, clean everything up in OpenOffice and use the ePub Export extension in OO. That gives pretty clean results.

Jellby · 09-28-2012, 04:01 AM

I'd first identify the spans that do something (search for "<span ", replace them with something more meaningful (<i>, <strong>...), or with some other temporary mark), then delete the remaining bogus spans.
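A rough sketch of that two-step idea in Python. The class name "italic" is hypothetical (a real book will use whatever the converter generated), and the first pattern only converts italic spans that contain no other span inside them.

```python
import re

# Hypothetical class name, chosen only for illustration.
# The tempered dot (?:(?!</?span\b).) refuses to cross another span tag,
# so only innermost italic spans are converted.
ITALIC_SPAN = re.compile(
    r'<span class="italic">((?:(?!</?span\b).)*?)</span>', re.DOTALL
)
ANY_SPAN_TAG = re.compile(r"</?span[^>]*?>")

def keep_italics_then_strip(html):
    # Step 1: give the meaningful spans a real tag.
    html = ITALIC_SPAN.sub(r"<i>\1</i>", html)
    # Step 2: delete whatever span tags are left.
    return ANY_SPAN_TAG.sub("", html)

sample = ('<p class="calibre"><span>He was '
          '<span class="italic">very</span> sure.</span></p>')
print(keep_italics_then_strip(sample))
```

Italic spans that wrap other spans are not handled by this sketch and would still need a manual pass.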

WS64 · 09-29-2012, 03:01 PM

I would remove ALL <span> (without anything behind) and let Tidy remove the corresponding closing spans.

Then search for

</p>
<p class="calibre">([a-z])

and replace it with
_\1
(_ = blank)

Also search for

([a-zA-Z,])</p>
<p class="calibre">

and replace it with
\1_
(_ = blank)
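The two paragraph-joining searches could be tried out like this in Python. One assumption that is not in the post: the gap between </p> and the next <p class="calibre"> is matched with \s*, so any amount of whitespace or line breaks is accepted.

```python
import re

# Assumption: paragraphs are separated by arbitrary whitespace, hence \s*.
LOWERCASE_START = re.compile(r'</p>\s*<p class="calibre">([a-z])')
LETTER_OR_COMMA_END = re.compile(r'([a-zA-Z,])</p>\s*<p class="calibre">')

def join_broken_paragraphs(html):
    # A paragraph starting with a lower-case letter continues the previous one.
    html = LOWERCASE_START.sub(r" \1", html)
    # A paragraph ending in a letter or comma continues into the next one.
    html = LETTER_OR_COMMA_END.sub(r"\1 ", html)
    return html

sample = ('<p class="calibre">bad policy to answer a</p>\n\n'
          '<p class="calibre">direct question.</p>')
print(join_broken_paragraphs(sample))
```

This only makes sense after the spans are gone, since the patterns expect the text to sit directly inside the p tags.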

mrmikel · 10-23-2012, 07:14 AM

Just a very simple expression for finding instances of a period followed by a space and then a lower-case letter, caused by poor OCR.

\. ([a-z])

Not a candidate for auto search and replace because it matches abbreviations, too.
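Since this one is for finding rather than replacing, a small Python sketch that only lists the hits with some context for manual review may be useful. The function name and the context width are arbitrary choices.

```python
import re

SUSPECT = re.compile(r"\. ([a-z])")

def find_suspects(text, context=15):
    """Return a short snippet around each match so a human can review it."""
    snippets = []
    for m in SUSPECT.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        snippets.append(text[start:end])
    return snippets

sample = "He stopped. the dog barked. Then, e.g. this happened."
for snippet in find_suspects(sample):
    print(repr(snippet))
```

In the sample, both the real OCR error ("stopped. the") and the abbreviation ("e.g. this") are reported, which is exactly why it should not be run as an automatic replace.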

Toxaris · 10-24-2012, 01:10 AM

Quote:

Originally Posted by WS64

I would remove ALL <span> (without anything behind) and let Tidy remove the corresponding closing spans.

I would strongly recommend not doing that. If there are nested spans, Tidy doesn't always remove the correct closing span. That can make a real mess of your book.

It is never a good idea to trust Tidy to make the right choice....

roger64 · 10-25-2012, 12:58 AM

Now that Sigil has a nice Search Editor, I can add some more regex.
I would like to set up this one:

It's about superscript text:

The text me mes lle lles er e o placed within a sup tag and followed by a normal space should instead be followed by a &nbsp;

Say <sup>me</sup>(normal space) should be replaced by
<sup>me</sup>&nbsp;

(me, lle are superscript abbreviations for M(adame), M(ademoiselle)...)

I hope I have been clear enough...

Perkin · 10-25-2012, 05:29 AM

Search (put a space at the end, after </sup>)

Code:

<sup>(me|mes|lle|lles|er|e|o)</sup>

Replace

Code:

<sup>\1</sup>&nbsp;
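The same search and replace as a Python sketch, with the trailing space written into the pattern explicitly. The function name and sample text are only illustrative.

```python
import re

# Note the literal space after </sup>: only that normal space is replaced.
SUP_THEN_SPACE = re.compile(r"<sup>(me|mes|lle|lles|er|e|o)</sup> ")

def fix_sup_spacing(html):
    return SUP_THEN_SPACE.sub(r"<sup>\1</sup>&nbsp;", html)

sample = "M<sup>me</sup> Bovary et M<sup>lle</sup> Dupont, le 1<sup>er</sup> janvier"
print(fix_sup_spacing(sample))
```

Because the alternatives are anchored between <sup> and </sup>, ordinary words such as "me" or "mes" elsewhere in the text are not touched.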

roger64 · 10-25-2012, 05:57 AM

@Perkin

Thanks a lot for your help. I did not know how to deal with the "false" words like me, mes...

Already in use.

roger64 · 12-23-2012, 08:47 AM

Hi

I'm trying to set up a regex for French text.

We have some acronyms joined with hyphens (8208), like J.-C, P.-D.G. (and the list can grow). They keep getting broken at the hyphen, and it would be much better if they were not. That's why I would like to replace their hyphens with non-breaking hyphens (8209).
I do not know how to set up this regex. Ideally, I would like to be able to add a new word easily.

I think there must be something better than this.
I wrote only 8208 instead of the full &#...:

Search: (J.|P.)8208(C.|D.G)
Replace: \18209\2

Jellby · 12-23-2012, 09:43 AM

Are there instances of a hyphen after a period that you do not want to replace? If there aren't, you can just replace all ".-" with ".¬" (where I use ¬ for the non-breaking hyphen), with appropriate escaping of the period if needed.

Doitsu · 12-23-2012, 09:46 AM

I'm sure that the Regex gurus will come up with a much more efficient Regex, but I'd simply search for a capital letter with a period followed by ‐ and another capital letter followed by a period:

Find: ([[:upper:]]\.)‐([[:upper:]]\.)
Replace: \1‑\2

This should work in Sigil and any other editor with PCRE support.
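For testing outside Sigil, a Python sketch of the same replacement. Python's re module does not support the POSIX class [[:upper:]], so [A-Z] is swapped in here; that is a substitution for illustration and it will miss accented capitals. The text is assumed to contain the actual U+2010 and U+2011 characters rather than numeric entities.

```python
import re

HYPHEN = "\u2010"               # U+2010 HYPHEN (decimal 8208)
NON_BREAKING_HYPHEN = "\u2011"  # U+2011 NON-BREAKING HYPHEN (decimal 8209)

# [A-Z] stands in for [[:upper:]], which Python's re module does not support.
ACRONYM_HYPHEN = re.compile(r"([A-Z]\.)" + HYPHEN + r"([A-Z]\.)")

def protect_acronym_hyphens(text):
    return ACRONYM_HYPHEN.sub(r"\1" + NON_BREAKING_HYPHEN + r"\2", text)

sample = "avant J.\u2010C. et le P.\u2010D.G. de la firme"
print(protect_acronym_hyphens(sample))
```

Nothing needs to be added per acronym: any capital-plus-period pair on both sides of the hyphen is picked up.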

roger64 · 12-23-2012, 10:46 PM

Hi

As with many things, I gather experience book after book. After preparing a history book, I realized that using a hyphen for J.-C. (70 occurrences of it in one book) was NOT a nice idea.

I have no idea how many words of this kind I may find, and I am really not sure that all occurrences of .- deserve the same treatment. That's why I first thought of adding them one by one.

But, in fact, I realize there does not seem to be a very big risk in trying your solutions. So I will try them. Thanks for them.

And enjoy a Merry Christmas.

Jellby · 12-24-2012, 03:36 AM

Well, try searching for ".-" first and see which occurrences you find. With any luck you'll see they all want to be non-breaking, or you may see a pattern (like Doitsu's suggestion) and find some typos.

mzmm · 01-09-2013, 09:08 AM

found myself parsing messy html today, removing empty <p> tags, or <p> tags containing &nbsp;, or <p><i></i></p>, <p><b> </b></p> etc. so that i could space the paragraphs consistently in css, and, inspired by this thread, thought i'd share the snippet in case anyone has a use for it.

i realize it could probably be more concise, and i wouldn't just blindly replace all, but it seems to do the job. it removes <p> tags that may also contain <b>, <i>, <span>, have no content, or 1 or more spaces, or a <br>, <br/>, <br />.

Code:

<p[^>]*>((<\w+[^>/]*>)+)?(<br((\s)?/)?>|&nbsp;|\s*)((</\w+[^>]*>)+)?</p>
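A quick way to preview what the pattern would catch before replacing anything, sketched in Python. The pattern is copied unchanged from the post; the function name and sample markup are made up for the test.

```python
import re

# Pattern copied unchanged from the post.
EMPTY_PARAGRAPH = re.compile(
    r"<p[^>]*>((<\w+[^>/]*>)+)?(<br((\s)?/)?>|&nbsp;|\s*)((</\w+[^>]*>)+)?</p>"
)

def find_empty_paragraphs(html):
    """List every paragraph the pattern considers empty, without changing anything."""
    return [m.group(0) for m in EMPTY_PARAGRAPH.finditer(html)]

sample = (
    '<p>Text</p><p></p><p class="calibre">&nbsp;</p>'
    '<p><i></i></p><p><b> </b></p><p><br /></p>'
)
for hit in find_empty_paragraphs(sample):
    print(hit)
```

Listing the matches first fits the advice not to blindly replace all; once the list looks right, the same compiled pattern can be used with sub("", html).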

