RegEx question about repeating

phossler · 03-12-2015, 11:31 AM

I have a simple stored-search regex to replace em dashes, en dashes, and hyphens in Hx headings (mostly for consistent formatting).

Find : <([Hh][1-6])>(.*?)\s*[-—–]{1,}\s*(.*?)</\1>

Replace: <\1>\2 \3</\1>

The once-in-a-while problem occurs when there are two or more em dashes, en dashes, or hyphens in the same Hx.

Code:

<h1>fasfasdsadf –    asdfsdfsd — sdafasdasd - asasdf - asdsadf - asdasdf</h1>

Is there a way to have the regex do them all, or do I still have to do [Replace All] until 0 are found?
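A minimal sketch of why repeated [Replace All] passes are needed with this pattern. It assumes Python's standard `re` module as a stand-in for the editor's regex engine, and the sample heading is made up for illustration. Each match consumes the whole heading, so one pass can remove only one dash per heading.

```python
import re

# The Find/Replace pair from the post, written for Python's stdlib `re`
# (an assumption: the editor's own engine may differ in details).
FIND = re.compile(r'<([Hh][1-6])>(.*?)\s*[-—–]{1,}\s*(.*?)</\1>')
REPLACE = r'<\1>\2 \3</\1>'

text = '<h1>a – b — c - d</h1>'  # hypothetical heading with three dashes
passes = 0
while True:
    # subn returns the new text and the number of replacements made
    text, count = FIND.subn(REPLACE, text)
    if count == 0:
        break
    passes += 1
    print(passes, text)
```

Every pass rewrites the heading once and removes the first remaining dash, so a heading with three dashes needs three passes before the search reports nothing found.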

eschwartz · 03-12-2015, 10:12 PM

Code:

<([Hh][1-6])>(.*?)\s*(?:[-—–]{1,}\s*(.*?))+</\1>

Repeating a capturing group

phossler · 03-13-2015, 09:29 AM

Did I do something wrong? I see how the repeat capture group works (or at least I think I do), but the Replace is not what I was expecting

It seems to replace too much, and all I was trying to do was end up with the fourth line in the picture. The F&R generated the first line.

phossler · 03-13-2015, 10:04 AM

I also tried an F&R function:

Code:

import regex
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('–',' ').replace('—',' ').replace('-',' ').replace(' {2,}',' ')

This does seem to replace the dashes, but the remove-multiple-spaces piece on the end doesn't seem to do anything.

So I suspect that I'm missing something fundamental here
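A short sketch of what is probably going on here: Python's `str.replace` treats its first argument as literal text, so `' {2,}'` is searched for as the five characters space, `{`, `2`, `,`, `}` and is never found. A regex substitution is needed for the space-collapsing step. The sample heading is made up, and the stdlib `re` module is used here as an assumption; the `regex` module imported in the post has a `sub` function with the same basic call shape.

```python
import re

heading = '<h2>one –  two —   three</h2>'  # hypothetical sample

# Dashes become spaces; str.replace handles these literal characters fine.
no_dashes = heading.replace('–', ' ').replace('—', ' ').replace('-', ' ')

# str.replace looks for the literal text ' {2,}', which is not present,
# so this call returns the string unchanged.
literal_attempt = no_dashes.replace(' {2,}', ' ')

# re.sub interprets ' {2,}' as "two or more spaces".
collapsed = re.sub(r' {2,}', ' ', no_dashes)

print(repr(literal_attempt))
print(repr(collapsed))
```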

eschwartz · 03-13-2015, 10:05 AM

Err, good point.

It will only capture the last match.

You could search for that match multiple times though, and use a "?" to make all but the first optional.
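To illustrate "it will only capture the last match", here is a small sketch using Python's stdlib `re` (an assumption; the editor's engine should behave the same way for a repeated group, but that is not verified here). The sample heading is hypothetical.

```python
import re

# The repeated-group pattern from the earlier reply.
pattern = re.compile(r'<([Hh][1-6])>(.*?)\s*(?:[-—–]{1,}\s*(.*?))+</\1>')

sample = '<h1>a – b — c - d</h1>'
m = pattern.search(sample)

# Group 3 sits inside the repeated (?:...)+ block, so each repetition
# overwrites it; only the text from the final repetition survives.
print(repr(m.group(2)), repr(m.group(3)))

# Using the original replacement therefore drops the middle segments.
result = pattern.sub(r'<\1>\2 \3</\1>', sample)
print(result)
```

This is consistent with the "replaces too much" result reported above: everything between the first and last dash is matched but not captured, so it is lost in the replacement.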

jbacelar · 03-13-2015, 12:07 PM

Function regex:

Search:
<([Hh][1-6])>.+?<

Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace("@-@","@").replace("@–@","@").replace("@—@","@")

Naturally, you must replace each @ with a space.

phossler · 03-15-2015, 09:27 PM

This is what I have now, and it works, but just looks ugly.

It replaces the em, en, and dashes in Hx's, even multiples, and then shrinks multiple spaces to a single space (up to 10)

Is there a way to make it a little more elegant (and maintainable)?

Find:

<([Hh][1-6])>(.*?)</\1>

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace("-"," ").replace("–"," ").replace("—"," ").replace ("  "," ").replace ("  "," ").replace ("  "," ").replace ("  "," ").replace ("  "," ").replace ("  "," ").replace ("  "," ").replace ("  "," ").replace ("  "," ").replace ("  "," ").replace ("  "," ")
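One way to avoid the fixed-length chain while staying with plain string methods is to loop until no double space remains. This is only a sketch; `squeeze_spaces` is a hypothetical helper name, not part of the editor's API.

```python
def squeeze_spaces(text):
    """Collapse every run of spaces in `text` to a single space."""
    # Each pass replaces pairs of spaces with one space, which roughly
    # halves every run; loop until no pair is left.
    while '  ' in text:
        text = text.replace('  ', ' ')
    return text


sample = 'a' + ' ' * 50 + 'b'  # hypothetical input with a long run
print(repr(squeeze_spaces(sample)))
```

Inside the editor function this could be applied to `match.group()` after the dash replacements, so the number of passes adapts to the input instead of being hard-coded.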

jbacelar · 03-16-2015, 03:14 AM

I do not know how the dashes and spaces are laid out in your text (or how many there are), but I think something like what I propose below (or similar) should work for up to 10 spaces.

Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('@@','@').replace('-','@').replace('–','@').replace('—','@').replace(' @@@','@').replace('@@','@')

DiapDealer · 03-16-2015, 07:27 AM

Find:

Code:

<([Hh][1-6])([^>]*)>(.*?)</\1>

Code:

import regex
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    text_str = regex.sub(r'''[-–—]''', ' ', match.group(3))
    text_str = regex.sub(r''' {2,}''', ' ', text_str)
    return '<{0}{1}>{2}</{0}>'.format(match.group(1), match.group(2), text_str)

I don't know about graceful (I've always been strangely comforted by the aesthetics of the python one-liner, myself), but the above won't have the ten-space-or-less limitation.

** added the possibility to work with header tags that may have attributes.
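For anyone who wants to try the function outside the editor, here is a rough test harness. It assumes the stdlib `re` module in place of `regex`, and it drives the function with `re.sub`, which passes only the match object; the extra editor arguments are absorbed by `*args, **kwargs`. The sample heading and its `class` attribute are made up.

```python
import re

# Same Find expression as above: tag name, optional attributes, content.
HEADING = re.compile(r'<([Hh][1-6])([^>]*)>(.*?)</\1>')


def replace(match, *args, **kwargs):
    # Turn hyphens, en dashes and em dashes in the heading text into spaces.
    text_str = re.sub(r'[-–—]', ' ', match.group(3))
    # Collapse any run of two or more spaces to a single space.
    text_str = re.sub(r' {2,}', ' ', text_str)
    # Rebuild the heading with its original tag name and attributes.
    return '<{0}{1}>{2}</{0}>'.format(match.group(1), match.group(2), text_str)


sample = '<h1 class="title">fasfasdsadf –    asdfsdfsd — sdafasdasd - asasdf</h1>'
cleaned = HEADING.sub(replace, sample)
print(cleaned)
```

Note that only group 3 is rewritten, so hyphens in the opening tag's attributes are left alone. Hyphens anywhere in the heading content are replaced, though, including those in hyphenated words or in the attributes of nested tags, so it is worth checking the results on such headings.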

jbacelar · 03-16-2015, 01:35 PM

@DiapDealer

Definitive.

phossler · 03-16-2015, 08:39 PM

@DiapDealer

I will study the technique since I can see many more places I can save myself some tedious work

phossler · 03-17-2015, 08:51 PM

@DiapDealer --

Could you please explain the syntax, grammar, and punctuation of the function?

I read this ...

https://docs.python.org/2.7/library/...ght=sub#re.sub

but still don't get it

The match.group(3) and the space{2,} I recognize, then things like

r'''something''' i.e. why the r and 3 single quotes?

return .... i.e. I can figure out the {0}, etc. but why in ' ....' and what is the .format for?

In case you haven't realized, my understanding of python is zilch

I am trying to figure out enough to cookbook some other functions

Thanks

eschwartz · 03-17-2015, 09:25 PM

r'''something''' -- see: https://docs.python.org/2.0/ref/strings.html

.format() acts on a string, and takes x arguments. For each argument, insert the value into the original string, replacing {n}.

DiapDealer · 03-17-2015, 10:00 PM

The r''' ''' is probably overkill in this situation, but I've gotten into the habit of using them all the time for regex expressions in python. '[-–—]' or "[-–—]" would achieve the same thing in this particular instance. It's still just a string representation of the regex expression.
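A quick sketch of what the `r` prefix and the triple quotes change, using plain Python string literals. The `\d+` pattern is just an illustrative example and is not taken from the function above.

```python
# Without the r prefix, a backslash starts an escape sequence, so a regex
# backslash has to be doubled. With the prefix it is kept as typed.
plain = '\\d+'
raw = r'\d+'
same_pattern = (plain == raw)

# '\n' is one newline character; r'\n' is a backslash followed by 'n'.
newline_len = len('\n')
raw_newline_len = len(r'\n')

# Triple quotes only change which quote characters may appear unescaped
# inside the literal; the resulting string is identical.
single = '[-–—]'
triple = r'''[-–—]'''
same_class = (single == triple)

print(same_pattern, newline_len, raw_newline_len, same_class)
```

Since `[-–—]` and ` {2,}` contain no backslashes or quote characters, all of these spellings produce the same string, which is why the prefix is optional in this particular function.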

Code:

text_str = regex.sub(r'''[-–—]''', ' ', match.group(3))

Read it as: substitute 'everything matching this expression' with 'this' in 'this string'.
Find all occurrences of - or – or — and replace them with a space in the string contained in the 3rd matching group. Store the results in text_str.

Code:

text_str = regex.sub(r''' {2,}''', ' ', text_str)

Find all occurrences of two or more consecutive spaces and replace them with a single space in the text_str string. Store the results in text_str.

Code:

return '<{0}{1}>{2}</{0}>'.format(match.group(1), match.group(2), text_str)

String formatting/substitution.

Code:

'Hello {0}'.format('there')

Substitute {0} with 'there'

Code:

'Hello {0} {1} {2}, {0}'.format('there', 'you', 10)

Becomes 'Hello there you 10, there'

You don't even need to use numbers if you're not going to repeat anything:

Code:

'Hello {} {} {}, {}'.format('there', 'you', 10, 'you')

You could also use string concatenation:

Code:

return '<' + match.group(1) + match.group(2) + '>' + text_str + '</' + match.group(1) + '>'

But then you have to worry about making sure everything is represented properly as a string beforehand. Probably not necessary in this case, but again, just a habit I've gotten into to avoid type mismatches (plus I just like it better than the %s %d string substitution method).

Code:

return '<%s%s>%s</%s>' % (match.group(1), match.group(2), text_str, match.group(1))

In this particular case:

Code:

return '<{0}{1}>{2}</{0}>'.format(match.group(1), match.group(2), text_str)

match.group(1) will be the tag name (h1, h2, h3, etc) and gets plugged into both {0}s.
match.group(2) will be any (optional) attributes (class="foo") and gets plugged into {1}.
text_str is our manipulated content from between the h-tags and gets plugged into {2}
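The three styles discussed above can be compared side by side in a small sketch. The values for `tag`, `attrs` and `text` are hypothetical stand-ins for `match.group(1)`, `match.group(2)` and `text_str`.

```python
tag, attrs, text = 'h2', ' class="foo"', 'Chapter One'

# str.format: {0} is reused for the opening and closing tag name.
with_format = '<{0}{1}>{2}</{0}>'.format(tag, attrs, text)

# Concatenation: the angle brackets and slash must be added by hand.
with_concat = '<' + tag + attrs + '>' + text + '</' + tag + '>'

# %-style substitution: the tag name has to be passed twice.
with_percent = '<%s%s>%s</%s>' % (tag, attrs, text, tag)

print(with_format)

# The type-mismatch point: concatenating a str and an int raises
# TypeError, while format() converts the number itself.
try:
    'Hello ' + 10
except TypeError:
    concat_failed = True
else:
    concat_failed = False

formatted = 'Hello {0}'.format(10)
print(concat_failed, formatted)
```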

eschwartz · 03-17-2015, 11:29 PM

Also because going forward, the format function is recommended -- for that very reason of course.
