#1 |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
RegEx question about repeating
I have a simple saved-search regex that replaces em dashes, en dashes, and hyphens in Hx headings (mostly for consistent formatting).
Find: <([Hh][1-6])>(.*?)\s*[-—–]{1,}\s*(.*?)</\1>
Replace: <\1>\2 \3</\1>
The occasional problem occurs when there are two or more em dashes, en dashes, or hyphens in the same Hx.
Code:
<h1>fasfasdsadf – asdfsdfsd — sdafasdasd - asasdf - asdsadf - asdasdf</h1>
Last edited by phossler; 03-12-2015 at 03:16 PM. Reason: Supposed to be Hx and not just H1s
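For anyone who wants to reproduce the problem outside the editor, here is a minimal sketch. It assumes Python's standard re module rather than the regex module used later in this thread, and the sample heading is made up. Because the pattern consumes the whole heading as one match, a single pass only removes the first dash.

```python
import re

# Find and Replace expressions from the post above.
FIND = r'<([Hh][1-6])>(.*?)\s*[-—–]{1,}\s*(.*?)</\1>'
REPLACE = r'<\1>\2 \3</\1>'

# Hypothetical sample heading containing three different dashes.
heading = '<h2>One – Two — Three - Four</h2>'

# The whole heading is a single match, so only the first dash is replaced.
one_pass = re.sub(FIND, REPLACE, heading)
print(one_pass)
```

Running the search again would remove the next dash, which is why the problem only shows up now and then.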
#2 |
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Code:
<([Hh][1-6])>(.*?)\s*(?:[-—–]{1,}\s*(.*?))+</\1>
#3 |
Wizard
Did I do something wrong? I see how the repeated capture group works (or at least I think I do), but the Replace result is not what I was expecting.
It seems to replace too much. All I was trying to do was end up with the fourth line in the picture; the F&R generated the first line.
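A sketch of why the result can look like it replaces too much, assuming the standard re module and a made-up heading. When a capture group sits inside a repeated group, only the text from the last repetition is kept, so \3 ends up holding just the final segment and everything in between is dropped.

```python
import re

# Pattern from post #2: the dash-plus-text part is repeated with "+".
FIND = r'<([Hh][1-6])>(.*?)\s*(?:[-—–]{1,}\s*(.*?))+</\1>'

# Hypothetical sample heading.
heading = '<h2>One – Two — Three - Four</h2>'

m = re.search(FIND, heading)
first_part = m.group(2)   # text before the first dash
last_part = m.group(3)    # only the LAST repetition of the inner group survives

result = re.sub(FIND, r'<\1>\2 \3</\1>', heading)
print(first_part, '|', last_part, '|', result)
```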
#4 |
Wizard
Also, I tried an F&R function:
Code:
import regex

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('–',' ').replace('—',' ').replace('-',' ').replace(' {2,}',' ')
So I suspect that I'm missing something fundamental here.
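One likely reason the last step has no effect: str.replace() treats its first argument as literal text, so ' {2,}' is searched for character by character instead of being read as a regular expression. A small sketch, assuming the standard re module (regex.sub should behave the same way for this pattern):

```python
import re

text = 'One   Two'  # three spaces between the words

# str.replace looks for the literal characters " {2,}", which are not in the text.
literal = text.replace(' {2,}', ' ')

# re.sub interprets " {2,}" as "two or more spaces".
collapsed = re.sub(r' {2,}', ' ', text)

print(repr(literal), repr(collapsed))
```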
#5 |
Ex-Helpdesk Junkie
Err, good point.
You could search for that match multiple times though, and use a "?" to make all but the first optional.
#6 |
Interested in the matter
Posts: 421
Karma: 426094
Join Date: Dec 2011
Location: Spain, south coast
Device: Pocketbook InkPad 3
Function regex:
Search: <([Hh][1-6])>.+?<
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace("@-@","@").replace("@–@","@").replace("@—@","@")
Naturally, you must replace each @ with a space.
Last edited by jbacelar; 03-13-2015 at 12:49 PM.
#7 |
Wizard
This is what I have now. It works, but it just looks ugly.
It replaces the em dashes, en dashes, and hyphens in Hx headings, even multiple ones, and then shrinks runs of spaces (up to 10) to a single space.
Is there a way to make it a little more elegant (and maintainable)?
Find: <([Hh][1-6])>(.*?)</\1>
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace("-"," ").replace("–"," ").replace("—"," ").replace (" "," ").replace (" "," ").replace (" "," ").replace (" "," ").replace (" "," ").replace (" "," ").replace (" "," ").replace (" "," ").replace (" "," ").replace (" "," ").replace (" "," ")
#8 |
Interested in the matter
I do not know how the dashes and spaces are laid out in your text (or how many there are), but I think something like what I propose (or similar) should work (for up to 10 spaces).
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('@@','@').replace('-','@').replace('–','@').replace('—','@').replace(' @@@','@').replace('@@','@')
#9 |
Grand Sorcerer
Posts: 28,559
Karma: 204127028
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Find:
Code:
<([Hh][1-6])([^>]*)>(.*?)</\1>
Code:
import regex

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    text_str = regex.sub(r'''[-–—]''', ' ', match.group(3))
    text_str = regex.sub(r''' {2,}''', ' ', text_str)
    return '<{0}{1}>{2}</{0}>'.format(match.group(1), match.group(2), text_str)
** Added the possibility to work with header tags that may have attributes.
Last edited by DiapDealer; 03-16-2015 at 07:50 AM.
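To try the function outside the editor, a rough harness like the following may help. It is only a sketch under a few assumptions: it uses the standard re module instead of regex, it shortens the function signature to the one argument that is actually used, and the sample heading is invented.

```python
import re

FIND = r'<([Hh][1-6])([^>]*)>(.*?)</\1>'

def replace(match, *args, **kwargs):
    # Swap every hyphen, en dash and em dash in the heading text for a space.
    text_str = re.sub(r'[-–—]', ' ', match.group(3))
    # Collapse runs of two or more spaces into a single space.
    text_str = re.sub(r' {2,}', ' ', text_str)
    # Rebuild the tag, keeping its name and any attributes.
    return '<{0}{1}>{2}</{0}>'.format(match.group(1), match.group(2), text_str)

# Hypothetical heading with an attribute and several dashes.
sample = '<h2 class="title">One – Two — Three - Four</h2>'

# re.sub calls replace() once for each heading it finds.
cleaned = re.sub(FIND, replace, sample)
print(cleaned)
```

Doing the dash and space clean-up inside the function avoids the repeated-group problem, because the Find expression only has to locate the heading once.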
#10 |
Interested in the matter
@DiapDealer
Definitive.
#11 |
Wizard
@DiapDealer
I will study the technique, since I can see many more places where I can save myself some tedious work.
#12 |
Wizard
@DiapDealer --
Could you please explain the syntax, grammar, and punctuation of the function?
I read this ... https://docs.python.org/2.7/library/...ght=sub#re.sub but still don't get it.
The match.group(3) and the space{2,} I recognize, then things like
r'''something''', i.e. why the r and 3 single quotes?
return ...., i.e. I can figure out the {0}, etc., but why in ' ....' and what is the .format for?
In case you haven't realized, my understanding of Python is zilch. I am trying to figure out enough to cookbook some other functions.
Thanks
#13 |
Ex-Helpdesk Junkie
r'''something''' -- see: https://docs.python.org/2.0/ref/strings.html
.format() is called on a string and takes any number of arguments. It returns a copy of the string in which each {n} placeholder is replaced by the value of argument number n (counting from 0).
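A small sketch of what the r prefix and the triple quotes do, based on standard Python string rules. The r prefix keeps backslashes exactly as typed, and triple quotes let the string contain single and double quotes without escaping. The sample strings are made up.

```python
# Without r, "\n" is one character (a newline); with r it is a backslash plus "n".
plain = '\n'
raw = r'\n'
print(len(plain), len(raw))

# Triple quotes can hold both kinds of quote characters unescaped.
quoted = r'''class="foo" or class='foo' '''
print(quoted)

# For a pattern with no backslashes or quotes, all spellings give the same string.
same = r'''[-–—]''' == '[-–—]' == "[-–—]"
print(same)
```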
#14 |
Grand Sorcerer
The r''' ''' is probably overkill in this situation, but I've gotten into the habit of using it all the time for regular expressions in Python. '[-–—]' or "[-–—]" would achieve the same thing in this particular instance. It's still just a string representation of the regular expression.
Code:
text_str = regex.sub(r'''[-–—]''', ' ', match.group(3))
Find all occurrences of - or – or — and replace them with a space in the string contained in the 3rd matching group. Store the results in text_str.
Code:
text_str = regex.sub(r''' {2,}''', ' ', text_str)
Find every run of two or more spaces in text_str and replace it with a single space. Store the results back in text_str.
Code:
return '<{0}{1}>{2}</{0}>'.format(match.group(1), match.group(2), text_str)
Build the replacement string and return it. .format() fills the numbered placeholders with its arguments, counting from 0:
Code:
'Hello {0}'.format('there')
A placeholder can be used more than once, and the arguments don't have to be strings:
Code:
'Hello {0} {1} {2}, {0}'.format('there', 'you', 10)
You don't even need to use numbers if you're not going to repeat anything:
Code:
'Hello {} {} {}, {}'.format('there', 'you', 10, 'you')
The same return string could be built by concatenation:
Code:
return '<' + match.group(1) + match.group(2) + '>' + text_str + '</' + match.group(1) + '>'
Or with the older % formatting:
Code:
return '<%s%s>%s</%s>' % (match.group(1), match.group(2), text_str, match.group(1))
Compare that with the .format() version:
Code:
return '<{0}{1}>{2}</{0}>'.format(match.group(1), match.group(2), text_str)
match.group(1) is the tag name (h1 to h6) and gets plugged into both {0} placeholders. match.group(2) will be any (optional) attributes (class="foo") and gets plugged into {1}. text_str is our manipulated content from between the h-tags and gets plugged into {2}.
Last edited by DiapDealer; 03-17-2015 at 10:11 PM.
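As a quick check that concatenation, % formatting and .format() can all build the same return string, here is a sketch with made-up values standing in for match.group(1), match.group(2) and text_str (the names tag and attrs are placeholders). The concatenation here spells out the angle brackets so that all three produce a complete tag.

```python
# Placeholder values standing in for match.group(1), match.group(2) and text_str.
tag = 'h2'
attrs = ' class="foo"'
text_str = 'One Two Three'

# Three ways to assemble the same heading.
by_concat = '<' + tag + attrs + '>' + text_str + '</' + tag + '>'
by_percent = '<%s%s>%s</%s>' % (tag, attrs, text_str, tag)
by_format = '<{0}{1}>{2}</{0}>'.format(tag, attrs, text_str)

print(by_format)
```

Note that .format() is the only one of the three where the tag name is passed in once and reused through {0}.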
#15 |
Ex-Helpdesk Junkie
Also because going forward, the format function is recommended -- for that very reason of course.