#1 |
Addict
Posts: 358
Karma: 65460
Join Date: Jun 2011
Device: Kindle
|
Match a string while ignoring some character in that string?
So...
I'm cleaning up a book which has added title headings to the body of the text, so that it looks like this:
Code:
<p>We were walking down the street when</p>
<p>THIS IS THE BOOK TITLE</p>
<p>we saw a squirrel sleeping in the middle of the road.</p>
Code:
THI S IS THE B OOK TITLE
or
THIS IS THE BO OK TI TLE
or
THIS I S THE BOOK TITLE
or
THIS IS THE B O O K TITLE
....etc
Last edited by ElMiko; 12-01-2011 at 01:01 PM.
#2 | |
Well trained by Cats
Posts: 30,891
Karma: 59840954
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Uppercase only inside a p tag pair is fairly easy to trap and remove. Mixed case garbage
Set Case Sensitive Mode
Code:
<p>([A-Z])?| )+</p>\s+
Not tested. Use care. Abort (discard) if
should kill only the line with all caps and spaces
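For readers who want to try the "all caps and spaces inside a p tag pair" idea outside the editor, here is a minimal sketch in Python. It assumes the intent is to drop any paragraph that contains nothing but capital letters and spaces; the pattern is a simplified stand-in, not the one posted, and the sample text is taken from the first post.

```python
import re

html = (
    "<p>We were walking down the street when</p>\n"
    "<p>THI S IS THE B OOK TITLE</p>\n"
    "<p>we saw a squirrel sleeping in the middle of the road.</p>"
)

# Case-sensitive by default: one or more capitals or spaces, and nothing
# else, between <p> and </p>, plus any whitespace that follows.
caps_only = re.compile(r"<p>[A-Z ]+</p>\s*")

cleaned = caps_only.sub("", html)
print(cleaned)
```

Note that this would also remove any genuine paragraph written entirely in capitals, so it is worth reviewing the hits before replacing.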
#3 |
Addict
Posts: 358
Karma: 65460
Join Date: Jun 2011
Device: Kindle
|
The code you gave me didn't come up with any hits, but thankfully I think you gave me the snippet that will help me solve my problem (albeit in my own particularly unartful way): "(| )". If I do a search for:
Code:
<p>T(| )H(| )I(| )S(| )I(| )S(| )T(| )H(| )E(| )B(| )O(| )O(| )K(| )T(| )I(| )T(| )L(| )E(| )</p>
---
EDIT: So far so good. My 'puter hasn't exploded. Thanks (as always) for pointing me in the right direction, theducks! I'm still curious (though no longer desperately curious) whether there are neater ways to write that expression (one that would be case inclusive, and one that would be case exclusive)...
Last edited by ElMiko; 12-01-2011 at 01:53 PM.
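One possibly neater way to get the same effect, sketched in Python rather than in the editor's search box: build the pattern from the title instead of typing "(| )" after every letter. The helper name `spaced_pattern` is made up for this illustration, and it uses " *" (zero or more spaces) instead of "(| )", which is slightly more permissive and creates no capture groups.

```python
import re

def spaced_pattern(title: str) -> str:
    """Build a regex matching `title` inside <p>...</p> with optional
    spaces between (and after) its letters. Hypothetical helper."""
    letters = [re.escape(ch) for ch in title if not ch.isspace()]
    return "<p>" + " *".join(letters) + " *</p>"

pattern = spaced_pattern("THIS IS THE BOOK TITLE")

# Case exclusive (exact capitals only) and case inclusive variants.
case_sensitive = re.compile(pattern)
case_insensitive = re.compile(pattern, re.IGNORECASE)

for sample in ("<p>THI S IS THE B OOK TITLE</p>",
               "<p>THIS IS THE B O O K TITLE</p>",
               "<p>this is the book title</p>"):
    print(sample, bool(case_sensitive.search(sample)),
          bool(case_insensitive.search(sample)))
```

The only difference between the two variants is the `re.IGNORECASE` flag, which plays the role of the editor's case-sensitivity toggle.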
#4 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
You could always be lazy and just use something like:
Code:
<(\w+)\b[^>]*>[TISHEBOKL\s]{5,}</\1>
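To see what the character-class approach does and does not catch, here is a small Python check using the same pattern. The sample strings are invented for illustration; the last one shows the trade-off of being "lazy", since any element made up only of those letters will also match.

```python
import re

# Same pattern as in the post: any tag whose content is five or more
# characters drawn only from the title's letters and whitespace.
lazy = re.compile(r"<(\w+)\b[^>]*>[TISHEBOKL\s]{5,}</\1>")

samples = [
    "<p>THI S IS THE B OOK TITLE</p>",         # spaced-out title
    '<p class="x">THIS IS THE BO OK TI TLE</p>',  # title with an attribute on the tag
    "<p>we saw a squirrel</p>",                # ordinary text
    "<p>THE BLOKE</p>",                        # not the title, same letters
]

results = [bool(lazy.search(s)) for s in samples]
print(results)
```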
#5 |
Addict
Posts: 358
Karma: 65460
Join Date: Jun 2011
Device: Kindle
|
Thanks, Serpentine.
Another related question: when you have more than 9 parenthetically isolated expressions, how do you refer to the ones from 10 onward? For example, if I write a replace value of hello \10, it will produce "hello [whatever was in the first parenthetical expression]0" instead of "hello [whatever was in the tenth parenthetical expression]".
Last edited by ElMiko; 12-01-2011 at 03:48 PM.
#6 | |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Quote:
Code:
Capturing     : (Capture( the (third) word))     // The word 'third' is group 3
Non-capturing : (?:Capture(?: the (third) word)) // The word 'third' is group 1
Last edited by Serpentine; 12-01-2011 at 04:19 PM. Reason: code block
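The numbering difference can be checked directly. This is a small sketch using Python's `re` module, which supports the same `(?:...)` syntax; the sample sentence is simply the text inside the two patterns.

```python
import re

text = "Capture the third word"

# Every plain (...) gets a number, counted by its opening parenthesis.
capturing = re.match(r"(Capture( the (third) word))", text)

# (?:...) groups still group, but are skipped when numbering.
non_capturing = re.match(r"(?:Capture(?: the (third) word))", text)

print(capturing.re.groups, capturing.group(3))          # 3 groups, 'third' is group 3
print(non_capturing.re.groups, non_capturing.group(1))  # 1 group, 'third' is group 1
```

Turning helper groups such as "(| )" into "(?:| )" is one way to keep the number of back-references small.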
#7 |
Enthusiast
Posts: 30
Karma: 300
Join Date: Oct 2011
Location: Barcelona
Device: Sony PRS-650, PRS-T2
|
Yes, I'm afraid I was too optimistic when I wrote 'you can use as many groups as required'. Googling a bit, I've read somewhere that the maximum number of back-references allowed by most regex engines is 9 (\1...\9).
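Whether group 10 is reachable depends on the engine, so it is worth testing in whichever tool is being used. As one data point, Python's `re` module (not necessarily the engine behind the editor discussed in this thread) accepts the `\g<N>` form in replacement strings, which removes the ambiguity between "group 10" and "group 1 followed by 0". A minimal sketch with a made-up ten-group pattern:

```python
import re

# Ten single-letter capture groups: (a)(b)(c)...(j)
pattern = re.compile("".join(f"({ch})" for ch in "abcdefghij"))

# \g<10> means group 10; \g<1>0 means group 1 followed by a literal "0".
tenth = pattern.sub(r"hello \g<10>", "abcdefghij")
first_then_zero = pattern.sub(r"hello \g<1>0", "abcdefghij")

print(tenth)            # hello j
print(first_then_zero)  # hello a0
```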
#8 | |
Addict
Posts: 358
Karma: 65460
Join Date: Jun 2011
Device: Kindle
|
Quote:
EDIT: Although this would still be useful information to have, I have found a work-around for my current problem. I just replace the variable text (through a search that uses (| )) with a consistent text, thus eliminating all the parentheticals before I do another search/replace that can use parenthetical expressions without being overloaded by all the instances of (| ).
Last edited by ElMiko; 12-01-2011 at 06:56 PM.
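The two-pass work-around described here can be sketched in Python as follows. The marker text `@@TITLE@@` and the second-pass pattern are assumptions made for the example; the thread does not show the actual expressions used.

```python
import re

html = (
    "<p>We were walking down the street when</p>\n"
    "<p>THIS IS THE B O O K TITLE</p>\n"
    "<p>we saw a squirrel sleeping in the middle of the road.</p>"
)

MARKER = "<p>@@TITLE@@</p>"  # hypothetical placeholder, easy to search for

# Pass 1: replace every spaced-out variant of the title with the marker.
spaced_title = re.compile("<p>" + " *".join("THISISTHEBOOKTITLE") + " *</p>")
normalized = spaced_title.sub(MARKER, html)

# Pass 2: with the variable text gone, only two groups are needed to
# join the paragraphs on either side of the marker.
join_around = re.compile(
    r"(\w)</p>\s*" + re.escape(MARKER) + r"\s*<p>(\w)"
)
joined = join_around.sub(r"\1 \2", normalized)

print(joined)
```

Because the first pass uses no capture groups at all, the second pass never gets close to group 10.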
#9 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
#10 | |
Addict
Posts: 358
Karma: 65460
Join Date: Jun 2011
Device: Kindle
|
Quote:
Spoiler:
The search/replace I do is:
Spoiler:
And what I keep getting is:
Spoiler:
Last edited by ElMiko; 12-01-2011 at 07:43 PM.
#11 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Replace the last ([a-z]) with (?=[a-z])
Are you converting from PDF? It's usually easier to rename all of the paragraph/book titles that are repeated at page breaks to something easy to find; from there you can search for that and join the two paragraphs around it if needed.
#12 |
Addict
Posts: 358
Karma: 65460
Join Date: Jun 2011
Device: Kindle
|
@Serpentine - Thanks. Two follow-ups:
1) Could you explain the code change?
2) Converting from PDF, how would I go about following your advice?
#13 | |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
([a-z])
Match a single character from a-z and store the match as a group. Since that character ('t' in your case) was then part of the match, it would be replaced.
(?=[a-z])
Lookahead, (?=...): the following pattern should be found ahead, but is not actually part of the match, i.e. it matches everything up until that point, then asks, 'is the next character from a-z?'. Since this is not actually part of the match, the replacement does what you want.
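The difference between the two forms is easy to see in a small Python experiment. The pattern and sample text here are hypothetical (the original search expression is not shown in the thread); the point is only that a consumed letter disappears in the replacement, while a looked-ahead letter stays.

```python
import re

text = "<p>We were walking down the street when</p>\n<p>we saw a squirrel.</p>"

# Consuming group: the lowercase letter is part of the match,
# so it is replaced along with the tags.
consuming = re.sub(r"</p>\s*<p>([a-z])", " ", text)

# Lookahead: the letter must follow, but stays outside the match,
# so only the tags and whitespace are replaced.
lookahead = re.sub(r"</p>\s*<p>(?=[a-z])", " ", text)

print(consuming)  # ...street when e saw a squirrel.</p>
print(lookahead)  # ...street when we saw a squirrel.</p>
```

With the consuming form, the letter could also be restored by writing the group back in the replacement (for example " \1"), but the lookahead avoids needing a group at all.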