Match a string while ignoring some characters in that string?

ElMiko · 12-01-2011, 12:34 PM

So...

I'm cleaning up a book which has added title headings to the body of the text so that it looks like this:

Code:

<p>We were walking down the street when</p>

<p>THIS IS THE BOOK TITLE</p>

<p>we saw a squirrel sleeping in the middle of the road.</p>

Given the number of words in the title, and the fact that it is in all caps, this would generally be an easy fix. Unfortunately, the title has spaces thrown into it randomly so that it will look like:

Code:

THI S IS THE B OOK TITLE
or
THIS IS THE BO OK TI TLE
or
THIS I S THE BOOK TITLE
or
THIS IS THE B O O K TITLE
....etc

Is there any way to get a match by matching the letters in the string while ignoring the spaces? And furthermore, is it possible if the title is a mix of uppercase and lowercase?

theducks · 12-01-2011, 01:11 PM

Are you trying to fix or remove this paragraph?
Uppercase-only text inside a p tag pair is fairly easy to trap and remove.
Mixed case garbage

Set Case Sensitive Mode

Code:

<p>([A-Z]| )+</p>\s+

Note the vertical bar followed by a space.
Not tested. Use care. Abort (discard) if

It should kill only the line with all caps and spaces.
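A minimal sketch of the same idea in Python, assuming a case-sensitive search and using the character class `[A-Z ]` as another way of saying "capital letter or space". It uses Python's `re` module rather than the editor's own search, and the names `ALL_CAPS_PARAGRAPH` and `strip_caps_paragraphs` are made up for illustration. It would remove any paragraph made up only of capitals and spaces, not just the title, so check the hits first.

```python
import re

# A paragraph whose body is only capital letters and spaces,
# plus any whitespace that follows it. Case-sensitive by default.
ALL_CAPS_PARAGRAPH = re.compile(r"<p>[A-Z ]+</p>\s*")


def strip_caps_paragraphs(html):
    """Remove every all-caps-and-spaces paragraph from html."""
    return ALL_CAPS_PARAGRAPH.sub("", html)


sample = (
    "<p>We were walking down the street when</p>\n\n"
    "<p>THI S IS THE B OOK TITLE</p>\n\n"
    "<p>we saw a squirrel sleeping in the middle of the road.</p>\n"
)
cleaned = strip_caps_paragraphs(sample)
print(cleaned)
```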

ElMiko · 12-01-2011, 01:34 PM

Between the typo-riddled thread title and forgetting to say how I wanted to fix my problem, I'm really banging on all cylinders today... For the record, I am trying to remove the title heading.

The code you gave me didn't come up with any hits, but thankfully I think you gave me the snippet that will help me solve my problem (albeit in my own particularly unartful way): "(| )".

If I do a search for:

Code:

<p>T(| )H(| )I(| )S(| )I(| )S(| )T(| )H(| )E(| )B(| )O(| )O(| )K(| )T(| )I(| )T(| )L(| )E(| )</p>

That should recognize any of the variations on "THIS IS THE BOOK TITLE" that I listed earlier, including mixed upper- and lowercase (if I leave "Match case" unchecked), right?

---
EDIT: So far so good. My 'puter hasn't exploded. Thanks (as always) for pointing me in the right direction, theducks! I'm still curious (though no longer desperately curious) whether there are neater ways to write that expression (one that would be case inclusive, and one that would be case exclusive)...
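One possibly neater way, sketched in Python: build the pattern from the title instead of typing every `(| )` by hand. This is only an illustration under a few assumptions: it runs in Python's `re` module rather than in the editor's search box, it writes the optional space as ` ?` (which behaves like `(| )` without creating a group), and the helper name `spaced_pattern` is hypothetical.

```python
import re


def spaced_pattern(title, ignore_case=False):
    """Build a regex matching <p>title</p> with an optional single
    space allowed between any two letters of the title."""
    # Drop the title's own spaces, escape each remaining character.
    letters = [re.escape(ch) for ch in title if not ch.isspace()]
    # " ?" is an optional space: the non-capturing cousin of "(| )".
    body = " ?".join(letters) + " ?"
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile("<p>" + body + "</p>", flags)


strict = spaced_pattern("THIS IS THE BOOK TITLE")
relaxed = spaced_pattern("THIS IS THE BOOK TITLE", ignore_case=True)

variants = [
    "<p>THI S IS THE B OOK TITLE</p>",
    "<p>THIS IS THE BO OK TI TLE</p>",
    "<p>THIS I S THE BOOK TITLE</p>",
    "<p>THIS IS THE B O O K TITLE</p>",
]
for line in variants:
    print(bool(strict.search(line)), line)
```

The case-exclusive and case-inclusive versions differ only in the `ignore_case` flag, which plays the role of the "Match case" checkbox.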

Serpentine · 12-01-2011, 01:43 PM

You could always be lazy and just use something like:

Code:

<(\w+)\b[^>]*>[TISHEBOKL\s]{5,}</\1>

Use minimal searching and it should be fine. I'd grep before letting it loose, to check for any stray hits.
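To see what "stray hits" can look like, here is a small Python sketch that runs the same pattern over a few sample lines. The sample lines (including the `class="x"` attribute and the "THE HOTEL" paragraph) are invented for illustration, and Python's `re` stands in for the editor's engine.

```python
import re

# Any tag whose body is 5+ characters drawn only from the title's
# letters and whitespace, closed by the same tag name (\1).
LAZY = re.compile(r"<(\w+)\b[^>]*>[TISHEBOKL\s]{5,}</\1>")

samples = [
    "<p>THI S IS THE B OOK TITLE</p>",
    '<p class="x">THIS IS THE B O O K TITLE</p>',
    "<p>We were walking down the street when</p>",
    "<p>THE HOTEL</p>",  # stray hit: uses only letters found in the title
]


def hits(lines):
    """Return the lines the lazy pattern would match."""
    return [line for line in lines if LAZY.search(line)]


for line in hits(samples):
    print(line)
```

The last sample is matched even though it is not the title, which is why a dry run before replacing is worth the time.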

ElMiko · 12-01-2011, 03:45 PM

Thanks, Serpentine.

Another related question, when you have more than 9 parenthetically isolated expressions, how do you refer to the ones from 10 onward? For example, if I write a replace value of hello \10, it will produce "hello [whatever was in the first parenthetical expression]0" instead of "hello [whatever was in the tenth parenthetical expression]".

Serpentine · 12-01-2011, 04:14 PM

Quote:

Originally Posted by ElMiko

Another related question, when you have more than 9 parenthetically isolated expressions, how do you refer to the ones from 10 onward?

There doesn't seem to be any mention of this limit in the relevant Qt documentation, however most regex implementations work as you would expect. In this case, I would suggest removing capturing groups that you are not using, by making them into non-capturing groups.

Code:

Capturing :     (Capture( the (third) word))     // The word 'third' is group 3
Non-capturing : (?:Capture(?: the (third) word)) // The word 'third' is group 1

Non-capturing groups work exactly like normal groups, except that they are not returned.
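The renumbering can be checked quickly in Python. This is a sketch using Python's `re` module, which is an assumption here since the thread is about a Qt-based engine, but the `(?:...)` syntax is the same.

```python
import re

text = "Capture the third word"

# Three capturing groups: 'third' is group 3.
capturing = re.search(r"(Capture( the (third) word))", text)

# Outer two groups made non-capturing: 'third' becomes group 1.
non_capturing = re.search(r"(?:Capture(?: the (third) word))", text)

print(capturing.group(3))      # third
print(non_capturing.group(1))  # third
print(len(capturing.groups()), len(non_capturing.groups()))  # 3 1
```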

sellew · 12-01-2011, 05:01 PM

Yes, I'm afraid I was too optimistic when I wrote 'you can use as many groups as required'. Googling a bit, I've read somewhere that the maximum number of back-references allowed by most regex engines is 9 (\1...\9).
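For comparison only: the limit depends on the engine. Python's `re` module, which is not the engine discussed in this thread, accepts `\g<10>` in a replacement string to refer to the tenth group without ambiguity. A minimal sketch with a made-up ten-group pattern:

```python
import re

# Ten single-letter capturing groups, purely for illustration.
TEN_GROUPS = re.compile(r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)")

# \g<10> means "group 10"; \g<1>0 means "group 1, then a literal 0".
tenth = TEN_GROUPS.sub(r"hello \g<10>", "abcdefghij")
first_then_zero = TEN_GROUPS.sub(r"hello \g<1>0", "abcdefghij")

print(tenth)            # hello j
print(first_then_zero)  # hello a0
```

Whether the editor's own search supports anything similar has to be checked against its documentation; reducing the number of capturing groups, as suggested above, avoids the question.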

ElMiko · 12-01-2011, 06:29 PM

Quote:

Originally Posted by Serpentine

There doesn't seem to be any mention of this limit in the relevant Qt documentation, however most regex implementations work as you would expect. In this case, I would suggest removing capturing groups that you are not using, by making them into non-capturing groups.

Code:

Capturing :     (Capture( the (third) word))     // The word 'third' is group 3
Non-capturing : (?:Capture(?: the (third) word)) // The word 'third' is group 1

Non-capturing groups work exactly like normal groups, except that they are not returned.

Have you confirmed this? Because this was one of the things I tried first and it didn't seem to make a difference. Just tried it again, and still no difference.

EDIT: Although this would still be useful information to have, I have found a work-around for my current problem. I just replace the variable text (through a search that uses (| )) with a consistent text, thus eliminating all the parentheticals, before I do another search/replace that can use parenthetical expressions without being overloaded by all the instances of (| ).

Serpentine · 12-01-2011, 06:51 PM

Quote:

Originally Posted by ElMiko

Have you confirmed this?

Yeah, I just tested it - works correctly for me.

If you can give the pattern and perhaps a sample+expectation, I'll have a look.

ElMiko · 12-01-2011, 07:38 PM

Quote:

Originally Posted by Serpentine

Yeah, I just tested it - works correctly for me.

If you can give the pattern and perhaps a sample+expectation, I'll have a look.

I want to:

Spoiler:

The search/replace I do is:

Spoiler:

And what I keep getting is:

Spoiler:

Serpentine · 12-01-2011, 08:30 PM

Replace the last ([a-z]) with (?=[a-z])

Are you converting from PDF? It's usually easier to rename all of the paragraph/book titles that are repeated at page breaks to something easy to find. From there you can easily search for that and join the two paragraphs around it if needed.

ElMiko · 12-01-2011, 08:58 PM

@Serpentine - Thanks. Two follow-ups:

1) could you explain the code change?
2) converting from pdf, how would I go about following your advice?

Serpentine · 12-01-2011, 10:05 PM

Quote:

Originally Posted by ElMiko

1) could you explain the code change?

([a-z])
Match a single character from a-z and store it as a group. Since that character ('t' in your case) was then part of the match, it would be replaced.
(?=[a-z])
Lookahead, (?=...)
The following pattern should be found ahead, but is not actually part of the match, i.e. it matches everything up until that point, then asks, 'is the next character from a-z?'. Since this is not actually part of the match, the replacement does what you want.
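The difference shows up clearly in a small Python sketch. The pattern and sample text below are hypothetical stand-ins (the actual search from the earlier post is not shown in the thread), chosen to join two paragraphs where the second starts with a lowercase letter.

```python
import re

text = "<p>walking down the street when</p>\n\n<p>we saw a squirrel.</p>"

# ([a-z]) consumes the first letter, so the replacement eats it.
consuming = re.sub(r"</p>\s*<p>([a-z])", " ", text)

# (?=[a-z]) only peeks at the letter, so it survives the replacement.
lookahead = re.sub(r"</p>\s*<p>(?=[a-z])", " ", text)

print(consuming)  # <p>walking down the street when e saw a squirrel.</p>
print(lookahead)  # <p>walking down the street when we saw a squirrel.</p>
```

The consuming version could also be repaired by putting the group back in the replacement, but the lookahead avoids spending a group number on it.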

Quote:

Originally Posted by ElMiko

2) converting from pdf, how would I go about following your advice?

Hmmm, I generally filter out empty paragraphs (like <p>(\s*|&nbsp;)</p>) first. If you have recurring things like that badly formatted chapter heading, change it to something easy to see/match, e.g. <p>REMOVE ME</p>. It's often useful not to remove them completely; in this case they are useful for joining broken paragraphs.
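The workflow above can be sketched in Python as three passes. Everything here is an assumption for illustration: the pattern names, the `clean` helper, the hard-coded title pattern, and the choice to join only when the next paragraph starts with a lowercase letter.

```python
import re

# Pass 1: empty paragraphs (only whitespace and/or &nbsp;).
EMPTY_PARAGRAPH = re.compile(r"<p>(?:\s|&nbsp;)*</p>\s*")

# Pass 2: the garbled heading, with an optional space after each letter.
HEADING = re.compile(
    r"<p>T ?H ?I ?S ?I ?S ?T ?H ?E ?B ?O ?O ?K ?T ?I ?T ?L ?E ?</p>"
)
MARKER = "<p>REMOVE ME</p>"

# Pass 3: a marker sitting between two halves of a broken paragraph.
JOIN = re.compile(r"</p>\s*" + re.escape(MARKER) + r"\s*<p>(?=[a-z])")


def clean(html):
    """Drop empty paragraphs, mark headings, then join around markers."""
    html = EMPTY_PARAGRAPH.sub("", html)
    html = HEADING.sub(MARKER, html)
    return JOIN.sub(" ", html)


sample = (
    "<p>We were walking down the street when</p>\n"
    "<p>&nbsp;</p>\n"
    "<p>THI S IS THE B OOK TITLE</p>\n"
    "<p>we saw a squirrel sleeping in the middle of the road.</p>"
)
print(clean(sample))
```

Markers that do not sit in the middle of a broken paragraph are left in place by this sketch, so they can still be found and reviewed by hand.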
