Old 12-01-2011, 12:34 PM   #1
ElMiko
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
Match a string while ignoring some character in that string?

So...

I'm cleaning up a book which has added title headings to the body of the text so that it looks like this:

Code:
<p>We were walking down the street when</p>

<p>THIS IS THE BOOK TITLE</p>

<p>we saw a squirrel sleeping in the middle of the road.</p>
Given the number of words in the title, and the fact that it is in all caps, this would generally be an easy fix. Unfortunately, the title has spaces thrown into it randomly so that it will look like:

Code:
THI S IS THE B OOK TITLE
or
THIS IS THE BO OK TI TLE
or
THIS I S THE BOOK TITLE
or
THIS IS THE B O O K TITLE
....etc
Is there any way to get a match by matching the letters in the string while ignoring the spaces? And furthermore, is it possible if the title is a mix of uppercase and lowercase?

Last edited by ElMiko; 12-01-2011 at 01:01 PM.
Old 12-01-2011, 01:11 PM   #2
theducks
Well trained by Cats
Posts: 29,800
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Are you trying to fix or remove this paragraph?
Uppercase only inside a p tag pair is fairly easy to trap and remove.
Mixed case garbage

Set Case Sensitive Mode
Code:
<p>([A-Z]| )+</p>\s+
Note the vertical bar followed by a space.
Not tested. Use care. Abort (discard) if
Should kill only the line with all caps and spaces.
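For anyone following along, here is a minimal sketch of the all-caps-and-spaces idea using Python's `re` module. This is not Sigil's own engine, so treat it as an approximation. The pattern `<p>(?:[A-Z]| )+</p>\s*` and the sample text are assumptions made for illustration, and it only works when the `<p>` tag carries no attributes.

```python
import re

# Assumed sample: the stray title sits between two halves of a sentence.
sample = (
    "<p>We were walking down the street when</p>\n\n"
    "<p>THI S IS THE B OOK TITLE</p>\n\n"
    "<p>we saw a squirrel sleeping in the middle of the road.</p>"
)

# A paragraph containing only capital letters and spaces, plus any
# whitespace that follows it. Python's re is case-sensitive by default,
# which plays the role of "Case Sensitive Mode".
caps_only = re.compile(r"<p>(?:[A-Z]| )+</p>\s*")

cleaned = caps_only.sub("", sample)
print(cleaned)
```

Ordinary paragraphs survive because they contain lowercase letters, which the character class does not allow.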
Old 12-01-2011, 01:34 PM   #3
ElMiko
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
Between the typo-riddled thread title and forgetting to say how I wanted to fix my problem, I'm really banging on all cylinders today... For the record, I am trying to remove the title heading.

The code you gave me didn't come up with any hits, but thankfully I think you gave me the snippet that will help me solve my problem (albeit in my own particularly unartful way): "(| )".

If I do a search for:

Code:
<p>T(| )H(| )I(| )S(| )I(| )S(| )T(| )H(| )E(| )B(| )O(| )O(| )K(| )T(| )I(| )T(| )L(| )E(| )</p>
That should recognize any of the variations on "THIS IS THE BOOK TITLE" that I listed earlier, including mixed upper- and lowercase (if I leave "Match case" unchecked), right?

---
EDIT:
So far so good. My 'puter hasn't exploded. Thanks (as always) for pointing me in the right direction, theducks! I'm still curious (though no longer desperately curious) whether there are neater ways to write that expression (one that would be case inclusive, and one that would be case exclusive)...
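One possibly neater way to write that expression is to generate it from the title rather than typing every `(| )` by hand. The sketch below uses Python's `re` module, not Sigil's engine; the helper name `spaced_title_pattern` is hypothetical, and ` ?` (an optional single space) stands in for `(| )` so that no capturing groups are created.

```python
import re

def spaced_title_pattern(title: str) -> str:
    """Build a regex matching `title` with a stray space allowed after any letter."""
    letters = [re.escape(ch) for ch in title if not ch.isspace()]
    # " ?" is an optional single space: the same idea as "(| )" but non-capturing.
    return "<p>" + " ?".join(letters) + " ?</p>"

pattern = spaced_title_pattern("THIS IS THE BOOK TITLE")

variants = [
    "<p>THI S IS THE B OOK TITLE</p>",
    "<p>THIS IS THE BO OK TI TLE</p>",
    "<p>THIS I S THE BOOK TITLE</p>",
    "<p>THIS IS THE B O O K TITLE</p>",
]

# Case-exclusive (exact case) and case-inclusive (ignore case) versions.
exact = re.compile(pattern)
any_case = re.compile(pattern, re.IGNORECASE)

for v in variants:
    print(bool(exact.fullmatch(v)), bool(any_case.fullmatch(v.lower())))
```

The case-inclusive and case-exclusive versions differ only in the `re.IGNORECASE` flag, which corresponds roughly to leaving "Match case" unchecked.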

Last edited by ElMiko; 12-01-2011 at 01:53 PM.
Old 12-01-2011, 01:43 PM   #4
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
You could always be lazy and just use something like:
Code:
<(\w+)\b[^>]*>[TISHEBOKL\s]{5,}</\1>
Use minimal searching and it should be fine. I'd grep before letting it loose, to check for any stray hits.
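To see why checking for stray hits matters, here is a small sketch of the same pattern in Python's `re` module (an approximation of the Sigil behaviour, not a test of it). The sample strings are invented for illustration.

```python
import re

# The pattern above: any tag whose content is 5+ characters drawn only from
# the title's letters and whitespace, closed by the same tag name (\1).
loose = re.compile(r"<(\w+)\b[^>]*>[TISHEBOKL\s]{5,}</\1>")

hits = [
    '<p class="calibre2">THIS IS THE B O O K TITLE</p>',  # intended hit
    "<h2>THE BEST HOTELS</h2>",  # stray hit: same letters, different text
]
misses = [
    "<p>THE END</p>",  # N and D are not in the character class
    "<p>This is</p>",  # lowercase letters are not in the class
]

for s in hits + misses:
    print(s, "->", bool(loose.search(s)))
```

The character class only checks which letters appear, not their order, so any all-caps heading built from the same letters will also match.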
Old 12-01-2011, 03:45 PM   #5
ElMiko
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
Thanks, Serpentine.

Another related question: when you have more than 9 parenthetically isolated expressions, how do you refer to the ones from 10 onward? For example, if I write a replace value of hello \10, it will produce "hello [whatever was in the first parenthetical expression]0" instead of "hello [whatever was in the tenth parenthetical expression]".
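Engines differ on this point. As one point of comparison only (it says nothing about how Sigil's engine behaves), Python's `re` module lets the replacement string name a group explicitly with `\g<N>`, which removes the ambiguity between "group 10" and "group 1 followed by a zero". The eleven-group pattern below is purely illustrative.

```python
import re

# Eleven single-letter capturing groups, purely for illustration.
pattern = re.compile(r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)")
text = "abcdefghijk"

# \g<10> always means group 10.
tenth = pattern.sub(r"hello \g<10>", text)

# \g<1>0 means group 1 followed by a literal zero.
first_then_zero = pattern.sub(r"hello \g<1>0", text)

print(tenth)            # hello j
print(first_then_zero)  # hello a0
```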

Last edited by ElMiko; 12-01-2011 at 03:48 PM.
Old 12-01-2011, 04:14 PM   #6
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by ElMiko View Post
Another related question, when you have more than 9 parenthetically isolated expressions, how do you refer to the ones from 10 onward?
There doesn't seem to be any mention of this limit in the relevant Qt documentation; however, most regex implementations work as you would expect. In this case, I would suggest removing capturing groups that you are not using by making them into non-capturing groups.
Code:
Capturing :     (Capture( the (third) word))     // The word 'third' is group 3
Non-capturing : (?:Capture(?: the (third) word)) // The word 'third' is group 1
Non-capturing groups work exactly like normal groups, except that they are not returned.
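The numbering difference can be checked with a short sketch in Python's `re` module, which uses the same `(?:...)` syntax. The sample sentence mirrors the example above.

```python
import re

text = "Capture the third word"

capturing = re.fullmatch(r"(Capture( the (third) word))", text)
non_capturing = re.fullmatch(r"(?:Capture(?: the (third) word))", text)

print(capturing.groups())      # ('Capture the third word', ' the third word', 'third')
print(non_capturing.groups())  # ('third',)
```

Both patterns match the same text; only the numbering of what is returned changes.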

Last edited by Serpentine; 12-01-2011 at 04:19 PM. Reason: code block
Old 12-01-2011, 05:01 PM   #7
sellew
Enthusiast
Posts: 30
Karma: 300
Join Date: Oct 2011
Location: Barcelona
Device: Sony PRS-650, PRS-T2
Yes, I'm afraid I was too optimistic when I wrote 'you can use as many groups as required'. Googling a bit, I've read somewhere that the maximum number of back-references allowed by most regex engines is 9 (\1...\9).
Old 12-01-2011, 06:29 PM   #8
ElMiko
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
Quote:
Originally Posted by Serpentine View Post
There doesn't seem to be any mention of this limit in the relevant Qt documentation; however, most regex implementations work as you would expect. In this case, I would suggest removing capturing groups that you are not using by making them into non-capturing groups.
Code:
Capturing :     (Capture( the (third) word))     // The word 'third' is group 3
Non-capturing : (?:Capture(?: the (third) word)) // The word 'third' is group 1
Non-capturing groups work exactly like normal groups, except that they are not returned.
Have you confirmed this? Because this was one of the things I tried first and it didn't seem to make a difference. Just tried it again, and still no difference.

EDIT:
Although this would still be useful information to have, I have found a workaround for my current problem. I just replace the variable text (through a search that uses (| )) with a consistent text, thus eliminating all the parentheticals, before I do another search/replace that can use parenthetical expressions without being overloaded by all the instances of (| ).

Last edited by ElMiko; 12-01-2011 at 06:56 PM.
Old 12-01-2011, 06:51 PM   #9
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by ElMiko View Post
Have you confirmed this?
Yeah, I just tested it - works correctly for me.

If you can give the pattern and perhaps a sample+expectation, I'll have a look.
Old 12-01-2011, 07:38 PM   #10
ElMiko
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
Quote:
Originally Posted by Serpentine View Post
Yeah, I just tested it - works correctly for me.

If you can give the pattern and perhaps a sample+expectation, I'll have a look.
I want to:
Spoiler:

turn this (and all similar instances):

but</p>

<p class="calibre2"></p>

<p class="calibre2">The Gh oul G al l ery</p>

<p class="calibre2">the lights


into this (or its formatting equivalent):
but the lights


The search/replace I do is:

Spoiler:

SEARCH:
</p>[\s]+<p class="calibre2"></p>[\s]+<p class="calibre2">T((| )h(| )e(| )G(| )h(| )o(| )u(| )l(| )G(| )a(| )l(| )l(| )e(| )r(| ))y</p>[\s]+<p class="calibre2">([a-z])

REPLACE:
\2 -----> (there's a "space" before the backslash)


And what I keep getting is:
Spoiler:
but he lights ------> Note the lost "t"
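A sketch of what is probably going on, reproduced in Python's `re` module (group numbering follows the usual left-to-right rule by opening parenthesis, though Sigil's engine is not being tested here): every `(| )` is itself a capturing group, so `\2` is the optional space after the "T", not the final `([a-z])`.

```python
import re

search = (
    r'</p>[\s]+<p class="calibre2"></p>[\s]+<p class="calibre2">'
    r"T((| )h(| )e(| )G(| )h(| )o(| )u(| )l(| )G(| )a(| )l(| )l(| )e(| )r(| ))y"
    r'</p>[\s]+<p class="calibre2">([a-z])'
)
sample = (
    'but</p>\n\n<p class="calibre2"></p>\n\n'
    '<p class="calibre2">The Gh oul G al l ery</p>\n\n'
    '<p class="calibre2">the lights'
)

compiled = re.compile(search)
m = compiled.search(sample)

print(compiled.groups)                 # 16 capturing groups in total
print(repr(m.group(2)))                # '' (the empty "space" after the T)
print(repr(m.group(compiled.groups)))  # 't' (the last group, ([a-z]))
print(compiled.sub(r" \2", sample))    # but he lights
```

Because group 2 matched nothing, the replacement is just the leading space, and the "t" that was consumed by the last group is lost.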

Last edited by ElMiko; 12-01-2011 at 07:43 PM.
Old 12-01-2011, 08:30 PM   #11
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Replace the last ([a-z]) with (?=[a-z])

Are you converting from PDF? It's usually easier to rename all of the paragraph/book titles that are repeated at page breaks to something easy to find; from there you can easily search for that and join the two paragraphs around it if needed.
Old 12-01-2011, 08:58 PM   #12
ElMiko
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
@Serpentine - Thanks. Two follow-ups:

1) Could you explain the code change?
2) Converting from PDF, how would I go about following your advice?
Old 12-01-2011, 10:05 PM   #13
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
Quote:
Originally Posted by ElMiko View Post
1) Could you explain the code change?
([a-z])
Match a single character from a-z; store the match as a group match. Since that character ('t' in your case) was then part of the match, it would be replaced.
(?=[a-z])
Lookahead, (?=...)
The following pattern should be found ahead, but is not actually part of the match, i.e. it matches everything up until that point, then asks, 'is the next character from a-z?'. Since this is not actually part of the match, the replacement does what you want.
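The difference between the two forms can be seen side by side in a short sketch using Python's `re` module, which supports the same `(?=...)` lookahead syntax. The heading pattern is built with ` ?` instead of `(| )` purely to keep it short; that substitution is an assumption of this example, not part of the advice above.

```python
import re

sample = (
    'but</p>\n\n<p class="calibre2"></p>\n\n'
    '<p class="calibre2">The Gh oul G al l ery</p>\n\n'
    '<p class="calibre2">the lights'
)

# Heading letters with an optional space after each; " ?" does not capture.
heading = " ?".join("TheGhoulGallery")
prefix = (
    r'</p>\s+<p class="calibre2"></p>\s+<p class="calibre2">'
    + heading
    + r'</p>\s+<p class="calibre2">'
)

consuming = re.compile(prefix + r"([a-z])")    # the 't' becomes part of the match
lookahead = re.compile(prefix + r"(?=[a-z])")  # the 't' is checked but not consumed

print(consuming.sub(" ", sample))  # but he lights
print(lookahead.sub(" ", sample))  # but the lights
```

With the lookahead, the replacement only needs to be a single space; no back-reference is required at all.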

Quote:
Originally Posted by ElMiko View Post
2) Converting from PDF, how would I go about following your advice?
Hmmm, I generally filter out empty paragraphs (like <p>(\s*|&nbsp;)</p>) first. If you have recurring things like that badly formatted chapter heading, change them to something easy to see/match, e.g. <p>REMOVE ME</p>. It's often useful not to remove them completely; in this case, for example, they are useful for joining broken paragraphs.
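The workflow described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names (`mark_headings`, `drop_empty_paragraphs`, `join_around_marker`) are hypothetical, the marker text is the one suggested above, and "joinable" is assumed to mean that the paragraph after the marker starts with a lowercase letter.

```python
import re

MARKER = "<p>REMOVE ME</p>"  # placeholder text suggested above

def mark_headings(html: str, title: str) -> str:
    """Pass 1: replace every spaced-out variant of `title` with a fixed marker."""
    letters = " ?".join(re.escape(ch) for ch in title if not ch.isspace())
    variant = re.compile(r"<p\b[^>]*>" + letters + r" ?</p>", re.IGNORECASE)
    return variant.sub(MARKER, html)

def drop_empty_paragraphs(html: str) -> str:
    """Remove paragraphs that hold only whitespace or &nbsp; entities."""
    return re.sub(r"<p\b[^>]*>(?:\s|&nbsp;)*</p>\s*", "", html)

def join_around_marker(html: str) -> str:
    """Pass 2: join the paragraphs around a marker when the next one starts lowercase."""
    broken = re.compile(
        r"</p>\s*" + re.escape(MARKER) + r"\s*<p\b[^>]*>(?=[a-z])"
    )
    return broken.sub(" ", html)

sample = (
    'but</p>\n\n<p class="calibre2"></p>\n\n'
    '<p class="calibre2">The Gh oul G al l ery</p>\n\n'
    '<p class="calibre2">the lights'
)

marked = mark_headings(sample, "The Ghoul Gallery")
result = join_around_marker(drop_empty_paragraphs(marked))
print(result)  # but the lights
```

Markers that sit between two complete paragraphs are left in place by the last step, so they can be reviewed and removed separately.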