07-17-2012, 08:30 PM   #10
Tex2002ans
Quote:
Originally Posted by DiapDealer
Do you want it to exclude any bullet that occurs anywhere in any string that is enclosed by those "p class='calibre1'" tags? That could prove pretty tough, if so. I know I can't get my head around the expression to accomplish that (not that THAT renders it impossible by any means).

I'd say your best bet is to isolate/alter the scene-breaks first and then catch any possible OCR glitches in a subsequent search.

I agree. I spent the past 40 minutes trying to come up with a single regular expression that handles every situation at once, and I kept getting stuck on the Scene Breaks.

I would follow mmat1's advice: temporarily replace all of the Scene Break '•'s with something else, then remove the remaining • characters with a normal search/replace. If you want to be lazy, I narrowed it down to two regexes:

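Search for the first regex and replace with \1\2:

Code:
(<p class="calibre1">[^•]*)•([^<]+</p>)
*** Not very efficient, but gets the job done ***

First group: "In a p with class calibre1", grab 0 or more characters that are NOT '•'.
  • This handles cases where the • is the first character in the paragraph, or somewhere in the middle.
  • This is the highly inefficient part: it will grab EVERYTHING until it hits a •.

Middle part: Matches the • itself. It sits outside both groups, so replacing with \1\2 drops it.

Second group: Grabs the rest until it hits a </p>. It needs at least one character after the •, so it does not match a scene break or a • that is the last character of a paragraph.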
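
If you want to sanity-check the first regex outside the editor, here is a minimal Python sketch. It assumes the literal • sits between the two groups (as the "Middle part" description says) and that Python's re module treats this pattern the same way your editor's regex engine does. The sample strings are taken from my test sentences; the variable names are just for illustration.

```python
import re

# First regex from the post: group 1, then a literal bullet, then group 2.
# Assumption: the bullet belongs between the groups, outside both of them.
FIRST = re.compile(r'(<p class="calibre1">[^•]*)•([^<]+</p>)')

samples = [
    '<p class="calibre1">•</p>',                           # scene break
    '<p class="calibre1">•A</p>',                          # bullet first
    '<p class="calibre1">• This is a test sentence.</p>',  # bullet first, then text
    '<p class="calibre1">A•</p>',                          # bullet last
]

# Replacing with \1\2 keeps both groups and drops the bullet between them.
cleaned = [FIRST.sub(r'\1\2', s) for s in samples]

for before, after in zip(samples, cleaned):
    print(before, '->', after)
```

The scene break and the trailing-bullet paragraph come back unchanged, which is why a second regex is needed for the trailing case.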

Code:
([^>])•(</p>)
The second regex (also replaced with \1\2) catches the cases where • is the last character of the paragraph. It does not match your scene breaks, because [^>] requires a character other than '>' immediately before the •.
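
The same kind of check works for the second regex. This is only a sketch in Python, assuming the replacement is \1\2 as with the first regex; strip_trailing_bullet is a made-up helper name, not something from any editor.

```python
import re

# Second regex from the post: a character that is not '>', then the bullet,
# then the closing tag. The bullet sits outside both groups.
SECOND = re.compile(r'([^>])•(</p>)')

def strip_trailing_bullet(html: str) -> str:
    """Remove a bullet that is the last character of a paragraph (hypothetical helper)."""
    return SECOND.sub(r'\1\2', html)

print(strip_trailing_bullet('<p class="calibre1">A•</p>'))
print(strip_trailing_bullet('<p class="calibre1">This is a test sentence. •</p>'))
print(strip_trailing_bullet('<p class="calibre1">•</p>'))  # scene break, left alone
```

Note that the space before a trailing • is kept, so you may want a follow-up pass to trim whitespace before </p>.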

By the way, here are the example sentences I came up with and was working on:

Code:
<p class="calibre1">•</p>

<p class="calibre1">•A</p>

<p class="calibre1">A•</p>

<p class="calibre1">This has none.</p>

<p class="calibre1">This has none.</p>

<p class="calibre1">• This is a test sentence.</p>

<p class="calibre1">•</p>

<p class="calibre1">This is a test sentence. •</p>

<p class="calibre1">This is two test sentences. • This is two test sentences.</p>
Also, whenever you ask for help with regular expressions, it helps to include lots of test cases. It took me a while to work out exactly what you were asking.
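
For comparison, the non-lazy route (protect the scene breaks, strip every other •, then restore the scene breaks) can be sketched in Python like this. The placeholder text and the function name are my own assumptions, and it assumes a scene break is exactly <p class="calibre1">•</p> with nothing else inside the paragraph.

```python
# Hypothetical placeholder; pick any string that cannot occur in the book.
PLACEHOLDER = '@@SCENEBREAK@@'
SCENE_BREAK = '<p class="calibre1">•</p>'
PROTECTED = '<p class="calibre1">' + PLACEHOLDER + '</p>'

def remove_stray_bullets(html: str) -> str:
    """Strip every bullet except the ones that make up a scene break."""
    # 1. Temporarily swap the scene breaks for the placeholder.
    protected = html.replace(SCENE_BREAK, PROTECTED)
    # 2. Plain search/replace: remove all remaining bullets.
    stripped = protected.replace('•', '')
    # 3. Put the scene breaks back.
    return stripped.replace(PROTECTED, SCENE_BREAK)

page = (
    '<p class="calibre1">•</p>\n'
    '<p class="calibre1">This is two test sentences. • This is two test sentences.</p>'
)
print(remove_stray_bullets(page))
```

This takes three passes instead of two regexes, but it also handles paragraphs with more than one stray •.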

Last edited by Tex2002ans; 07-17-2012 at 08:33 PM.