How to exclude strings before and after
I'm trying to find instances of a character ("•") in a document where it might pop up as a scanning artifact rather than an intended character. Unfortunately, "•" is also used as a section break. What I'd like to do is search for instances of "•" when it is not enclosed by <p> tags.
My first impulse was to do a search for: Code:
(?<!<p class="calibre1">)•(?!</p>) |
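For anyone following along outside the editor, here is a minimal sketch of how that first expression behaves. It uses Python's standard `re` module instead of the editor's own search, which is an assumption (the thread does not say which regex engine is in use), and the sample lines are made up for illustration.

```python
import re

# The expression from the first post: a bullet that is NOT preceded by the
# opening tag AND NOT followed by the closing tag.
pattern = re.compile(r'(?<!<p class="calibre1">)•(?!</p>)')

# Hypothetical sample lines, not taken from the actual book.
samples = {
    "scene_break": '<p class="calibre1">•</p>',
    "mid_text": '<p class="calibre1">It was a dark• night.</p>',
    "leading": '<p class="calibre1">•A</p>',
    "trailing": '<p class="calibre1">A•</p>',
}

# True where the expression finds a bullet, False where it finds nothing.
hits = {name: bool(pattern.search(text)) for name, text in samples.items()}

for name, found in hits.items():
    print(name, found)
```

Each assertion is checked on its own, so a stray bullet that touches only one of the two tags is skipped along with the real scene breaks.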
I'm not sure I understand the question. You haven't made your look-behind or look-ahead assertions optional, so both should already be required. Your expression should already be logically correct. But I'm not certain whether you're looking for instances where the bullet IS or ISN'T enclosed with p tags.
IS:
Code:
(?<=<p class="calibre1">)•(?=</p>)
ISN'T:
Code:
(?<!<p class="calibre1">)•(?!</p>) |
Quote:
Code:
<p class="calibre1">•A</p> |
Quote:
After securing the passages with the bullets you want to keep this way, it should be easy to exchange your scan-errors with a simple s/r. |
Quote:
Do you want it to exclude any bullet that occurs anywhere in any string that is enclosed by those "p class='calibre1'" tags? That could prove pretty tough, if so. I know I can't get my head around the expression to accomplish that (not that THAT renders it impossible by any means ;) ). I'd say your best bet is to isolate/alter the scene-breaks first and then catch any possible OCR glitches in a subsequent search. |
I would do two searches, first for [^>]• and then for •[^<].
Neither of them should find <p class="calibre1">•</p> |
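A rough sketch of the two-search idea, written with Python's standard `re` module as a stand-in for the editor's search (an assumption). The helper name `two_pass_hits` and the sample strings are invented for illustration.

```python
import re

# Search 1: a bullet with something other than '>' directly before it.
before = re.compile(r'[^>]•')
# Search 2: a bullet with something other than '<' directly after it.
after = re.compile(r'•[^<]')


def two_pass_hits(text):
    """Return True if either of the two searches finds a bullet in text."""
    return bool(before.search(text) or after.search(text))


results = {
    "scene_break": two_pass_hits('<p class="calibre1">•</p>'),
    "trailing": two_pass_hits('<p class="calibre1">A•</p>'),
    "leading": two_pass_hits('<p class="calibre1">•A</p>'),
    # A bullet wrapped tightly in some other tag slips past both searches.
    "wrapped": two_pass_hits('<p>text <span>•</span> more text</p>'),
}

for name, found in results.items():
    print(name, found)
```

The limitation shows up in the last sample: any bullet sitting directly between a `>` and a `<` is invisible to both searches, whatever the surrounding tags are.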
Quote:
Code:
<p class="calibre1">A•</p>
Code:
<p class="calibre1">•</p>
Code:
<p>Here's some text <span>•</span>, ain't life grand?</p> |
That's tough. I can't think of a one-time expression that will accomplish that.
Seriously... I'd change all occurrences of
Code:
<p class="calibre1">•</p>
to
Code:
<p class="calibre1">_@</p>
WS64's two-pass suggestion could work too (and wouldn't require altering the existing scene breaks). EDIT: Never mind the alternate suggestions if you already worked around the problem. I understand looking for that "one regexp to bind them all", but I'm just not sure it's worth an extended quest for this one. ;) |
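The placeholder approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `strip_stray_bullets` is made up, the "_@" marker is assumed not to occur anywhere else in the book, and stray bullets are simply deleted because the thread never says what they should become.

```python
SCENE_BREAK = '<p class="calibre1">•</p>'
PLACEHOLDER = '<p class="calibre1">_@</p>'  # assumed to be unused elsewhere


def strip_stray_bullets(html):
    """Protect scene breaks, delete every remaining bullet, then restore."""
    # Step 1: swap real scene breaks for the placeholder.
    protected = html.replace(SCENE_BREAK, PLACEHOLDER)
    # Step 2: every bullet still present is assumed to be a scan artifact.
    cleaned = protected.replace('•', '')
    # Step 3: put the scene breaks back.
    return cleaned.replace(PLACEHOLDER, SCENE_BREAK)


sample = (
    '<p class="calibre1">A•</p>\n'
    '<p class="calibre1">•</p>\n'
    '<p class="calibre1">•B</p>'
)
result = strip_stray_bullets(sample)
print(result)
```

Three plain replacements are easier to check by eye than one clever expression, which is the point of the suggestion.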
Quote:
I would follow the advice of mmat1 and temporarily replace all scene-break '•'s, then go off fixing all the other • with a normal search/replace. If you want to be lazy, I narrowed it down to two regexes. You can Search/Replace each of them with \1\2:
Code:
(<p class="calibre1">[^•]*)•([^<]+</p>)
Red (the first group): "In a p with class calibre1", grab 0 or more characters that are NOT '•'.
Middle part: Finds the •.
Blue (the second group): Grabs the rest until it hits a </p>.
Code:
([^>])•(</p>)
By the way, here are the example sentences I came up with and was working on:
Code:
<p class="calibre1">•</p> |
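Here is a small sketch of those two search/replace passes using Python's standard `re` module (an assumption; the thread's expressions were written for the editor's own search). The helper name `remove_stray` and the sample strings are invented for illustration.

```python
import re

# Pass 1: a bullet inside a calibre1 paragraph with at least one character
# between it and the closing tag.
inside = re.compile(r'(<p class="calibre1">[^•]*)•([^<]+</p>)')
# Pass 2: a bullet sitting right before </p> with a non-'>' character before it.
at_end = re.compile(r'([^>])•(</p>)')


def remove_stray(text):
    """Run both passes, replacing each match with \\1\\2 (bullet dropped)."""
    text = inside.sub(r'\1\2', text)
    return at_end.sub(r'\1\2', text)


leading = remove_stray('<p class="calibre1">•A</p>')
trailing = remove_stray('<p class="calibre1">A•</p>')
middle = remove_stray('<p class="calibre1">one•two</p>')
scene_break = remove_stray('<p class="calibre1">•</p>')
double = remove_stray('<p class="calibre1">a•b•c</p>')

print(leading, trailing, middle, scene_break, double, sep='\n')
```

Note that one run removes at most one bullet per paragraph in pass 1, so a paragraph holding several stray bullets needs the replace repeated until nothing changes.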
Something like this perhaps?
Code:
(?!<p[^<>]*>•</p>)<([^\s]+)[^<>]*>[^<>]*•[^<>]*</\1> |
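To see what this expression actually grabs, here is a quick check with Python's standard `re` module (an assumption, since the thread targets the editor's search). The sample strings are the ones posted earlier in the thread.

```python
import re

# Any element whose text contains a bullet, unless the element is exactly
# <p ...>•</p>. The \1 backreference makes the closing tag match the opening one.
pattern = re.compile(r'(?!<p[^<>]*>•</p>)<([^\s]+)[^<>]*>[^<>]*•[^<>]*</\1>')

scene_break = pattern.search('<p class="calibre1">•</p>')
trailing = pattern.search('<p class="calibre1">A•</p>')
wrapped = pattern.search(
    "<p>Here's some text <span>•</span>, ain't life grand?</p>"
)

print(scene_break)        # no match object for a bare scene break
print(trailing.group(0))  # the whole paragraph element
print(wrapped.group(0))   # only the innermost element holding the bullet
```

The match covers the whole element rather than the bullet alone, so the bullet still has to be located inside each hit before it can be fixed.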
Just for the record, DD's suggestion is closest to what I've done in the past... the only difference being in the symbol I used as a replacement (the fleur de lis instead of the "_@"). The reason I didn't include any other examples is because I thought that saying I want to exclude all instances of a bullet when it appears like this:
Code:
<p class="calibre1">•</p>
In any case, I thought I'd float it here and see if anyone bit. I've run into the problem more than once in the past, and knew I'd basically totally played out my own knowledge of reg-ex. I really do appreciate you all rolling it around and giving it the old college try. PS - @Tex: as always I especially appreciate your breaking down your thought process and compartmentalizing the behavioral characteristics of your regex. Frankly, it's the other reason I posted my question: I've come to realize that whether or not I get the specific answer I'm looking for, I'll always come out of the thread knowing more about reg-ex than I did going into it. |
Quote:
It would also help save a little time for people who want to help, and make for more accurate regexes. :)
Quote:
I thought that \1 could only be used in the Replace, but in Serpentine's example I see it can be used in a Search as well. |
@ElMiko: This pattern should find all instances of the bullet character, excluding only the ones that you do not want.
Code:
(<p[^>]*?>)?\K•(?(1)(*SKIP)(?!</p>))

Breaking it down:
(<p[^>]*?>)? : Match 1 or 0 (also called an optional match) opening tags of a p element, and capture it if it exists.
\K : Resets the start of the match (anything matched before this will not be included in the match result).
• : The match we are interested in; in our case it is the bullet character alone.
(?(1) : If we have successfully found an opening p tag in the first capture group, process the following, otherwise ignore it altogether:
(*SKIP) : advance the starting point of the next search iteration to here,
(?!</p>) : fail the match if we find the closing tag of the p element at this point.
) : closes the conditional.

Some more (wordy) explanations of the parts:

(?(1)yes-pattern) : This is a form of conditional subpattern. It searches for the yes-pattern if the condition is true; otherwise it has no effect. The condition in this form is a check on whether the first capture group in our complete pattern has matched. In our case that group is the optional match at the beginning of the pattern. Read it as "If the first capture group has matched anything, then look for yes-pattern here; if the first capture group is empty (remember that we used an optional match, so no match was acceptable to our pattern), then ignore this whole subpattern." The yes-pattern in our example is (*SKIP)(?!</p>).

(*SKIP) : Normally, when a match fails at any point, the starting point in the source that will be tried next is advanced by one character. The (*SKIP) backtracking control verb instead causes the starting point to be advanced to where the (*SKIP) verb was encountered during matching, if the pattern as a whole fails. This implies two things are needed for (*SKIP) to have an effect: 1) everything before it has already matched successfully, 2) something in the part of the pattern that comes after it caused the match to fail.

(?!</p>) : This is a classic negative lookahead. If the subpattern is found at this point in the search, the pattern fails. In our case it looks for </p>, and if that is found here, (?!</p>) fails the match, which in turn causes (*SKIP) to take effect, and the starting point for the next search is advanced to the (*SKIP) point, which is the character just after the bullet.

Here is a templated form (the whitespace is only for ease of reading):
(negative_lookbehind_pattern_simulator)? \Kwanted_pattern (?(1) (*SKIP) (?!negative_lookahead_pattern))

Note that unlike regular lookbehind patterns, this form allows for indefinite-length matches, because technically it is not a lookbehind assertion but a basic match pattern that starts searching from the current character position; it behaves like a negative lookbehind with the help of the later conditional part of the template. This is the reason why I have labelled this pattern a "simulator".

Corollary: Here is a modified form of this template that one can use if only a negative lookbehind alternative is needed, which allows for indefinite-length matches.
(negative_lookbehind_pattern_simulator)? \Kwanted_pattern (?(1) (*SKIP) (?!))

Here we use the negative lookahead (?!) with an empty string as the subpattern. The empty-string pattern always matches, hence its negative assertion always fails, hence the (*SKIP) effect is achieved regardless of what comes after it, as long as the first capture group is not empty. The reason for the existence of (*SKIP) in the pattern is left as an exercise for the reader. The reason why I am leaving it as an exercise is a sudden lazy spell. |
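For readers trying this outside a PCRE-based editor: Python's standard `re` module does not support `\K` or `(*SKIP)`, so the pattern above cannot be pasted in as-is. The sketch below swaps in a different technique that gives roughly the same result for this particular problem: an alternation whose first branch consumes the unwanted scene break and whose second branch captures any other bullet. The function name `stray_bullet_positions` and the sample text are invented for illustration.

```python
import re

# Branch 1 swallows a bullet that is the sole content of a p element.
# Branch 2 captures any bullet that branch 1 did not swallow.
pattern = re.compile(r'<p[^>]*>•</p>|(•)')


def stray_bullet_positions(text):
    """Return the index of every bullet that is not a bare scene break."""
    return [
        m.start(1)
        for m in pattern.finditer(text)
        if m.group(1) is not None  # group 1 is unset when branch 1 matched
    ]


text = '<p class="calibre1">•</p><p class="calibre1">A•</p>'
positions = stray_bullet_positions(text)
print(positions)
```

Because the search moves left to right and the scene-break branch is tried first, a scene break is consumed whole and its bullet is never offered to the capturing branch. This is not the same mechanism as the `(*SKIP)` pattern, only a way to get a comparable list of hits in an engine that lacks those features.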
MobileRead.com is a privately owned, operated and funded community.