07-17-2012, 11:34 AM | #1 |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
How to exclude strings before and after
I'm trying to find instances of a character ("•") in a document where it might pop up as a scanning artifact rather than an intended character. Unfortunately, "•" is also used as a section break. What I'd like to do is search for instances of "•" when it is not enclosed by <p> tags.
My first impulse was to do a search for:
Code:
(?<!<p class="calibre1">)•(?!</p>)
Last edited by ElMiko; 03-08-2013 at 07:54 PM. |
07-17-2012, 11:59 AM | #2 |
Grand Sorcerer
Posts: 27,463
Karma: 192992430
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I'm not sure I understand the question. You haven't made your look-behind or look-ahead assertions optional, so both should already be required. Your expression should already be logically correct. But I'm not certain whether you're looking for instances where the bullet IS or ISN'T enclosed with p tags.
IS:
Code:
(?<=<p class="calibre1">)•(?=</p>)
ISN'T:
Code:
(?<!<p class="calibre1">)•(?!</p>) |
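The two forms above can be compared with a short Python sketch. This is only an illustration using Python's standard `re` module (the thread itself is about an editor's search box); the `classify` helper and the sample lines are assumptions made up for the demonstration. It suggests why the negative form misses bullets that touch only one of the two tags: both negative assertions have to hold at once.

```python
import re

# The two patterns from the post above, unchanged.
ENCLOSED = re.compile(r'(?<=<p class="calibre1">)•(?=</p>)')
NOT_ENCLOSED = re.compile(r'(?<!<p class="calibre1">)•(?!</p>)')


def classify(line):
    """Report which of the two patterns finds a bullet in `line` (hypothetical helper)."""
    return {
        "enclosed": bool(ENCLOSED.search(line)),
        "not_enclosed": bool(NOT_ENCLOSED.search(line)),
    }


if __name__ == "__main__":
    samples = [
        '<p class="calibre1">•</p>',        # scene break
        '<p class="calibre1">•A</p>',       # bullet right after the opening tag
        '<p class="calibre1">A•</p>',       # bullet right before the closing tag
        '<p class="calibre1">A •BC D</p>',  # bullet in the middle of the text
    ]
    for sample in samples:
        print(sample, classify(sample))
```

In this sketch, the second and third samples are found by neither pattern, because each bullet is adjacent to exactly one of the tags.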
07-17-2012, 12:44 PM | #3 | |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Quote:
Code:
<p class="calibre1">•A</p>
<p class="calibre1">A•</p>
Last edited by ElMiko; 07-17-2012 at 01:02 PM. |
|
07-17-2012, 01:10 PM | #4 | |
Berti
Posts: 1,196
Karma: 4985964
Join Date: Jan 2012
Location: Zischebattem
Device: Acer Lumiread
|
Quote:
After securing the passages with the bullets you want to keep this way, it should be easy to exchange your scan-errors with a simple s/r. |
|
07-17-2012, 01:15 PM | #5 | |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Quote:
Last edited by ElMiko; 07-17-2012 at 01:19 PM. |
|
07-17-2012, 01:57 PM | #6 | |
Grand Sorcerer
Posts: 27,463
Karma: 192992430
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Do you want it to exclude any bullet that occurs anywhere in any string that is enclosed by those "p class='calibre1'" tags? That could prove pretty tough, if so. I know I can't get my head around the expression to accomplish that (not that THAT renders it impossible by any means). I'd say your best bet is to isolate/alter the scene-breaks first and then catch any possible OCR glitches in a subsequent search.
Last edited by DiapDealer; 07-17-2012 at 02:11 PM. |
|
07-17-2012, 02:08 PM | #7 |
♫
Posts: 660
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
|
I would do 2 searches, first for [^>]• and then for •[^<].
Neither of them should find <p class="calibre1">•</p> |
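A minimal sketch of the two-search idea, assuming Python's standard `re` module stands in for the editor's search. The `two_pass_hits` helper is a hypothetical name introduced only for this illustration.

```python
import re

# The two searches suggested above.
BEFORE = re.compile(r'[^>]•')  # a bullet preceded by anything except '>'
AFTER = re.compile(r'•[^<]')   # a bullet followed by anything except '<'


def two_pass_hits(line):
    """Return the hits of the first and the second search as two lists."""
    return BEFORE.findall(line), AFTER.findall(line)


if __name__ == "__main__":
    for sample in (
        '<p class="calibre1">•</p>',
        '<p class="calibre1">•A</p>',
        '<p class="calibre1">A•</p>',
        '<p class="calibre1">A •BC D</p>',
    ):
        print(sample, two_pass_hits(sample))
```

Note that each search also consumes the neighbouring character, so a replacement would need to put that character back.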
07-17-2012, 07:43 PM | #8 | |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Quote:
Code:
<p class="calibre1">A•</p>
<p class="calibre1">•A</p>
<p class="calibre1">A •BC D</p>
Code:
<p class="calibre1">•</p>
Code:
<p>Here's some text <span>•</span>, ain't life grand?</p>
or
<p>here's some more <i>•</i> text</p>
Last edited by ElMiko; 07-17-2012 at 07:57 PM. |
|
07-17-2012, 08:01 PM | #9 |
Grand Sorcerer
Posts: 27,463
Karma: 192992430
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
That's tough. I can't think of a one-time expression that will accomplish that.
Seriously... I'd change all occurrences of
Code:
<p class="calibre1">•</p>
to
Code:
<p class="calibre1">_@</p>
WS64's two-pass suggestion could work too (and wouldn't require altering the existing scene breaks).
EDIT: Never mind the alternate suggestions if you already worked around the problem. I understand looking for that "one regexp to bind them all", but I'm just not sure it's worth an extended quest for this one.
Last edited by DiapDealer; 07-17-2012 at 08:05 PM. |
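The placeholder approach could be sketched in Python roughly as follows. Several things here are assumptions for the sake of illustration: that stray bullets should simply be deleted, that the scene breaks are restored afterwards, and that "_@" never occurs elsewhere in the text. The function name is hypothetical.

```python
SCENE_BREAK = '<p class="calibre1">•</p>'
PLACEHOLDER = '<p class="calibre1">_@</p>'  # assumed not to occur in the book


def remove_stray_bullets(html):
    """Protect scene breaks, drop every other bullet, then restore the breaks."""
    protected = html.replace(SCENE_BREAK, PLACEHOLDER)  # step 1: hide scene breaks
    cleaned = protected.replace('•', '')                # step 2: plain search/replace
    return cleaned.replace(PLACEHOLDER, SCENE_BREAK)    # step 3: bring them back


if __name__ == "__main__":
    page = '<p class="calibre1">•</p>\n<p class="calibre1">A•</p>'
    print(remove_stray_bullets(page))
```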
07-17-2012, 08:30 PM | #10 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
I would follow the advice of mmat1 and temporarily replace all scene-break '•'s, then go off fixing all the remaining • with a normal search/replace.
If you want to be lazy, I narrowed it down to two regexes. You can Search/Replace each with \1\2:
Code:
(<p class="calibre1">[^•]*)•([^<]+</p>)
Red (first group): "In a p with class calibre1" grab 0 or more characters that are NOT '•'.
Middle part: Finds the •.
Blue (second group): Grabs the rest until it hits a </p>.
Code:
([^>])•(</p>)
By the way, here are the example sentences I came up with and was working on:
Code:
<p class="calibre1">•</p>
<p class="calibre1">•A</p>
<p class="calibre1">A•</p>
<p class="calibre1">This has none.</p>
<p class="calibre1">This has none.</p>
<p class="calibre1">• This is a test sentence.</p>
<p class="calibre1">•</p>
<p class="calibre1">This is a test sentence. •</p>
<p class="calibre1">This is two test sentences. • This is two test sentences.</p>
Last edited by Tex2002ans; 07-17-2012 at 08:33 PM. |
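For anyone who wants to try the two regexes outside the editor, here is a small Python sketch. It assumes Python's `re` module behaves like the editor's engine for these two patterns and applies them one line at a time; `strip_bullets` is a name invented for the example.

```python
import re

# First regex: a bullet followed by at least one more character before </p>.
INNER = re.compile(r'(<p class="calibre1">[^•]*)•([^<]+</p>)')
# Second regex: a bullet sitting directly in front of </p>, after some text.
TRAILING = re.compile(r'([^>])•(</p>)')


def strip_bullets(line):
    """Apply both search/replace passes, keeping \\1\\2 and dropping the bullet."""
    line = INNER.sub(r'\1\2', line)
    return TRAILING.sub(r'\1\2', line)


if __name__ == "__main__":
    for sample in (
        '<p class="calibre1">•</p>',
        '<p class="calibre1">•A</p>',
        '<p class="calibre1">A•</p>',
        '<p class="calibre1">This is a test sentence. •</p>',
    ):
        print(strip_bullets(sample))
```

The scene break stays untouched because its bullet is followed immediately by `<` and preceded immediately by `>`. Whitespace around a removed bullet is left as it was.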
|
07-17-2012, 08:32 PM | #11 |
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
|
Something like this perhaps?
Code:
(?!<p[^<>]*>•</p>)<([^\s]+)[^<>]*>[^<>]*•[^<>]*</\1> |
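A quick way to see what this pattern picks up is a Python sketch like the one below. It assumes Python's `re` module, which accepts the pattern as written, including the `\1` backreference inside the search; `find_bullet_element` is a hypothetical helper name.

```python
import re

# The pattern from the post above, unchanged. \1 refers back to the tag name
# captured by ([^\s]+), so the closing tag must match the opening one.
PATTERN = re.compile(r'(?!<p[^<>]*>•</p>)<([^\s]+)[^<>]*>[^<>]*•[^<>]*</\1>')


def find_bullet_element(line):
    """Return the first element whose text contains a bullet, or None."""
    match = PATTERN.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in (
        '<p class="calibre1">•</p>',
        '<p class="calibre1">A•</p>',
        "<p>Here's some text <span>•</span>, ain't life grand?</p>",
    ):
        print(sample, '->', find_bullet_element(sample))
```

The match is the whole innermost element around the bullet, not the bullet alone, so a replacement would have to rebuild the element.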
07-17-2012, 08:39 PM | #12 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
|
07-18-2012, 03:57 AM | #13 |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Just for the record, DD's suggestion is closest to what I've done in the past... the only difference being in the symbol I used as a replacement (the fleur-de-lis instead of the "_@"). The reason I didn't include any other examples is because I thought that saying I want to exclude all instances of a bullet when it appears like this:
Code:
<p class="calibre1">•</p>
In any case, I thought I'd float it here and see if anyone bit. I've run into the problem more than once in the past, and knew I'd basically totally played out my own knowledge of reg-ex. I really do appreciate you all rolling it around and giving it the old college try.
PS - @Tex: as always, I especially appreciate your breaking down your thought process and compartmentalizing the behavioral characteristics of your regex. Frankly, it's the other reason I posted my question: I've come to realize that whether or not I get the specific answer I'm looking for, I'll always come out of the thread knowing more about reg-ex than I did going into it.
Last edited by ElMiko; 07-18-2012 at 04:03 AM. |
07-18-2012, 04:25 AM | #14 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
It would also help save a little time for people who want to help, and lead to more accurate regexes.
Quote:
I thought that \1 could only be used in the Replace, but in Serpentine's example I see it can be used in a Search as well. |
||
07-21-2012, 06:34 PM | #15 |
Connoisseur
Posts: 54
Karma: 37363
Join Date: Aug 2011
Location: Istanbul
Device: EBW1150, Nook STR
|
@ElMiko: This pattern should find all instances of the bullet character, excluding only the ones that you do not want.
Code:
(<p[^>]*?>)?\K•(?(1)(*SKIP)(?!</p>))
(<p[^>]*?>)? : Match 1 or 0 (also called an optional match) p element opening tag, and capture it if it exists.
\K : Resets the start of the match (anything matched before this will not be included in the match result).
• : The match we are interested in; in our case it is the bullet character alone.
(?(1) : If we have successfully found an opening p tag in the first capture group, process the following, otherwise ignore it altogether:
(*SKIP) : advance the starting point of the next search iteration to here,
(?!</p>) : if we find the closing tag of the p element at this point (this is also what fails the match).
)
Some more (wordy) explanations of the parts:
(?(1)yes-pattern) : This is a form of conditional subpattern. It searches for the yes-pattern if the condition is true, otherwise it has no effect. The condition in this form is whether the first capture group in our complete pattern has matched. In our case that group is the optional match at the beginning of the pattern. Read as "If the first capture group has matched anything then look for yes-pattern here; if the first capture group is empty (remember that we have used an optional match, so no-match was acceptable to our pattern) then ignore this whole subpattern." The yes-pattern in our example is (*SKIP)(?!</p>).
(*SKIP) : Normally when a match fails at any point, the starting point in the source that will be tried next for the pattern is advanced by one character. The (*SKIP) backtracking control verb causes the starting point to be advanced to where the (*SKIP) verb was encountered during matching, if the pattern as a whole fails to match. This implies two things are needed for the (*SKIP) verb to have an effect: 1) everything before it has already matched successfully, 2) something in the part of the pattern that comes after it caused the match to fail.
(?!</p>) : This is a classic negative lookahead pattern. If the subpattern is found at this point in the search, it causes the pattern to fail. In our case it looks for </p>, and if this is found here, (?!</p>) causes the match to fail, which in turn causes the (*SKIP) to have its effect, and the starting point for the next search is advanced to the (*SKIP) point, which is the character just after the bullet, •.
Here is a templated form (whitespace here is for ease of reading):
(negative_lookbehind_pattern_simulator)? \Kwanted_pattern (?(1) (*SKIP) (?!negative_lookahead_pattern))
Note that unlike regular lookbehind patterns, this form allows for indefinite-length matches, because technically it is not a lookbehind assertion but a basic match pattern that starts searching from the current character position; it behaves like a negative lookbehind with the help of the later conditional part of the template. This is the reason why I have labelled this pattern a "simulator".
Corollary: Here is a modified form of this template one can use if only a negative lookbehind alternative is needed, which allows for indefinite-length matches.
(negative_lookbehind_pattern_simulator)? \Kwanted_pattern (?(1) (*SKIP) (?!))
Here we use the (?!) negative lookahead with an empty string as the subpattern. The empty string pattern always matches, hence its negative assertion always fails to match, hence the (*SKIP) effect is achieved regardless of what comes after it, as long as the first capture group is not empty.
The reason for the existence of the (*SKIP) in the pattern is left as an exercise to the reader. The reason why I am leaving it as an exercise is a sudden lazy spell. |
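Python's standard `re` module does not support `\K` or `(*SKIP)`, so the pattern above cannot be run there as written. The sketch below swaps in a different technique: an alternation whose left branch consumes a whole scene break and whose right branch captures any other bullet. It is offered only as an approximation that appears to find the same bullets on the samples used in this thread; the helper name is hypothetical.

```python
import re

# Swapped-in technique (not the PCRE pattern itself): the left branch eats an
# entire <p ...>•</p> scene break, so only bullets reached by the right branch
# end up in capture group 1.
SCENE_OR_STRAY = re.compile(r'<p[^>]*>•</p>|(•)')


def stray_bullet_offsets(text):
    """Offsets of every bullet that is not an exact <p ...>•</p> scene break."""
    return [m.start(1) for m in SCENE_OR_STRAY.finditer(text) if m.group(1)]


if __name__ == "__main__":
    text = (
        '<p class="calibre1">•</p>'
        '<p class="calibre1">•A</p>'
        '<p class="calibre1">A•</p>'
    )
    print(stray_bullet_offsets(text))
```

The design choice is the same as in the template: something that matches in front of the wanted pattern decides whether the bullet counts, and it may be of any length.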