Old 09-23-2010, 07:48 PM   #44
ldolse
I think moving the flags discussion to the front brings up a rather advanced topic a little early. The ignore-case flag is handy, but it can be worked around easily enough, and re.DOTALL is only useful in specific cases. I'd put them at the end as an addendum.
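To illustrate what I mean, here is a minimal sketch in plain Python, assuming the same re syntax that the re.DOTALL flag comes from. The sample text and variable names are made up for the illustration.

```python
import re

text = "HELLO, world!\nSecond line"

# Working around ignore-case: spell out both cases with character classes.
manual = re.search(r"[Hh][Ee][Ll][Ll][Oo]", text)

# The same match using the inline ignore-case flag.
flagged = re.search(r"(?i)hello", text)

# DOTALL only matters when '.' has to cross a line break.
without_dotall = re.search(r"world.*line", text)   # '.' stops at the newline
with_dotall = re.search(r"(?s)world.*line", text)  # '.' also matches the newline

print(manual.group(0), flagged.group(0))
print(without_dotall)
print(repr(with_dotall.group(0)))
```

Most beginner patterns never need '.' to span lines, which is why I'd treat the flags as an addendum.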

Also, the same example appears twice:
Code:
Hello, World!(?is)
I think you meant the second to be
Code:
(?is)Hello, World!
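A quick sketch of the flags-first form in Python, with a made-up sample string. As far as I know, recent Python versions reject a global inline flag that is not at the very start of the pattern, so treat the leading position as the safe one.

```python
import re

sample = "hello,\nworld! HELLO, WORLD!"

# Flags go first: (?i) ignores case, (?s) lets '.' match a newline.
front = re.compile(r"(?is)Hello,.World!")

# Both the lower-case, line-broken copy and the upper-case copy match.
matches = front.findall(sample)
print(matches)
```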
Expanding the header removal section would probably help. Right now it basically just says to grab everything between the <p></p> tags, which, interpreted most simply, would match every single line in the book. I think you could continue working with that example and build the regex piece by piece.

This was the first example:
Code:
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was
If you look at that simple example, you might create a regex that looks like this:
Code:
<p[^>]*>\s*.*?Generated\s+by\s+ABC?\s+Amber.*?</p>
This will work well with that single example, but it's really important to look across the entire book and see what the pattern matches with the Magic Wand. As I noted, the actual book has other occurrences where that regex would do very bad things. Here is an example:
Code:
<p class="calibre4">I looked directly at him for a moment. His eyes were still brown. He caught me looking, and I looked down at my desk.</p>
<p class="calibre4">Willie laughed, a wheezing <b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b>snicker of a sound. The laugh hadn't changed. "Geez, I love it. You're afraid of me."</p>
<p class="calibre4">"Not afraid, just cautious."</p>
In that example the header has been injected into the middle of an actual paragraph, so you can't just get rid of everything between the <p></p> tags. We need to match on the surrounding bold tags instead:
Code:
<b[^>]*>.*?Generated\s+by\s+ABC?\s+Amber.*?</b>
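To see the difference between the two patterns, here is a rough sketch in Python. The sample strings are shortened copies of the two excerpts above, the variable names are mine, and I've written the tag openers as `<p[^>]*>` and `<b[^>]*>` so that they match tags carrying a class attribute.

```python
import re

header = ('<b class="calibre2">Generated by ABC Amber LIT Conv'
          '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
          'erter, http://www.processtext.com/abclit.html</a></b>')

# First excerpt: the header sits in a paragraph of its own.
own_para = ('New laws don\'t change that." </p><p class="calibre4">\n'
            + header +
            '</p><p class="calibre4">\n'
            'It had only been two years since Addison v. Clark.')

# Second excerpt: the header is injected in the middle of real text.
mid_para = ('<p class="calibre4">Willie laughed, a wheezing ' + header +
            'snicker of a sound. The laugh hadn\'t changed.</p>')

para_rx = re.compile(r'<p[^>]*>\s*.*?Generated\s+by\s+ABC?\s+Amber.*?</p>')
bold_rx = re.compile(r'<b[^>]*>.*?Generated\s+by\s+ABC?\s+Amber.*?</b>')

# The paragraph pattern is fine when the header has its own paragraph...
own_cleaned = para_rx.sub('', own_para)
# ...but it deletes the whole paragraph of story text in the second excerpt.
para_result = para_rx.sub('', mid_para)
# The bold-tag pattern removes only the header and keeps the story text.
bold_result = bold_rx.sub('', mid_para)

print(repr(para_result))
print(bold_result)
```

Note that on the first excerpt the bold-tag pattern leaves an empty paragraph behind, so that case may still need a separate cleanup step.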
The final regex I proposed in the thread you linked was a bit more complicated, for a couple of reasons. I tried to make it generic so that it also supports PDF. I also try to stay away from .* because it can easily match unintended text, which meant writing extra patterns to accommodate the rest of the header. Finally, there seem to be different versions of the Amber tool creating slightly different header variations, so I tried to cover all the ones I'd seen examples of.
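As an illustration of why I avoid .*, here is a hypothetical sketch in Python. The tighter pattern below is not the regex from the linked thread, just a simplified stand-in that swaps the wildcards for `[^<]*` and an explicit optional link. The sample line with an extra bold word is invented for the demonstration.

```python
import re

header = ('<b class="calibre2">Generated by ABC Amber LIT Conv'
          '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
          'erter, http://www.processtext.com/abclit.html</a></b>')

# Invented line: real bold text appears before the injected header.
line = '<p><b>Willie</b> laughed, a wheezing ' + header + 'snicker of a sound.</p>'

# Wildcard version: '.*?' may start at an earlier <b> and swallow real text.
loose_rx = re.compile(r'<b[^>]*>.*?Generated\s+by\s+ABC?\s+Amber.*?</b>')

# Tighter version: no '.', only "anything that is not the start of a tag".
tight_rx = re.compile(
    r'<b\b[^>]*>\s*Generated\s+by\s+ABC\s+Amber\s+[^<]*'  # header text up to the next tag
    r'(?:<a\b[^>]*>[^<]*</a>\s*)?'                        # optional link around the URL
    r'[^<]*</b>'                                          # trailing text, then the closing tag
)

loose_out = loose_rx.sub('', line)
tight_out = tight_rx.sub('', line)

print(loose_out)
print(tight_out)
```

The wildcard version starts matching at the first bold tag on the line and takes "Willie laughed, a wheezing" with it, while the tighter version can only start at a bold tag that is immediately followed by the header text.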