Quote:
Originally Posted by Tex2002ans
Well, those are the exact Regexes that I use. By the time I run them, I already have a stripped/clean source, not one riddled with "calibre#" and who knows what other classes! Perhaps I was just subtly recommending that all those classes be cleaned up as well!
I personally wouldn't recommend the one that handles every <p class="">, because who knows what a given calibre# corresponds to: "calibre2" could be your typical paragraph, but "calibre3" could be a blockquote (extra margin on the left), "calibre4" could be right alignment, "calibre5" could be small font, etc.
Example: That "all classes" Regex would break in these cases. Instead of using a <blockquote> tag, the book might have used something along these lines:
Code:
<p>This is a quote from Tex2002ans</p>
<p class="blockquote1">This is a sample blockquote sentence.</p>
<p class="blockquote2">This is some more sentences.</p>
<p class="blockquote2">And this is the end.</p>
<p>Continue with the story.</p>
or the book might have had poetry:
Code:
<p class="poem">This is a poem,</p>
<p class="poem2">that is written by Tex.</p>
<p class="poem">This is a poem,</p>
<p class="poem2">that will break the Regex.</p>
So, long story short, clean up the classes first, then run the nice Regex once you know everything you are piecing together is actually a broken paragraph! 
Well, except for poems, such cases would be correctly joined into a single element, whatever it is, carrying the first <p>'s style.
And that is a rationale for double-checking each one, not for writing a regex that doesn't handle lots of stuff.
Alternatively, you can always do it your way... assuming you add another step for clearing up the classes.
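To make the joining behaviour concrete, here is a rough Python sketch. The pattern is my own stand-in for an "all classes" join (paragraph ends in a lowercase letter or comma, next one starts lowercase), not the exact Regex from the earlier posts, and the `broken` sample is made up for illustration.

```python
import re

# Hypothetical "all classes" join: the first paragraph ends in a lowercase
# letter or comma, the next one (with or without a class) starts lowercase.
JOIN_ANY_CLASS = re.compile(r'([a-z,])</p>\s*<p(?: class="[^"]*")?>([a-z])')

# A genuinely broken paragraph: joining keeps the first <p>'s class.
broken = (
    '<p class="blockquote2">This sentence was split</p>\n'
    '<p class="blockquote2">across two paragraphs.</p>'
)

# The poem from the quoted post: each line is deliberate, not broken.
poem = (
    '<p class="poem">This is a poem,</p>\n'
    '<p class="poem2">that is written by Tex.</p>\n'
    '<p class="poem">This is a poem,</p>\n'
    '<p class="poem2">that will break the Regex.</p>'
)

joined_broken = JOIN_ANY_CLASS.sub(r'\1 \2', broken)
joined_poem = JOIN_ANY_CLASS.sub(r'\1 \2', poem)

print(joined_broken)
print(joined_poem)
```

The broken paragraph comes out as one paragraph with the first class, which is what you want. The poem loses its "poem2" lines, which is why each match is worth eyeballing before you accept it.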
FWIW, I agree that my first step would be to clean up the styles, tossing out everything that wasn't very deliberate.
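As a sketch of what that cleanup step might look like, assuming you have already checked the stylesheet and decided which classes are just ordinary body text: the class name in `PLAIN_CLASSES` below is a placeholder, and the pattern only handles a bare `<p class="...">` with no other attributes.

```python
import re

# Hypothetical: classes you have checked and decided are plain body text.
# Anything not listed here is treated as deliberate and left alone.
PLAIN_CLASSES = {"calibre2"}

P_WITH_CLASS = re.compile(r'<p class="([^"]*)">')

def strip_plain_classes(html: str) -> str:
    """Replace <p class="X"> with <p> only when X is a known plain class."""
    def repl(match):
        return "<p>" if match.group(1) in PLAIN_CLASSES else match.group(0)
    return P_WITH_CLASS.sub(repl, html)

sample = (
    '<p class="calibre2">An ordinary paragraph that got</p>\n'
    '<p class="calibre2">split in two.</p>\n'
    '<p class="calibre3">A blockquote-style paragraph.</p>'
)

cleaned = strip_plain_classes(sample)
print(cleaned)
```

After this, a join Regex that only targets a plain `<p>` can't touch the "calibre3" paragraph by accident.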