12-23-2014, 02:44 PM   #24
eschwartz
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
Quote:
Originally Posted by Tex2002ans
Well, they are the exact Regexes that I use. By the time I run them, I already have a stripped, clean source, not one riddled with "calibre#" and who knows what other classes! Perhaps I was just subtly recommending that all those classes be cleaned up as well!

I personally wouldn't recommend the one that handles every <p class="">, because who knows what a given calibre# corresponds to: "calibre2" could be your typical paragraph, but "calibre3" could be a blockquote (extra margin on the left), "calibre4" could be right alignment, "calibre5" could be small font, etc.

Example: That "all classes" Regex would break in these cases. Instead of using a <blockquote> tag, the book might have used something along these lines:

Code:
<p>This is a quote from Tex2002ans</p>
<p class="blockquote1">This is a sample blockquote sentence.</p>
<p class="blockquote2">This is some more sentences.</p>
<p class="blockquote2">And this is the end.</p>
<p>Continue with the story.</p>
or the book might have had poetry:

Code:
<p class="poem">This is a poem,</p>
<p class="poem2">that is written by Tex.</p>
<p class="poem">This is a poem,</p>
<p class="poem2">that will break the Regex.</p>
So, long story short, clean up the classes first, then run the nice Regex once you know everything you are piecing together is actually a broken paragraph!

Well, except for poems, such cases would be correctly joined into one whatever-it-is that keeps the first p's style.

And that is a rationale for double-checking each match, not for writing a regex that doesn't handle lots of stuff.
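
To make that concrete, here is a minimal Python sketch of a join that accepts any class on the <p> and keeps the first p's attributes. It is not the Regex from this thread: the pattern, the function names, the "no sentence-final punctuation" heuristic, and the sample input are all assumptions for illustration. It only joins pairs, so a paragraph broken into three pieces would need a second pass.

```python
import re

# Hypothetical pattern (an assumption, not the thread's Regex): a <p> with any
# attributes whose text does not end in sentence-final punctuation, followed
# directly by another <p>. Only plain-text paragraphs (no inline tags) match.
BROKEN_PARA = re.compile(
    r'<p\b(?P<attrs>[^>]*)>(?P<first>[^<]*[^<.!?"\s])\s*</p>'
    r'\s*<p\b[^>]*>(?P<second>[^<]*)</p>'
)


def find_candidates(html):
    """List every pair that would be joined, so each can be double-checked."""
    return [m.group(0) for m in BROKEN_PARA.finditer(html)]


def join_broken(html):
    """Join each matched pair into one <p> carrying the first <p>'s attributes."""
    return BROKEN_PARA.sub(r'<p\g<attrs>>\g<first> \g<second></p>', html)


# Made-up sample: a blockquote paragraph split in mid-sentence.
sample = (
    '<p class="blockquote1">This is a sample blockquote</p>\n'
    '<p class="blockquote2">sentence that was split.</p>'
)
print(find_candidates(sample))
print(join_broken(sample))
```

The poem case is where this goes wrong: a line ending in a comma looks "broken" to the heuristic, so two verse lines get merged into one p with the first line's class. That is exactly the sort of match the review step is there to catch.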

Alternatively, you can always do it your way... assuming you add another step for clearing up the classes.


FWIW, I agree that my first step would be to clean up the styles, tossing out everything that wasn't very deliberate.
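
A rough sketch of that cleanup step in Python, assuming you have already looked at the stylesheet and decided which classes are throwaway. The function name and the `junk` set are hypothetical; nothing here guesses what a given calibre# means, it only drops the names you hand it and leaves everything else alone.

```python
import re

# Matches a <p> tag that has a double-quoted class attribute, capturing the
# attributes before it, the class list, and the attributes after it.
P_WITH_CLASS = re.compile(r'<p\b([^>]*?)\s+class="([^"]*)"([^>]*)>')


def strip_classes(html, junk):
    """Remove the class names in `junk` from <p> tags; keep deliberate ones."""
    def repl(match):
        before, classes, after = match.groups()
        kept = [name for name in classes.split() if name not in junk]
        # Drop the class attribute entirely when nothing deliberate is left.
        class_attr = ' class="' + " ".join(kept) + '"' if kept else ""
        return "<p" + before + class_attr + after + ">"

    return P_WITH_CLASS.sub(repl, html)


# Made-up sample: "calibre2" was judged to be plain body text.
print(strip_classes('<p class="calibre2">Body text.</p>', {"calibre2"}))
```

Once the throwaway classes are gone, whatever classes remain are the deliberate ones, which makes the matches from a join Regex much easier to review.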

Last edited by eschwartz; 12-23-2014 at 08:25 PM.