MobileRead Forums - View Single Post - How can I fix it when every line is a paragraph?

eschwartz · 12-23-2014, 02:44 PM

Quote:

Originally Posted by Tex2002ans

Well, they are the exact Regex that I use. By the time I run those, I already have a stripped/clean source, not one riddled with "calibre#" and who knows what other classes! Perhaps I was just subtly recommending that all those classes be cleaned up as well!

I personally wouldn't recommend the one that handles every <p class="">, because who knows what a given calibre# maps to ("calibre2" could be your typical paragraph, but "calibre3" could be a blockquote (extra margin on the left), "calibre4" could be right alignment, "calibre5" could be small font, etc.).

Example: That "all classes" Regex would break in these cases. Instead of using a <blockquote> tag, the book might have used something along these lines:

Code:

<p>This is a quote from Tex2002ans</p>
<p class="blockquote1">This is a sample blockquote sentence.</p>
<p class="blockquote2">This is some more sentences.</p>
<p class="blockquote2">And this is the end.</p>
<p>Continue with the story.</p>

or the book might have had poetry:

Code:

<p class="poem">This is a poem,</p>
<p class="poem2">that is written by Tex.</p>
<p class="poem">This is a poem,</p>
<p class="poem2">that will break the Regex.</p>

So, long story short, clean up the classes first, then run the nice Regex once you know everything you are piecing together is actually a broken paragraph!

Well, except for poems, such cases would be correctly joined into one element (whatever it is) that keeps the first <p>'s style.
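
To make that concrete, here is a minimal Python sketch of what an "all classes" joiner does to the quoted samples. The pattern and the name join_paragraphs are assumptions made up for illustration; the regex actually discussed in this thread is not reproduced in this post, and a real one would likely be pickier about which lines it joins.

```python
import re

# Hypothetical "all classes" pattern (an assumption, not the thread's regex):
# a closing </p>, optional whitespace, then the next opening <p>,
# with or without a class attribute.
SEAM = re.compile(r'</p>\s*<p(?:\s+class="[^"]*")?>')

def join_paragraphs(html: str) -> str:
    """Replace every seam between adjacent <p> elements with a space.

    Only the first opening tag survives, so the merged paragraph
    keeps the first <p>'s class.
    """
    return SEAM.sub(" ", html)

blockquote = (
    '<p class="blockquote1">This is a sample blockquote sentence.</p>\n'
    '<p class="blockquote2">This is some more sentences.</p>\n'
    '<p class="blockquote2">And this is the end.</p>'
)
poem = (
    '<p class="poem">This is a poem,</p>\n'
    '<p class="poem2">that is written by Tex.</p>'
)

joined_quote = join_paragraphs(blockquote)
joined_poem = join_paragraphs(poem)
print(joined_quote)
print(joined_poem)
```

The blockquote collapses into a single blockquote1 paragraph, which is arguably what you want. The poem collapses too, and there the line breaks were the whole point, so that one still needs a human to look at it.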

And that is a rationale for double-checking each one, not for writing a regex that doesn't handle lots of stuff.

Alternatively, you can always do it your way... assuming you add another step for clearing up the classes.

FWIW, I agree that my first step would be to clean up the styles, tossing out everything that wasn't very deliberate.
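
One way to start that cleanup is to take stock of which classes the paragraphs actually use before deciding what to toss. The following is only a sketch: class_inventory and the sample markup are invented for illustration, and it only looks at <p> tags whose first attribute is class.

```python
import re
from collections import Counter

# Captures the class value of <p class="..."> tags. Assumes class is the
# first attribute, which is typical of converted output but not guaranteed.
P_CLASS = re.compile(r'<p\s+class="([^"]*)"')

def class_inventory(html: str) -> Counter:
    """Count how often each class value appears on a <p> tag."""
    return Counter(P_CLASS.findall(html))

# Made-up sample markup, not taken from any real book.
sample = (
    '<p class="calibre2">First broken line</p>\n'
    '<p class="calibre2">second broken line.</p>\n'
    '<p class="calibre3">Something indented.</p>\n'
    '<p class="calibre2">Back to normal text.</p>\n'
    '<p>No class at all.</p>'
)

inventory = class_inventory(sample)
for name, count in inventory.most_common():
    print(f"{name}: {count}")
```

A class that shows up on nearly every paragraph is probably the body text style and safe to strip; the rare ones are the ones worth checking against the stylesheet before anything gets joined.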