Old 12-23-2014, 02:13 PM   #22
Tex2002ans
Wizard
Quote:
Originally Posted by Nyssa
Oh! Sorry, I thought I could choose between options. I didn't realize they were steps... Hyphens aren't a problem, so I figured I didn't need Regex #2 or its variant.
Ahhh, sorry, I am used to typing in the technical sections of MobileRead, and I make certain assumptions about the general knowledge that the user has (for example, already knowing what Regex is, and how to use/read it).

Next time, I will have to be even more specific. (Typically, I color-code all the sections of the Regex too!)

And yes, they have to be run in order: #1, then #2 (or its variant, depending on whether you want to do hyphenation fixes now or later), then #3.

I have a lot more Regex I recommend after that, although it might get a little too technical in here. (And it does take forever to write these things.)

Side Note: I convert a ton of non-fiction economics books from PDF -> EPUB, and deal with cleaning up a lot of crap. I mostly use those Regex to piece together lines/paragraphs that broke across pages, or were OCRed incorrectly.

Quote:
Originally Posted by eschwartz
Tex2002ans, your regexes don't account for classes, how could you!
Well, they are the exact Regex that I use. By the time I run those, I already have a stripped/clean source, not one riddled with "calibre#" and who knows what other classes! Perhaps I was just subtly recommending that all those classes be cleaned up as well!

I personally wouldn't recommend the one that handles every <p class="">, because who knows what a given calibre# corresponds to: "calibre2" could be your typical paragraph, but "calibre3" could be a blockquote (extra margin on the left), "calibre4" could be right alignment, "calibre5" could be small font, etc.

Example: That "all classes" Regex would break in these cases. Instead of using a <blockquote> tag, the book might have used something along these lines:

Code:
<p>This is a quote from Tex2002ans</p>
<p class="blockquote1">This is a sample blockquote sentence.</p>
<p class="blockquote2">This is some more sentences.</p>
<p class="blockquote2">And this is the end.</p>
<p>Continue with the story.</p>
or the book might have had poetry:

Code:
<p class="poem">This is a poem,</p>
<p class="poem2">that is written by Tex.</p>
<p class="poem">This is a poem,</p>
<p class="poem2">that will break the Regex.</p>
So, long story short, clean up the classes first, then run the nice Regex once you know everything you are piecing together is actually a broken paragraph!
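
To make the failure concrete, here is a rough Python sketch. The two patterns below are hypothetical stand-ins I made up for illustration; they are NOT Regex #1-#3 from earlier in the thread. Both assume a "broken" paragraph is one that ends in a lowercase letter or a comma. The only difference is that one also matches a <p> carrying any class:

```python
import re

# Hypothetical patterns, for illustration only (not the thread's Regex #1-#3).
# Assumption: a paragraph ending in a lowercase letter or comma is "broken".
PLAIN_ONLY = re.compile(r'([a-z,])</p>\s*<p>')
ANY_CLASS = re.compile(r'([a-z,])</p>\s*<p(?: class="[^"]*")?>')


def join_paragraphs(pattern, html):
    """Merge each matched pair of paragraphs into one, separated by a space."""
    return pattern.sub(r'\1 ', html)


# The poetry sample from the post above.
poem = (
    '<p class="poem">This is a poem,</p>\n'
    '<p class="poem2">that is written by Tex.</p>\n'
    '<p class="poem">This is a poem,</p>\n'
    '<p class="poem2">that will break the Regex.</p>'
)

# A genuinely broken paragraph with the classes already cleaned out.
broken = '<p>This sentence was split across</p>\n<p>two pages by the PDF.</p>'

print(join_paragraphs(PLAIN_ONLY, broken))  # the two halves get stitched together
print(join_paragraphs(PLAIN_ONLY, poem))    # poem is left alone
print(join_paragraphs(ANY_CLASS, poem))     # poem lines get merged, "poem2" styling is lost
```

The class-aware pattern cannot tell a verse line ending in a comma from a paragraph that broke across a page, which is why I would clean up the classes first.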

Quote:
Originally Posted by theducks
The code tips we toss out are generic solutions that may need to be honed to fit a specific case. Never fully trust a bulk (replace all) change.
Oh yeah, definitely don't press "Replace All" while running a Regex unless you know EXACTLY what it is doing. Even then, be sure to test it thoroughly.

Even though I trust Regex #2 and Regex #3 with my life, I still have them in my Sigil's Saved Searches under the heading, "One at a Time".
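
If you are scripting this outside of Sigil, the same "one at a time" habit can be roughly imitated. This is only a sketch: the pattern and the helper names (preview_matches, replace_selected) are hypothetical, and it is not how Sigil's Saved Searches work internally. It lists every match with some surrounding context, then replaces only the ones you approve:

```python
import re

# Hypothetical join pattern, for illustration only.
JOIN = re.compile(r'([a-z,])</p>\s*<p>')


def preview_matches(pattern, html, context=20):
    """Return (offset, snippet) for each match so each can be reviewed."""
    previews = []
    for m in pattern.finditer(html):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(html))
        previews.append((m.start(), html[start:end]))
    return previews


def replace_selected(pattern, html, replacement, approved):
    """Replace only the matches whose start offset is in `approved`."""
    def repl(m):
        # Expand backreferences for approved matches; leave the rest untouched.
        return m.expand(replacement) if m.start() in approved else m.group(0)
    return pattern.sub(repl, html)


sample = (
    '<p>The line broke</p>\n<p>here.</p>\n'
    '<p>Roses are red,</p>\n<p>violets are blue.</p>'
)

previews = preview_matches(JOIN, sample)
for offset, snippet in previews:
    print(offset, repr(snippet))

# Approve only the first match (the real broken line), skip the verse.
fixed = replace_selected(JOIN, sample, r'\1 ', {previews[0][0]})
print(fixed)
```

It is slower than "Replace All", but every change is one you actually looked at.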

Last edited by Tex2002ans; 12-23-2014 at 02:40 PM.