Quote:
Originally Posted by Nyssa
Oh! Sorry, I thought I could choose between options. I didn't realize they were steps... Hyphens aren't a problem, so I figured I didn't need Regex #2 or its variant.
|
Ahhh, sorry, I am used to typing in the technical sections of MobileRead, and I make certain assumptions about the general knowledge that the user has (for example, already knowing what Regex is, and how to use/read it).
Next time, I will have to be even more specific. (Typically, I color-code all the sections of the Regex too!)
And yes, they have to be run in order: #1, then #2 (or its variant, depending on whether you want to do hyphenation fixes now or later), then #3.
I have a lot more Regex I recommend after that, although it might get a little too technical in here. (And it does take forever to write these things.)
Side Note: I convert a ton of non-fiction economics books from PDF -> EPUB, and deal with cleaning up a lot of crap. I mostly use those Regex to piece together lines/paragraphs that broke across pages, or were OCRed incorrectly.
Quote:
Originally Posted by eschwartz
Tex2002ans, your regexes don't account for classes, how could you! 
|
Well, they are the exact Regex that I use. By the time I run those, I already have a stripped/clean source, not one riddled with "calibre#" and who knows what other classes! Perhaps I was just subtly recommending that all those classes be cleaned up as well!
I personally wouldn't recommend the one that handles every <p class="">, because who knows what a given calibre# corresponds to: "calibre2" could be your typical paragraph, but "calibre3" could be a blockquote (extra margin on the left), "calibre4" could be right alignment, "calibre5" could be small font, etc.
Example: That "all classes" Regex would break in these cases. Instead of using a <blockquote> tag, the book might have used something along these lines:
Code:
<p>This is a quote from Tex2002ans</p>
<p class="blockquote1">This is a sample blockquote sentence.</p>
<p class="blockquote2">This is some more sentences.</p>
<p class="blockquote2">And this is the end.</p>
<p>Continue with the story.</p>
or the book might have had poetry:
Code:
<p class="poem">This is a poem,</p>
<p class="poem2">that is written by Tex.</p>
<p class="poem">This is a poem,</p>
<p class="poem2">that will break the Regex.</p>
So, long story short, clean up the classes first, then run the nice Regex once you know everything you are piecing together is actually a broken paragraph!
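To make the failure concrete, here is a rough Python sketch. The two patterns are hypothetical stand-ins I made up for illustration (they are NOT Regex #1/#2/#3 from this thread), and the "broken" sample paragraph is invented too. The only point is to show how a variant that accepts any class will happily glue poem lines together, while a plain-<p>-only pattern leaves them alone.

```python
import re

# Hypothetical pattern (not the thread's Regex): join a class-less <p> that
# ends in a lowercase letter or comma with a following class-less <p> that
# starts with a lowercase letter.
PLAIN_ONLY = re.compile(r'([a-z,])</p>\s*<p>([a-z])')

# Hypothetical "all classes" variant: same idea, but the second <p> may
# carry any class attribute.
ANY_CLASS = re.compile(r'([a-z,])</p>\s*<p(?: class="[^"]*")?>([a-z])')

# Poem sample taken from the example above.
poem = (
    '<p class="poem">This is a poem,</p>\n'
    '<p class="poem2">that is written by Tex.</p>\n'
)

# Invented sample of a paragraph that really did break across pages.
broken = '<p>This paragraph broke across</p>\n<p>two pages.</p>\n'


def join(pattern, html):
    """Replace the matched paragraph boundary with a single space."""
    return pattern.sub(r'\1 \2', html)


print(join(PLAIN_ONLY, broken))  # the genuinely broken paragraph is rejoined
print(join(PLAIN_ONLY, poem))    # poem is left untouched
print(join(ANY_CLASS, poem))     # poem lines are merged and "poem2" is lost
```

In this sketch the plain-only pattern never sees the poem because of the class attributes, which is exactly why cleaning up the classes first matters: once you have decided what each class means, whatever is left as a bare <p> is much more likely to be a real broken paragraph.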
Quote:
Originally Posted by theducks
The code tips we toss out are generic solutions that may need to be honed to fit a specific case.  Never fully trust a bulk (replace all) change.
|
Oh yeah, definitely don't press "Replace All" while running a Regex, unless you know EXACTLY what it is doing. Even then, be sure to thoroughly test it.
Even though I trust Regex #2 and Regex #3 with my life, I still have them in my Sigil's Saved Searches under the heading, "One at a Time".
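If you are working outside Sigil, the same "One at a Time" habit can be approximated with a small script. This is only a sketch: preview_matches is a hypothetical helper name, and the pattern used in the usage lines is an invented example, not one of the Regex from this thread. It shows each match with a bit of surrounding context, before and after the replacement, so you can eyeball every change instead of trusting a bulk replace.

```python
import re


def preview_matches(pattern, replacement, text, context=20):
    """Yield (before, after) snippets for each match, one at a time.

    Nothing is modified; this only shows what each individual
    replacement would look like with `context` characters around it.
    """
    for m in re.finditer(pattern, text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        before = text[start:end]
        # m.expand() fills in backreferences such as \1 and \2.
        after = text[start:m.start()] + m.expand(replacement) + text[m.end():end]
        yield before, after


# Usage with an invented sample and a hypothetical pattern:
sample = '<p>This paragraph broke across</p>\n<p>two pages.</p>\n'
for before, after in preview_matches(r'([a-z,])</p>\s*<p>([a-z])', r'\1 \2', sample):
    print('BEFORE:', repr(before))
    print('AFTER: ', repr(after))
```

The design choice here is simply to separate "find and show" from "apply", so a human gets to say yes or no to each match.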