MobileRead Forums - View Single Post - reformatting: text with unwanted linebreaks

ldolse · 12-21-2010, 12:41 PM

Quote:

Originally Posted by kiwidude

You are welcome. Regexes can make the otherwise mindless task of tidying up a book conversion more interesting. Ok, not that much, but a little bit.

There is a big mental checklist of stuff I go through with every epub I clean up (not all using regex exclusively of course) including...
- Stripping any "faked" indenting with &nbsp; and replacing it with an indented justified style
- Ensuring all chapters are given a heading style
- Stripping out nested div tags and replacing divs with paragraphs
- Stripping out tags that are unnecessary when the paragraph css style is set correctly.
- Recombining paragraphs that contain broken sentences
- Replacing incorrect or inadequate quotes around speech. For instance I don't like speech that is 'Some quote' (or worse, an inconsistent combination of " ` ' etc from a bad OCR conversion) and prefer to see “Some quote”
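As a rough illustration of the first checklist item, here is a minimal Python sketch that strips leading &nbsp; padding from paragraphs and marks them with a CSS class instead. The function name and the "indented" class are made up for the example, and it assumes plain <p> tags with no attributes; it is not taken from Sigil or calibre.

```python
import re

# Hypothetical helper: the name and the "indented" class are assumptions.
# Matches a bare <p> followed by one or more &nbsp; entities (or literal
# U+00A0 characters), with optional ordinary whitespace around them.
FAKE_INDENT = re.compile(r'<p>(?:\s*(?:&nbsp;|\u00a0))+\s*')

def strip_fake_indent(html: str) -> str:
    """Replace faked &nbsp; indenting with a class the stylesheet can indent."""
    return FAKE_INDENT.sub('<p class="indented">', html)

sample = '<p>&nbsp;&nbsp;&nbsp; "Hello," she said.</p>'
print(strip_fake_indent(sample))
```

The stylesheet would then need a matching rule (for example a text-indent on p.indented) to do the real indenting.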

There are still circumstances you won't catch without manually eyeballing but you can fairly quickly turn a very badly formatted document into one that is considerably more pleasant to read.

You mentioned multi-line paragraphs - hopefully you saw you can cope with those in Sigil with my example above by just using \s+ (one or more whitespace characters, which includes line breaks). You don't have to think about "newline" characters like \r or \n in Sigil; just use \s+ between the closing and opening tags and your expression will match across lines.
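The same \s+ idea can be sketched outside Sigil. The Python snippet below rejoins paragraphs that were split mid-sentence. The "broken sentence" test (previous paragraph ends in a lowercase letter or comma, next one starts with a lowercase letter) is only an assumed heuristic for the example and will miss or mis-join some cases.

```python
import re

# Assumed heuristic: a break is "unwanted" when the text before </p> ends in
# a lowercase letter or comma and the text after <p> starts in lowercase.
# \s+ covers spaces, tabs, \r and \n, so no explicit newline handling is needed.
BROKEN_PARA = re.compile(r'(?<=[a-z,])</p>\s+<p>(?=[a-z])')

def rejoin_paragraphs(html: str) -> str:
    """Merge paragraphs that were split in the middle of a sentence."""
    return BROKEN_PARA.sub(' ', html)

sample = "<p>The rain kept falling on the</p>\n\n<p>roof all night.</p>"
print(rejoin_paragraphs(sample))
```

Note that \s+ needs at least one whitespace character between the tags; use \s* if the tags may sit directly next to each other.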

One final point, which is mentioned on a few other threads: you should tick the "Minimal Matching" checkbox on the Find/Replace dialog, which is enabled when you choose regular expressions. In fact I haven't needed to uncheck it since finding out its purpose, so it is pretty much set and forget. It is the only way for certain expressions to work. For instance, say your document looks like this, with some pointless span tag pairs to remove:
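<span>Blah blah text</span>
<span>More text</span>

Find: <span>(.*)</span>
Replace: \1

This says: find *any* text within a pair of <span> and </span> tags and replace it with just the text, thereby removing the outer set of tags. This will only work "correctly" with the "Minimal Matching" checkbox turned on; without it, .* grabs as much as it can, and the match can run from the first <span> to the last </span>.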
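For anyone trying the span-removal example outside Sigil, the difference can be sketched in Python, where the rough equivalent of "Minimal Matching" is the lazy quantifier .*? in place of .* (an analogy, not a description of Sigil's internals). The sample string is an assumed reconstruction of the two-span document, and re.DOTALL is used on the assumption that the dot should be allowed to cross line breaks.

```python
import re

# Assumed sample: two paragraphs, each wrapped in a pointless <span> pair.
doc = "<span>Blah blah text</span>\n<span>More text</span>"

# Greedy: .* runs to the LAST </span>, so only the outermost pair is removed
# and the inner tags are left stranded in the text.
greedy = re.sub(r"<span>(.*)</span>", r"\1", doc, flags=re.DOTALL)

# Minimal (lazy): .*? stops at the FIRST </span>, so each pair is handled
# separately.
minimal = re.sub(r"<span>(.*?)</span>", r"\1", doc, flags=re.DOTALL)

print(repr(greedy))
print(repr(minimal))
```

Only the lazy version leaves both lines free of span tags.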

Several of those functions are logic I've placed in Calibre's preprocess code so I didn't have to find/rewrite the regexes every time I convert a book. The ones that aren't there yet:

Converting ` ' to '' - haven't tried to come up with a safe function to fix that one yet.
I don't mess with tags too much though unless there isn't any content between them. Empty spans and other empty formatting tags get deleted.
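A minimal sketch of what "delete empty formatting tags" could look like in Python follows. This is not calibre's actual preprocess code; the tag list and function name are assumptions chosen for the example.

```python
import re

# Assumed set of formatting tags; \1 makes the closing tag match the opener.
# Only pairs containing nothing but whitespace are matched.
EMPTY_TAG = re.compile(r'<(span|b|i|em|strong)\b[^>]*>\s*</\1>')

def drop_empty_tags(html: str) -> str:
    """Remove empty formatting tag pairs, repeating until nothing changes."""
    previous = None
    while previous != html:
        # Loop so nested empties such as <b><i></i></b> collapse fully.
        previous = html
        html = EMPTY_TAG.sub('', html)
    return html

sample = '<p>Text<span class="x"> </span><b><i></i></b> more</p>'
print(drop_empty_tags(sample))
```

Tags that contain any real content are left alone, which matches the "unless there isn't any content between them" rule above.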
I don't see a lot of use of divs except in LRF content, haven't done that one yet either.
Deleting a lot of Microsoft junk is still on my to-do list; that's partially done, though.