I have a bunch of regexes that I have been meaning to post, for fixing pdf -> epub conversions.
Most of them are written in Python-compatible syntax. Some may be in JGsoft syntax (I use RegexBuddy for all of my regex work - well worth it if you're fixing a lot of books!). No idea if these are of any value.
The first (long) one is for spotting empty/whitespace-filled elements - it has a fairly full element list, so you can remove from the list anything you want to preserve.
Trim whitespace from the start and end of a few elements - this will replace every paragraph/div/etc., so be very careful, or use lookaheads...
find : <((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>
replace : <\1>\3</\2>
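For anyone running this from a script rather than an editor, here is a minimal Python sketch of the trim. The function name and the sample markup are made up for illustration. Note that it closes the element with \2 (the bare tag name), since \1 also carries the attributes.

```python
import re

# Group 1: tag name plus attributes, group 2: bare tag name, group 3: content.
TRIM = re.compile(r'<((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>')

def trim_element_whitespace(html):
    # Rebuild the element without the padding around its content.
    # The closing tag uses group 2 so attributes are not copied into it.
    return TRIM.sub(r'<\1>\3</\2>', html)

sample = '<p class="calibre1">   Some text.  </p>'
print(trim_element_whitespace(sample))  # <p class="calibre1">Some text.</p>
```

Because "." does not match newlines by default, elements whose content spans several lines are left alone.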
Find and fix ellipses the way I like them (unspaced points), i.e. " . . . " -> " ... " and ". .. " -> "... ". Preserves the leading/trailing space.
find : (?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])
replace : \g<lead>...\g<trail>
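A quick way to try the ellipsis fix in Python; the pattern and replacement are the ones given, only the function name and sample strings are invented.

```python
import re

# Same pattern as above: optional leading whitespace, two or three
# (possibly spaced) dots, an optional trailing space, and a lookahead
# that skips dots followed by a slash or backslash (e.g. "../").
ELLIPSIS = re.compile(r'(?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])')

def fix_ellipses(text):
    return ELLIPSIS.sub(r'\g<lead>...\g<trail>', text)

print(fix_ellipses('Wait . . . what'))   # Wait ... what
print(fix_ellipses('end. .. Next'))      # end... Next
print(fix_ellipses('../images/x.png'))   # unchanged
```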
Replace ' - ' and ' -- ', as well as a few other cases, with an em dash - trying to preserve quote spacing (i.e. don't add a space before/after a quotation mark).
This is written in JGsoft syntax.
find : (?<=<(?P<wat>p|div|h\d|span)[^<>]*>.*)(?P<lead>(\s*|[“”,]))(\s?\-){2,4}(?P<trail>(\s*|[“”]))(?=.*</(\k<wat>)>)
replace : \g<lead>—\g<trail>
Fix broken paragraphs that were split by page breaks - be careful; you might want to add \d to the negated class ([^A-Z]) so that paragraphs starting with a digit are not joined either.
find : <p>([^<]*?)(?<!\.)</p>\s+<p>([^A-Z][^<]*?)</p>
replace : <p>\1 \2</p>
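A rough Python sketch of the paragraph join, with made-up sample text. It uses a negative lookbehind (?<!\.) before </p> to skip paragraphs that end in a full stop (a lookahead in that position would always see "<" and never reject anything). That reading of the intent is my assumption.

```python
import re

# First paragraph must not end in ".", second must not start with A-Z.
BROKEN = re.compile(r'<p>([^<]*?)(?<!\.)</p>\s+<p>([^A-Z][^<]*?)</p>')

def join_broken_paragraphs(html):
    # Merge the two paragraphs into one, separated by a single space.
    return BROKEN.sub(r'<p>\1 \2</p>', html)

print(join_broken_paragraphs('<p>The cat sat on</p>\n<p>the mat.</p>'))
# <p>The cat sat on the mat.</p>
```

One pass only joins non-overlapping pairs, so a paragraph split across three pages may need a second run.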
Find nested formatting tags - i.e. "<i><i>hmm</i></i>" -> "<i>hmm</i>"
find : <([buis])>\s*<\1>(.*?)</\1>\s*</\1>
replace : <\1>\2</\1>
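The same thing as a small Python function, in case it helps; the name and samples are just for illustration.

```python
import re

# A b/u/i/s tag immediately wrapping the same tag again.
NESTED = re.compile(r'<([buis])>\s*<\1>(.*?)</\1>\s*</\1>')

def collapse_nested(html):
    # Keep a single pair of the doubled tag around the content.
    return NESTED.sub(r'<\1>\2</\1>', html)

print(collapse_nested('<i><i>hmm</i></i>'))  # <i>hmm</i>
print(collapse_nested('<i><b>x</b></i>'))    # unchanged, the tags differ
```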
Find and merge redundant adjacent formatting tags : "<i>de</i><i>rp</i>" -> "<i>derp</i>"
find : <([sibu])>(.*)</\1>(\s*)<\1>(.*)</\1>
replace : <\1>\2\3\4</\1>
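A sketch of how this could be applied in Python. Since the .* groups are greedy, a single pass over a run of three or more adjacent tags only merges one pair, so this version (my own wrapper, not part of the regex) repeats the substitution until nothing changes.

```python
import re

# Two same-type formatting elements separated only by whitespace.
ADJACENT = re.compile(r'<([sibu])>(.*)</\1>(\s*)<\1>(.*)</\1>')

def merge_adjacent(html):
    # Each successful pass removes one closing/opening tag pair,
    # so the loop always terminates.
    while True:
        merged = ADJACENT.sub(r'<\1>\2\3\4</\1>', html)
        if merged == html:
            return merged
        html = merged

print(merge_adjacent('<i>de</i><i>rp</i>'))         # <i>derp</i>
print(merge_adjacent('<i>a</i><i>b</i><i>c</i>'))   # <i>abc</i>
```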
Find spaced, grouped formatting tags (sbiu) - makes it easy to replace with css later if needed.
find : <([busi])>\s*<([busi])>\s*<(\1|\2)>(.*)</(\1|\2)>\s*</\2>\s*</\1>
replace : <\1><\2>\4</\2></\1>
Stripping spans from paragraphs - be careful if your style uses spans for formatting; replace those with the equivalent HTML tags first.
find : <p[\s\w=\-";]*>\s*<span[\s\w=\-";]*>(.*)</span>\s*</p>
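No replacement is listed for this one, so the following Python sketch is only a guess at how it might be used. Two things are my own additions and not part of the pattern above: a capture group around the <p> attributes so they survive, and a guard that refuses content containing further span tags (with a plain .* a paragraph holding several sibling spans would come out mangled).

```python
import re

# Hypothetical variant: group 1 keeps the <p> attributes, group 2 is the
# span content, which may not itself contain <span> or </span>.
SPAN_IN_P = re.compile(
    r'<p([\s\w=\-";]*)>\s*<span[\s\w=\-";]*>((?:(?!</?span\b).)*)</span>\s*</p>'
)

def strip_wrapping_span(html):
    # Drop the span that wraps the whole paragraph, keep the paragraph.
    return SPAN_IN_P.sub(r'<p\1>\2</p>', html)

sample = '<p class="calibre1"><span class="calibre2">Hello there.</span></p>'
print(strip_wrapping_span(sample))  # <p class="calibre1">Hello there.</p>
```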
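Find missing quotation marks (this uses a variable-length lookbehind, so it needs JGsoft or another flavor that supports it; Python's re module will reject it) :
find : (?<=<(p|div|h\d|span)[^<>]*>)[^"]*"([^"\n\r]*?)(?=</\1>)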
Replace straight quotation marks with fancy ones - NB: use the previous regex to find missing ones beforehand! :
find : (?<=<(p|div|h\d|span)[^<>]*>.*)"(?<quote>[^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"(?=.*</\1>)
replace : “\g<quote>”
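The lookbehind here is variable-length, which Python's re module does not accept. One possible workaround, sketched below under my own assumptions, is to match each element first and only then curl the quotes inside its content. The two-step structure and the function names are not from the original regex.

```python
import re

# Step 1: a whole element; group 2 holds the attributes, group 3 the content.
ELEMENT = re.compile(r'<(p|div|h\d|span)([^<>]*)>(.*?)</\1>')
# Step 2: a straight-quoted run with no other quote-like characters in it.
STRAIGHT = re.compile(r'"([^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"')

def curl_quotes(html):
    def fix(match):
        # Only the content is rewritten, so attribute quotes stay straight.
        body = STRAIGHT.sub('\u201C\\1\u201D', match.group(3))
        return f'<{match.group(1)}{match.group(2)}>{body}</{match.group(1)}>'
    return ELEMENT.sub(fix, html)

print(curl_quotes('<p class="calibre1">"Hello," she said.</p>'))
```

Quotes inside attributes of tags nested in the content (say a span with a class) would still be touched, so check the output on a sample before running it over a whole book.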
Find incorrect direction double quotation marks - often Calibre makes mistakes following dashes and such - best to fix manually unless you're sure of the case:
find : (“\s*)(?P<quote>.+?)\s*?(“|”)
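Since these are best fixed by hand, a script can at least list the suspicious spots. This Python sketch uses the pattern as given and, as my own assumption about what counts as suspicious, reports only the matches where the quote is "closed" by another opening mark.

```python
import re

# Group 3 is whichever double quotation mark ends the quoted run.
SUSPECT = re.compile(r'(“\s*)(?P<quote>.+?)\s*?(“|”)')

def wrong_direction_quotes(text):
    # Return the matched snippets that end in an opening mark.
    return [m.group(0) for m in SUSPECT.finditer(text) if m.group(3) == '“']

print(wrong_direction_quotes('“Stop,“ he said. “Fine.”'))  # ['“Stop,“']
```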
Have a bunch of other ones, but they are a bit messy.