Old 10-19-2011, 05:09 PM   #47
Serpentine
Evangelist
Serpentine ought to be getting tired of karma fortunes by now.
 
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
I have a bunch of regexes that I have been meaning to put out, for fixing up epubs converted from PDF.
Most of them are written in Python-compatible syntax. Some may be in JGsoft syntax (I use RegexBuddy for all of my regex work - well worth it if you're fixing a lot of books!). No idea if these are of any value.

The first (long) one spots empty or whitespace-only elements. It has a fairly full element list, so you can remove from the list any element you want to preserve.
Code:
<(h\d|[uod]l|a|hr|abbr|acronym|address|applet|area|b|base|basefont|bdo|big|blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dt|em|fieldset|font|form|frame|frameset|head|hr|html|i|iframe|ins|kbd|label|legend|li|link|map|menu|meta|noframes|noscript|object|optgroup|option|p|param|pre|q|s|samp|script|select|small|span|strike|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|u|var)[\s\w=\-"/\.;_]*>([\s\r\n]|&nbsp;)*</\1>
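
A rough Python sketch of how the empty-element pattern could be applied, in case it helps. The tag list is cut down to a handful for readability (the full list above can be dropped in), and the function name strip_empty is made up for this example. It loops because removing an empty inner element can leave its wrapper empty too.

```python
import re

# Shortened tag list for illustration only; substitute the full list if needed.
TAGS = r'h\d|[uod]l|p|div|span|b|i|u|em|strong'
EMPTY = re.compile(r'<(' + TAGS + r')[\s\w=\-"/\.;_]*>(?:\s|&nbsp;)*</\1>')

def strip_empty(html):
    """Remove elements that contain only whitespace or &nbsp; entities."""
    previous = None
    # Repeat until stable: a wrapper may only become empty after its child is removed.
    while previous != html:
        previous = html
        html = EMPTY.sub('', html)
    return html
```

For instance, strip_empty('<div><span>\n</span></div><p>a</p>') needs two passes: the span goes first, then the div that held it.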


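Trim white space from the start and end of the content of a few elements - this will rewrite every paragraph/div/etc., so be very careful, or use lookaheads to narrow it down. The closing tag uses \2 (the bare tag name), because \1 also contains the attributes.
find : <((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>
replace : <\1>\3</\2>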
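
A minimal Python sketch of the trim, assuming single-line elements (the dot does not cross newlines here). The function name trim_inner is hypothetical. Note that the sketch closes the element with \2, the bare tag name, since \1 also carries any attributes and would produce a malformed closing tag.

```python
import re

# Group 1: tag name plus attributes; group 2: tag name only; group 3: trimmed content.
TRIM = re.compile(r'<((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>')

def trim_inner(html):
    """Strip leading and trailing whitespace inside div/p/hN/span elements."""
    return TRIM.sub(r'<\1>\3</\2>', html)
```

Matches do not overlap, so an element nested inside another matched element is left as it was.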

Find and fix ellipses the way I like them (unspaced points), i.e. " . . . " -> " ... " and ". .. " -> "... ". Preserves the leading/trailing space.
find : (?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])
replace : \g<lead>...\g<trail>
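
The ellipsis pair runs as-is under Python's re module. A small sketch, with fix_ellipses as a made-up function name:

```python
import re

# Two or three dots, each optionally preceded by one space; the negative
# lookahead skips things like "../" in relative paths.
ELLIPSIS = re.compile(r'(?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])')

def fix_ellipses(text):
    """Collapse spaced-out dots into '...' while keeping surrounding spaces."""
    return ELLIPSIS.sub(r'\g<lead>...\g<trail>', text)
```

A single sentence-ending period is never touched, because the pattern needs at least two dots.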

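Replace ' -- ' and a few similar cases with an emdash - trying to preserve quote spacing (i.e. don't space it if it comes before/after a quotation mark). Note that the pattern as written needs at least two hyphens, so a lone ' - ' is not matched.
This is written in JGsoft syntax (it relies on a variable-length lookbehind, which Python's re module does not support).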
find : (?<=<(?P<wat>p|div|h\d|span)[^<>]*>.*)(?P<lead>(\s*|[“”,]))(\s?\-){2,4}(?P<trail>(\s*|[“”]))(?=.*</(\k<wat>)>)
replace : \g<lead>—\g<trail>
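
Since Python's re module rejects variable-length lookbehinds, here is one possible workaround, not the JGsoft pattern itself: run the dash pattern only on text between tags, using a callback. This is a looser scope than the original (any text node, not just p/div/hN/span), and the names fix_dashes, DASH and TEXT_NODE are invented for the sketch.

```python
import re

EM_DASH = '\u2014'
# Two to four hyphens, optionally spaced, with the neighbouring space or quote captured.
DASH = re.compile(r'(?P<lead>\s*|[\u201C\u201D,])(?:\s?-){2,4}(?P<trail>\s*|[\u201C\u201D])')
# Text between a '>' and the next '<', so tag names and attributes are never touched.
TEXT_NODE = re.compile(r'>[^<>]+<')

def fix_dashes(html):
    """Replace runs of hyphens in text nodes with an em dash, keeping the spacing."""
    def in_text(node):
        return DASH.sub(lambda d: d.group('lead') + EM_DASH + d.group('trail'),
                        node.group(0))
    return TEXT_NODE.sub(in_text, html)
```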

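Fix broken paragraphs that were not joined due to page breaks - be careful; you might want to add \d to the negated class ([^A-Z] -> [^A-Z\d]) so paragraphs starting with a digit are left alone. The "does not end with a full stop" check has to be a lookbehind, since a lookahead placed right before </p> only ever sees the "<".
find : <p>([^<]*?)(?<!\.)</p>\s+<p>([^A-Z][^<]*?)</p>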
replace : <p>\1 \2</p>
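
A short Python sketch of the paragraph join. It uses a lookbehind, (?<!\.), to test the last character before </p>; a lookahead at that position would only see the "<" and never fail. The function name join_paragraphs is hypothetical.

```python
import re

# First paragraph must not end with a full stop; second must not start with A-Z.
BROKEN = re.compile(r'<p>([^<]*?)(?<!\.)</p>\s+<p>([^A-Z][^<]*?)</p>')

def join_paragraphs(html):
    """Join a paragraph to the next one when it looks like a page-break split."""
    return BROKEN.sub(r'<p>\1 \2</p>', html)
```

Only plain <p> tags without attributes or inline markup are matched, which keeps the blast radius small.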

Find nested formatting tags - i.e "<i><i>hmm</i></i>" -> "<i>hmm</i>"
find : <([buis])>\s*<\1>(.*?)</\1>\s*</\1>
replace : <\1>\2</\1>
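
In Python this one can be looped so that deeper nesting collapses as well. A sketch, with unnest as an invented name:

```python
import re

NESTED = re.compile(r'<([buis])>\s*<\1>(.*?)</\1>\s*</\1>')

def unnest(html):
    """Collapse directly nested identical b/u/i/s tags into a single pair."""
    previous = None
    # Each pass removes one level of nesting, so repeat until nothing changes.
    while previous != html:
        previous = html
        html = NESTED.sub(r'<\1>\2</\1>', html)
    return html
```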

Find and replace redundant formatting : "<i>de</i><i>rp</i>" -> "<i>derp</i>"
find : <([sibu])>(.*)</\1>(\s*)<\1>(.*)</\1>
replace : <\1>\2\3\4</\1>
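
A Python sketch of the merge. One pass only joins one pair per match, so the sketch repeats until the text stops changing; merge_adjacent is a made-up name.

```python
import re

ADJACENT = re.compile(r'<([sibu])>(.*)</\1>(\s*)<\1>(.*)</\1>')

def merge_adjacent(html):
    """Merge back-to-back runs of the same s/i/b/u tag, keeping any space between them."""
    previous = None
    # A run of three or more identical tags needs more than one pass.
    while previous != html:
        previous = html
        html = ADJACENT.sub(r'<\1>\2\3\4</\1>', html)
    return html
```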

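Find spaced, grouped formatting tags (s/b/i/u) where the innermost tag repeats one of the two outer ones, e.g. "<b> <i> <b>x</b> </i> </b>" - collapsing them makes it easy to replace with css later if needed. The closing tags in the replacement go in reverse order so the result stays properly nested.
find : <([busi])>\s*<([busi])>\s*<(\1|\2)>(.*)</(\1|\2)>\s*</\2>\s*</\1>
replace : <\1><\2>\4</\2></\1>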
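
A quick Python sketch of the grouped-tag collapse, with collapse_grouped as a hypothetical name. The replacement in the sketch closes the inner tag first (</\2> then </\1>) so the output stays well-formed.

```python
import re

# Outer tag, inner tag, then a third tag repeating either of them.
GROUPED = re.compile(
    r'<([busi])>\s*<([busi])>\s*<(\1|\2)>(.*)</(\1|\2)>\s*</\2>\s*</\1>')

def collapse_grouped(html):
    """Reduce three spaced, stacked formatting tags to a tight, properly nested pair."""
    return GROUPED.sub(r'<\1><\2>\4</\2></\1>', html)
```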

Stripping spans from paragraphs - be careful if your style is using spans for formatting; replace those with the html tag version first.
<p[\s\w=\-";]*>\s*<span[\s\w=\-";]*>(.*)</span>\s*</p>
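
No replacement is given for this one, so the following Python sketch makes two assumptions of its own: it adds a capture group around the <p> tag so the paragraph attributes survive, and it refuses to match when the paragraph holds more than one span (a greedy .* would otherwise swallow the span tags in the middle). The name strip_span is invented.

```python
import re

# Group 1: the <p> tag with its attributes; group 2: span content with no other span inside.
SPAN_IN_P = re.compile(
    r'<(p[\s\w=\-";]*)>\s*<span[\s\w=\-";]*>((?:(?!</?span\b).)*)</span>\s*</p>')

def strip_span(html):
    """Unwrap a paragraph whose entire content is a single span."""
    return SPAN_IN_P.sub(r'<\1>\2</p>', html)
```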

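Find missing quotation marks, i.e. elements that contain a single unpaired straight quote (variable-length lookbehind again, so this needs JGsoft or similar rather than Python's re) :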
(?<=<(p|div|h\d|span)[^<>]*>)[^"]*"([^"\n\r]*?)(?=</\1>)
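
For Python, which cannot run the lookbehind above, a different route to the same check is to count straight quotes per element and report the odd ones. This is a swapped-in technique, not a translation of the regex; unbalanced_quotes is a made-up name and it assumes single-line elements.

```python
import re

ELEMENT = re.compile(r'<(p|div|h\d|span)[^<>]*>(.*?)</\1>')

def unbalanced_quotes(html):
    """Return the elements whose content holds an odd number of straight double quotes."""
    return [m.group(0) for m in ELEMENT.finditer(html)
            if m.group(2).count('"') % 2 == 1]
```

Quoted attributes on inline tags inside the element come in pairs, so they do not disturb the odd/even count.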

Replace straight quotation marks with fancy ones - NB: use the previous regex to find missing ones beforehand! :
find : (?<=<(p|div|h\d|span)[^<>]*>.*)"(?<quote>[^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"(?=.*</\1>)
replace : “\g<quote>”
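
A possible Python version, again avoiding the variable-length lookbehind by working on text between tags. That is an assumption of this sketch and a real limitation: a quote that opens before an inline tag and closes after it (e.g. around an <i> run) will not be paired. The names curl_quotes, QUOTED and TEXT_NODE are hypothetical.

```python
import re

# A straight-quoted run that contains no other quote-like characters or line breaks.
QUOTED = re.compile(r'"([^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"')
# Text between a '>' and the next '<', so attribute quotes are left alone.
TEXT_NODE = re.compile(r'>[^<>]+<')

def curl_quotes(html):
    """Replace paired straight double quotes in text nodes with curly quotes."""
    def in_text(node):
        return QUOTED.sub(lambda q: '\u201C' + q.group(1) + '\u201D', node.group(0))
    return TEXT_NODE.sub(in_text, html)
```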

Find incorrect direction double quotation marks - often Calibre makes mistakes following dashes and such - best to fix manually unless you're sure of the case:
find : (“\s*)(?P<quote>.+?)\s*?(“|”)

Have a bunch of other ones, but they are a bit messy.

Last edited by Serpentine; 10-20-2011 at 12:36 PM. Reason: now with less emotes! and a few fixes