MobileRead Forums - View Single Post - PDF to Kindle: The unobtainable Holy Grail of ebooks

Serpentine · 10-19-2011, 05:09 PM

I have a bunch of regexes that I have been meaning to put out, for fixing PDF -> EPUB conversions.
Most of them are written in Python-compatible syntax. Some may be in JGsoft syntax (I use RegexBuddy for all of my regex work - well worth it if you're fixing a lot of books!). No idea if these are of any value.

The first (long) one spots empty/whitespace-filled elements. The element list is fairly full, so you can remove anything you want to preserve.

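Spoiler:
find : <(h\d|[uod]l|a|hr|abbr|acronym|address|applet|area|b|base|basefont|bdo|big|blockquote|body|button|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dt|em|fieldset|font|form|frame|frameset|head|hr|html|i|iframe|ins|kbd|label|legend|li|link|map|menu|meta|noframes|noscript|object|optgroup|option|p|param|pre|q|s|samp|script|select|small|span|strike|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|u|var)[\s\w=\-"/\.;_]*>([\s\r\n]| )*</\1>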

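Trim whitespace from the start and end of a few elements. This will rewrite every paragraph/div/etc., so be very careful, or use lookaheads to narrow it down. Note that \1 captures the tag together with its attributes and \2 only the tag name, so the closing tag uses \2.
find : <((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>
replace : <\1>\3</\2>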
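For anyone who would rather script this than run it in an editor, here is a minimal Python sketch of the trim. The function name trim_elements is made up for illustration, and the sketch closes the element with the tag-name group (\2) so that attributes are not repeated in the closing tag.

```python
import re

# Group 1 is the tag plus its attributes, group 2 the bare tag name,
# group 3 the element content with surrounding whitespace excluded.
TRIM = re.compile(r'<((div|p|h\d|span)[\s\w=\-"/\.;]*)>\s*(.*?)\s*</\2>')

def trim_elements(html):
    """Strip leading/trailing whitespace inside div, p, hN and span elements."""
    return TRIM.sub(r'<\1>\3</\2>', html)

print(trim_elements('<p class="x">  Hello world  </p>'))
```

Since "." does not match newlines by default, elements whose content spans several lines are left alone unless you compile with re.DOTALL.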

Find ellipses and fix them the way I like (I like unspaced points), i.e. " . . . " -> " ... " and ". .. " -> "... ". Preserves the leading/trailing space.
find : (?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])
replace : \g<lead>...\g<trail>
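The ellipsis pattern runs unchanged under Python's re module. A small sketch follows; the helper name fix_ellipses is hypothetical.

```python
import re

# Two or three optionally spaced dots, not followed by a slash or backslash
# (so relative paths such as ../images are left alone).
ELLIPSIS = re.compile(r'(?P<lead>\s*)(\s?\.){2,3}(?P<trail>\s?)(?![\/\\])')

def fix_ellipses(text):
    """Collapse spaced-out dots into '...' while keeping the outer spacing."""
    return ELLIPSIS.sub(r'\g<lead>...\g<trail>', text)

print(fix_ellipses('Well . . . maybe'))
print(fix_ellipses('Wait. .. what'))
```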

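Replace ' -- ' and a few similar cases (runs of two to four hyphens, optionally spaced) with an em dash, trying to preserve quote spacing (i.e. don't add a space before/after a quotation mark). As written it needs at least two hyphens, so a lone ' - ' is left alone.
This is written in JGsoft syntax (variable-length lookbehind and \k<wat>), so it will not compile in Python's re module.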
find : (?<=<(?P<wat>p|div|h\d|span)[^<>]*>.*)(?P<lead>(\s*|[“”,]))(\s?\-){2,4}(?P<trail>(\s*|[“”]))(?=.*</(\k<wat>)>)
replace : \g<lead>—\g<trail>
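Python's re module only accepts fixed-width lookbehind, so the pattern above cannot be used there directly. One possible workaround, sketched below, is to split the markup on tags and only touch the text between them. This is a simplification rather than a port: it keeps whatever whitespace surrounds the hyphens and does not special-case quotation marks. The names TAG, DASH_RUN and fix_dashes are invented for the example.

```python
import re

TAG = re.compile(r'(<[^<>]*>)')   # capture group keeps the tags in the split result
DASH_RUN = re.compile(r'(?P<lead>\s*)(?:\s?-){2,4}(?P<trail>\s*)')
EM_DASH = '\u2014'

def fix_dashes(html):
    """Replace runs of 2-4 hyphens with an em dash, outside of tags only."""
    parts = TAG.split(html)
    # With one capture group, text lands at even indexes and tags at odd ones.
    for i in range(0, len(parts), 2):
        parts[i] = DASH_RUN.sub(
            lambda m: m.group('lead') + EM_DASH + m.group('trail'), parts[i])
    return ''.join(parts)

print(fix_dashes('<p class="a--b">Wait -- what</p>'))
```

Because tags are skipped, hyphens inside attribute values and HTML comments are not rewritten.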

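Fix broken paragraphs that were split by page breaks. Be careful; you might want to add \d to the negated class ([^A-Z]) so that paragraphs starting with a digit are left alone. The (?<!\.) lookbehind skips paragraphs that already end in a full stop.
find : <p>([^<]*?)(?<!\.)</p>\s+<p>([^A-Z][^<]*?)</p>
replace : <p>\1 \2</p>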
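A rough Python sketch of the paragraph join, assuming the pattern is meant to match two adjacent plain <p> elements (no attributes, no inline tags) and that the first must not end in a full stop, which is checked here with a lookbehind. The name join_paragraphs is made up.

```python
import re

# First paragraph must not end with '.', second must not start with A-Z.
JOIN = re.compile(r'<p>([^<]*?)(?<!\.)</p>\s+<p>([^A-Z][^<]*?)</p>')

def join_paragraphs(html):
    """Merge a paragraph with the next one when it looks like a page-break split."""
    return JOIN.sub(r'<p>\1 \2</p>', html)

print(join_paragraphs('<p>The cat sat on the</p>\n<p>mat and purred.</p>'))
```

A single pass merges non-overlapping pairs only, so a paragraph broken across three pages needs the substitution run more than once.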

Find nested formatting tags - i.e. "<i><i>hmm</i></i>" -> "<i>hmm</i>"
find : <([buis])>\s*<\1>(.*?)</\1>\s*</\1>
replace: <\1>\2</\1>

Find and replace redundant formatting: "<i>de</i><i>rp</i>" -> "<i>derp</i>"
find : <([sibu])>(.*)</\1>(\s*)<\1>(.*)</\1>
replace : <\1>\2\3\4</\1>
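The nested and redundant cases tend to feed each other, so it can help to repeat both until nothing changes. A sketch of that idea in Python; tidy_formatting is a hypothetical name, and the adjacent-tag pattern uses lazy .*? here instead of the greedy .* above, which is my own variation.

```python
import re

NESTED = re.compile(r'<([buis])>\s*<\1>(.*?)</\1>\s*</\1>')
ADJACENT = re.compile(r'<([sibu])>(.*?)</\1>(\s*)<\1>(.*?)</\1>')

def tidy_formatting(html):
    """Collapse doubled and back-to-back b/u/i/s tags until the text is stable."""
    previous = None
    while previous != html:
        previous = html
        html = NESTED.sub(r'<\1>\2</\1>', html)
        html = ADJACENT.sub(r'<\1>\2\3\4</\1>', html)
    return html

print(tidy_formatting('<i><i>hmm</i></i>'))
print(tidy_formatting('<i>de</i><i>rp</i>'))
```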

Find spaced, grouped formatting tags (sbiu) - makes it easy to replace with css later if needed.
find : <([busi])>\s*<([busi])>\s*<(\1|\2)>(.*)</(\1|\2)>\s*</\2>\s*</\1>
replace : <\1><\2>\4</\2></\1>
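A quick Python check of the grouped-tag pattern, with the closing tags emitted in nesting order (inner tag first). The function name collapse_grouped is invented for this example.

```python
import re

# Outer tag, inner tag, then a redundant repeat of either one around the text.
GROUPED = re.compile(
    r'<([busi])>\s*<([busi])>\s*<(\1|\2)>(.*)</(\1|\2)>\s*</\2>\s*</\1>')

def collapse_grouped(html):
    """Drop the redundant third tag and the whitespace between the wrappers."""
    return GROUPED.sub(r'<\1><\2>\4</\2></\1>', html)

print(collapse_grouped('<b> <i> <b>text</b> </i> </b>'))
```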

Strip spans from paragraphs. Be careful if your style uses spans for formatting; replace those with the HTML tag version first.
find : <p[\s\w=\-";]*>\s*<span[\s\w=\-";]*>(.*)</span>\s*</p>
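No replacement is given for the span stripper, so the sketch below assumes the simplest one, <p>\1</p>, and assumes the pattern ends with </span>\s*</p>. Both are my assumptions, as is the name strip_spans. Note that this choice also drops the attributes on the <p> itself.

```python
import re

# A paragraph whose entire content is wrapped in a single span.
SPAN_IN_P = re.compile(r'<p[\s\w=\-";]*>\s*<span[\s\w=\-";]*>(.*)</span>\s*</p>')

def strip_spans(html):
    """Unwrap a span that covers a whole paragraph (assumed replacement)."""
    return SPAN_IN_P.sub(r'<p>\1</p>', html)

print(strip_spans('<p class="calibre1"><span class="calibre2">Some text</span></p>'))
```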

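Find missing quotation marks (an unpaired straight quote inside an element). The lookbehind is variable-length, so this needs JGsoft or another engine that allows it:
find : (?<=<(p|div|h\d|span)[^<>]*>)[^"]*"([^"\n\r]*?)(?=</\1>)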
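Since Python's re module rejects variable-length lookbehind, here is a different technique that gets at the same problem: count the straight quotes in each block and report the blocks with an odd number. It is a swapped-in approach, not a translation of the regex, and the names BLOCK, TAG and unbalanced_quotes are hypothetical.

```python
import re

BLOCK = re.compile(r'<(p|div|h\d)\b[^<>]*>(.*?)</\1>', re.DOTALL)
TAG = re.compile(r'<[^<>]*>')

def unbalanced_quotes(html):
    """Return the blocks whose text holds an odd number of straight quotes."""
    suspects = []
    for match in BLOCK.finditer(html):
        text = TAG.sub('', match.group(2))   # ignore quotes inside inline tags
        if text.count('"') % 2:
            suspects.append(match.group(0))
    return suspects

print(unbalanced_quotes('<p>"Fine," he said.</p><p>"Where is the end?</p>'))
```

Dialogue that legitimately continues across paragraphs without a closing quote will be flagged too, so treat the result as a list to review by hand.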

Replace straight quotation marks with fancy ones (JGsoft syntax again, because of the variable-length lookbehind and the (?<quote>...) group). NB: use the previous regex to find missing ones beforehand!
find : (?<=<(p|div|h\d|span)[^<>]*>.*)"(?<quote>[^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"(?=.*</\1>)
replace : “\g<quote>”
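A possible Python take on the smart-quote replacement, using the same split-on-tags workaround in place of the lookbehind. The quote pattern itself is taken from above with the group renamed to Python's (?P<quote>...) form; TAG, QUOTED and curl_quotes are made-up names.

```python
import re

TAG = re.compile(r'(<[^<>]*>)')
QUOTED = re.compile(
    r'"(?P<quote>[^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+?)\s*?"')

def curl_quotes(html):
    """Turn paired straight quotes into curly ones, outside of tags only."""
    parts = TAG.split(html)
    # Even indexes are text, odd indexes are tags (attribute quotes stay put).
    for i in range(0, len(parts), 2):
        parts[i] = QUOTED.sub('\u201c\\g<quote>\u201d', parts[i])
    return ''.join(parts)

print(curl_quotes('<p class="q">He said "hello" twice.</p>'))
```

One limitation of this sketch: a quotation that opens in one text run and closes in another, for example around an <i> element, is not paired up.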

Find double quotation marks facing the wrong direction. Calibre often makes mistakes after dashes and the like, so it's best to fix these manually unless you're sure of the case:
find : (“\s*)(?P<quote>.+?)\s*?(“|”)
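This last pattern is already valid in Python. As written it matches every quoted span, so the sketch below adds a filter of my own: it reports only the spans that end with another opening quote. The name misdirected is hypothetical.

```python
import re

# An opening curly quote, the quoted text, then either curly quote.
PAIR = re.compile(r'(\u201c\s*)(?P<quote>.+?)\s*?(\u201c|\u201d)')

def misdirected(text):
    """List quoted spans that are 'closed' by an opening quotation mark."""
    return [m.group(0) for m in PAIR.finditer(text) if m.group(3) == '\u201c']

sample = '\u201cRight,\u201d he said. \u201cWrong\u201c she said.'
print(misdirected(sample))
```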

Have a bunch of other ones, but they are a bit messy.

Post #47. Last edited by Serpentine; 10-20-2011 at 12:36 PM. Reason: now with less emotes! and a few fixes