Fixing hyphens and dashes with regular expressions
I'm looking for regular expressions to help fix punctuation problems in the XHTML files of EPUB books. Many ebooks were created by scanning and optical character recognition (OCR). In many such books every dash has been replaced by a hyphen, and a paragraph may be cut off after the first part of a hyphenated word.
To fix some of the problems, I have been running a sequence of regular expressions over the HTML documents (in Sigil or BBEdit):
1. -\<\/p\>\s\<p\> (Fixes paragraphs that end with a hyphen.)
2. \s-\s|\s-|-\s (Replace with an em dash.)
3. ("|“|'|‘)- (Replace with quote mark & em dash.)
4. -("|”|'|’) (Replace with em dash & quote mark.)
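For reference, here is a minimal Python sketch of the four steps above, in case scripting them is an option. It rests on assumptions of mine: the whitespace class is lowercase `\s` in steps 1 and 2, step 1 keeps the hyphen and merges the two paragraphs, and the function name `fix_dashes` is just a placeholder.

```python
import re

EM_DASH = "\u2014"

def fix_dashes(xhtml: str) -> str:
    """Apply the four hyphen/dash fixes in the order listed above."""
    # Step 1: a paragraph cut off after a hyphen is rejoined with the next one.
    # Assumption: the hyphen itself is kept (e.g. "well-" + "known").
    text = re.sub(r"-</p>\s*<p>", "-", xhtml)
    # Step 2: a hyphen with whitespace on either side becomes an em dash.
    # The adjacent whitespace is consumed, as in the original pattern.
    text = re.sub(r"\s-\s|\s-|-\s", EM_DASH, text)
    # Step 3: opening quote followed by a hyphen -> quote + em dash.
    text = re.sub(r"""("|“|'|‘)-""", r"\1" + EM_DASH, text)
    # Step 4: hyphen followed by a closing quote -> em dash + quote.
    text = re.sub(r"""-("|”|'|’)""", EM_DASH + r"\1", text)
    return text

if __name__ == "__main__":
    sample = '<p>He was well-</p>\n<p>known - or so he said. "-Stop-" she cried.</p>'
    print(fix_dashes(sample))
```

The patterns run over the whole file, markup included, so a hyphen next to whitespace inside a comment or an attribute value would be changed as well. Hyphens inside ordinary compound words such as "twenty-one" are left alone and still need a manual pass.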
After the above steps, I manually search for hyphens and, when appropriate, replace them with dashes. I'm looking for a more efficient method. Any advice from regular expression experts?