#1 |
Junior Member
Posts: 4
Karma: 10
Join Date: Sep 2011
Device: none
|
Fixing hyphens and dashes with regular expressions
I'm looking for regular expressions to help fix punctuation problems in the XHTML files of EPUB books. Many ebooks were created by scanning and optical character recognition. In many such books every dash has been replaced by a hyphen, and a paragraph may be truncated after the first part of a hyphenated word.
To fix some of the problems, I have been using a sequence of regular expressions on HTML documents (using Sigil or BBEdit):
1. -\<\/p\>\s\<p\> (Fixes lines that end with a hyphen.)
2. \s-\s|\s-|-\s (Replace with an em dash.)
3. ("|“|'|‘)- (Replace with quote mark & em dash.)
4. -("|”|'|’) (Replace with em dash & quote mark.)
After the above steps, I manually search for hyphens and, when appropriate, replace them with dashes. I'm looking for a more efficient method. Any advice from regular expression experts? |
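For anyone who would rather run the four steps in one pass than repeat them by hand in an editor, here is a rough Python sketch of the same sequence. It is only an illustration under some assumptions: it reads the whitespace class `\s` in steps 1 and 2, it allows any amount of whitespace between `</p>` and `<p>`, and it drops the trailing hyphen when rejoining a split paragraph (the post does not say what the replacement is). `fix_dashes` and `RULES` are names made up for this sketch.

```python
import re

EM_DASH = "\u2014"

# (pattern, replacement) pairs, applied in the order given in the post.
RULES = [
    # 1. Paragraph that ends with a hyphen: rejoin it with the next one.
    #    Assumption: the hyphen is dropped and the two halves are joined.
    (re.compile(r"-</p>\s*<p>"), ""),
    # 2. Hyphen with whitespace on one or both sides becomes an em dash.
    (re.compile(r"\s-\s|\s-|-\s"), EM_DASH),
    # 3. Opening quote mark followed by a hyphen.
    (re.compile("([\"“'‘])-"), r"\1" + EM_DASH),
    # 4. Hyphen followed by a closing quote mark.
    (re.compile("-([\"”'’])"), EM_DASH + r"\1"),
]


def fix_dashes(xhtml: str) -> str:
    """Apply each rule in turn to the text of one XHTML file."""
    for pattern, replacement in RULES:
        xhtml = pattern.sub(replacement, xhtml)
    return xhtml
```

Hyphens inside words (for example "well-known") are not touched by any of these rules, so they still need the manual pass described above. Step 1 as written also only matches a bare `<p>` tag, not one with attributes.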
#2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
There is a dehyphenate function under heuristics that is good at removing only those end-of-line hyphens that should actually be removed, while retaining the others (and still unwrapping lines when required).
I don't know of a reliable way to determine which single hyphens should be converted back to an em or en dash. The 'smarten punctuation' feature will turn any double hyphen into an em dash. |
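To illustrate the double-hyphen rule, and to make the manual review of single hyphens a little quicker, here is a small Python sketch. It is not the code behind the 'smarten punctuation' feature, just a guess at the simplest equivalent. `smarten_double_hyphens` and `remaining_hyphens` are hypothetical helper names, and the functions work on plain text, so markup such as HTML comments (which contain `--`) would need to be excluded first.

```python
import re

EM_DASH = "\u2014"


def smarten_double_hyphens(text: str) -> str:
    """Replace each run of exactly two hyphens with an em dash."""
    # The lookarounds leave longer runs such as "---" untouched.
    return re.sub(r"(?<!-)--(?!-)", EM_DASH, text)


def remaining_hyphens(text: str, width: int = 15):
    """Yield a short snippet of context around every hyphen left in the text."""
    for match in re.finditer("-", text):
        start = max(0, match.start() - width)
        yield text[start:match.end() + width]
```

Printing the snippets from `remaining_hyphens` gives a list of candidates to review by eye, which seems to be the best that can be done given that single hyphens cannot be classified reliably.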