10-04-2011, 10:39 PM   #1
DoctorT
Fixing hyphens and dashes with regular expressions

I'm looking for regular expressions to help fix punctuation problems within the XHTML files of EPUB books. Many ebooks were created via scanning and optical character recognition. In many such books every dash is replaced by a hyphen, and a paragraph may be cut off after the first part of a hyphenated word.

To fix some of the problems, I have been using a sequence of regular expressions on HTML documents (using Sigil or BBEdit):

1. -\<\/p\>\s\<p\> (Fixes lines that end with a hyphen.)
2. \s-\s|\s-|-\s (Replace with an em dash.)
3. ("|“|'|‘)- (Replace with quote mark & em dash.)
4. -("|”|'|’) (Replace with em dash & quote mark.)

After the above steps, I manually search for hyphens and, when appropriate, replace them with dashes. I'm looking for a more efficient method. Any advice from regular expression experts?
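For anyone who would rather script this than run each search by hand, the four steps listed above can be sketched in Python roughly as follows. This is only a sketch under some assumptions: the function name `fix_hyphens` is made up for illustration, whitespace (`\s`) is assumed to be what is intended around the hyphen in steps 1 and 2, and step 1 is loosened to `\s*` so it also matches when nothing separates the two tags. The patterns run over the raw markup, so a hyphen next to a quote mark inside a tag attribute would be changed as well.

```python
import re

EM_DASH = "\u2014"

def fix_hyphens(html: str) -> str:
    """Apply the four search-and-replace steps in order (hypothetical helper)."""
    # Step 1: rejoin a word split across a paragraph break, e.g. "sur-</p>\n<p>prised".
    html = re.sub(r"-</p>\s*<p>", "", html)
    # Step 2: a hyphen with whitespace on at least one side becomes an em dash.
    html = re.sub(r"\s-\s|\s-|-\s", EM_DASH, html)
    # Step 3: quote mark followed by a hyphen -> quote mark + em dash.
    html = re.sub(r"""(["'“‘])-""", r"\1" + EM_DASH, html)
    # Step 4: hyphen followed by a quote mark -> em dash + quote mark.
    html = re.sub(r"""-(["'”’])""", EM_DASH + r"\1", html)
    return html

if __name__ == "__main__":
    sample = "<p>He was sur-</p>\n<p>prised - very.</p>"
    print(fix_hyphens(sample))
```

Ordinary hyphenated words such as "well-known" have no whitespace or quote mark next to the hyphen, so none of the four patterns touches them. Hyphens that stand in for a dash without surrounding spaces still need the manual pass.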
10-04-2011, 10:46 PM   #2
ldolse
There is a dehyphenate function under heuristics that is good at removing only those end-of-line hyphens that should actually be removed, while retaining the others (and still unwrapping lines when required).
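To give a feel for how a dehyphenation heuristic of this kind can decide which hyphens to drop, here is a minimal Python sketch. It is not calibre's actual code; it is one plausible approach, assuming plain text with hard line breaks, and the function name `dehyphenate` is hypothetical. The idea is to join a word broken at a line end only if the joined form also appears somewhere else in the same text.

```python
import re

def dehyphenate(text: str) -> str:
    """Illustrative heuristic only: join 'sur-\\nprise' if 'surprise' occurs elsewhere."""
    # Vocabulary of every alphabetic word seen in the text, lowercased.
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}

    def join(match: re.Match) -> str:
        first, second = match.group(1), match.group(2)
        if (first + second).lower() in words:
            # The joined word is known, so the hyphen was only a line-break artifact.
            return first + second
        # Otherwise keep the hyphen but still unwrap the line.
        return first + "-" + second

    return re.sub(r"([A-Za-z]+)-\n([A-Za-z]+)", join, text)

if __name__ == "__main__":
    sample = "The surprise was a sur-\nprise to the well-\nknown host."
    print(dehyphenate(sample))
```

A word that occurs only once in the book, and only in broken form, would keep its hyphen under this sketch, which is why a real implementation likely needs more than a single check.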

I don't know of a reliable way to determine which single hyphens should be converted back to an em or en dash. The 'smarten punctuation' feature will turn any double hyphen into an em dash.
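The double-hyphen case is easy to express as a regular expression, which is part of why it can be automated while single hyphens cannot. A minimal Python sketch, not the actual 'smarten punctuation' implementation; the function name is hypothetical and the input is assumed to be plain text rather than markup (an HTML comment delimiter such as `<!--` would also match):

```python
import re

def smarten_double_hyphens(text: str) -> str:
    """Replace each run of exactly two hyphens with an em dash (illustrative only)."""
    # The lookarounds skip longer runs such as "----" used as a plain-text rule.
    return re.sub(r"(?<!-)--(?!-)", "\u2014", text)

if __name__ == "__main__":
    print(smarten_double_hyphens("Wait--what did you say?"))
```

Single hyphens are left untouched here, since nothing in the text itself says whether one is a real hyphen or a stand-in for a dash.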