10-03-2011, 08:56 AM   #11
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Maybe it depends on the language, but hyphenated compound words are quite common in English texts. If you turn on full debugging in Calibre, it will actually log which hyphens are removed or retained, which gives you an idea of how frequently it happens. I've also seen numerous cases where an em-dash or en-dash got dumbed down to a regular hyphen. A hyphen is also used as a replacement for an ellipsis in many docs, so removing it would merge sentences together. Numeric strings are other candidates that are often intentionally hyphenated. Line breaking algorithms aggressively leverage existing hyphens, so a hyphen at the end of a line isn't necessarily one that the line break added.

My own preference has always been for a false negative over a false positive. For a while Calibre was removing hyphens wholesale, and I have to say I found that rather annoying, though I do have a perfectionist streak. That's what drove me to add the dictionary approach. My Python skills aren't great, but it was simple logic to add.
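
To give a rough idea of the kind of logic I mean, here is a minimal sketch. It is not the code that's actually in Calibre; the names `dehyphenate` and `known_words` are made up for illustration, and I'm assuming the word list is simply supplied by the caller. A line-end hyphen is only removed when the joined word is in the list, and otherwise the hyphen stays, which is the false negative bias described above.

```python
import re

# Hypothetical illustration only, not Calibre's implementation.
# Matches a word fragment, a hyphen at the end of a line, and the
# fragment that starts the next line.
LINE_END_HYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def dehyphenate(text, known_words):
    """Join line-broken words only when the joined form is a known word."""
    known = {word.lower() for word in known_words}

    def fix(match):
        first, second = match.group(1), match.group(2)
        joined = first + second
        if joined.lower() in known:
            # The joined form is a real word, so treat the hyphen
            # as a line-break artifact and drop it.
            return joined
        # Not sure: keep the hyphen (prefer a false negative).
        return first + "-" + second

    return LINE_END_HYPHEN.sub(fix, text)


sample = "The imple-\nmentation is well-\nknown. Call 555-\n1234."
print(dehyphenate(sample, ["implementation"]))
# -> The implementation is well-known. Call 555-1234.
```

Note that the compound word and the numeric string both keep their hyphens because neither joined form is in the word list. How good the results are depends entirely on where the word list comes from.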