Maybe it depends on the language, but hyphenated compound words are quite common in English texts. If you turn on full debugging in Calibre, it will actually log which hyphens are removed and which are retained, which gives you an idea of how frequently it happens. I've also seen numerous cases where an em-dash or en-dash got dumbed down to a regular hyphen. A hyphen is also used as a replacement for an ellipsis in many docs, so removing it would merge sentences together. Numeric strings are another candidate that is often intentionally hyphenated. Line-breaking algorithms also aggressively leverage existing hyphens, so a legitimate hyphen often ends up at the end of a line.
My own preference has always been for a false negative over a false positive: I'd rather leave a stray hyphen in than remove one that belongs. For a while Calibre was removing hyphens wholesale, and I have to say that was rather annoying to me personally, though I do have a perfectionist streak. That's what drove me to add the dictionary approach. My Python skills aren't great, but it was simple logic to add.
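To make the dictionary idea concrete, here is a rough sketch of how I'd describe it. This is not Calibre's actual code. The names `build_vocabulary` and `dehyphenate` are made up for illustration, and I'm assuming the word list is simply built from the words that appear in the document itself. A line-end hyphen is only removed when the joined word is found in that list; otherwise the hyphen stays, which is the false-negative bias I mentioned.

```python
import re

# A word fragment, a hyphen at the end of a line, then the continuation
# on the next line.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-\s*\n\s*([A-Za-z]+)")


def build_vocabulary(text):
    """Collect every run of letters in the text, lowercased (assumed word list)."""
    return {word.lower() for word in re.findall(r"[A-Za-z]+", text)}


def dehyphenate(text, vocabulary):
    """Join words split across lines, dropping the hyphen only when the
    joined form is a known word. When in doubt, the hyphen is kept."""

    def join(match):
        first, second = match.group(1), match.group(2)
        joined = first + second
        if joined.lower() in vocabulary:
            return joined  # known word: the hyphen was only a line break
        return first + "-" + second  # unknown: treat as a real compound

    return LINE_END_HYPHEN.sub(join, text)


sample = (
    "The implementation is simple.\n"
    "This implemen-\n"
    "tation keeps well-\n"
    "known compounds.\n"
)
cleaned = dehyphenate(sample, build_vocabulary(sample))
print(cleaned)
```

In the sample, "implementation" appears whole elsewhere in the text, so the split copy is rejoined without the hyphen, while "well-known" never appears as "wellknown" and keeps its hyphen. A real word list would need to be larger than one document to catch words that only ever appear split.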