Ah, I read JS's comment as another vote for preferring false negatives over false positives. Using the document itself as a dictionary can't guarantee that you'll remove every hyphen that should be removed, but it's an excellent technique to ensure that all the ones that are supposed to stay will stay.
Implementing proper multi-language stemming and adding an optional external dictionary would increase the detection rate even more, but it's debatable whether that's worth the effort.
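For anyone who wants to see the idea concretely, here is a minimal sketch of the document-as-dictionary approach. It is not anyone's actual implementation: the function name `dehyphenate` and the two regexes are my own assumptions, and it only handles the simple case where the hyphen sits directly before the line break. A line-end hyphen is dropped only if the joined word also appears unbroken somewhere else in the same text; otherwise the hyphen is kept.

```python
import re

WORD = r"[^\W\d_]+"  # a run of Unicode letters (no digits, no underscore)
# A word fragment, a hyphen at the end of a line, then the next fragment.
BROKEN = re.compile(rf"({WORD})-\n\s*({WORD})")


def dehyphenate(text: str) -> str:
    """Hypothetical helper: rejoin words split across lines, conservatively."""
    # Vocabulary: every word seen in the document, ignoring the fragments
    # that are part of a line-end hyphenation themselves.
    vocab = {w.lower() for w in re.findall(WORD, BROKEN.sub(" ", text))}

    def join(match: re.Match) -> str:
        head, tail = match.group(1), match.group(2)
        if (head + tail).lower() in vocab:
            # The joined form occurs elsewhere, so the hyphen was a line break.
            return head + tail
        # No evidence: keep the hyphen (a possible false negative).
        return f"{head}-{tail}"

    # Note: the line break itself is removed in both cases.
    return BROKEN.sub(join, text)


sample = (
    "This is an example of a well-known prob-\n"
    "lem. One exam-\n"
    "ple is a well-\n"
    "known one."
)
result = dehyphenate(sample)
```

In the sample, "exam-ple" is rejoined because "example" appears elsewhere, "well-known" keeps its hyphen because "wellknown" never occurs, and "prob-lem" is left hyphenated because the document offers no evidence either way. That last case is exactly the kind of false negative the approach accepts.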