MobileRead Forums - View Single Post - New single-source German hyphenation

JSWolf · 07-12-2025, 12:08 PM

Quote:

Originally Posted by Moonbase59

The problem with these is that you can’t really tell from the end result, because the patterns are highly processed and compressed, and you can’t easily deduce the original words from them. Also, unfortunately, license and date comments are often removed.

Starting out with "what we have in Linux" is never a bad idea. Many years ago I did the same (using igerman, aspell, myspell and later hunspell files), and of course that’s what LibreOffice does. It makes sense to use good building blocks.

The one you’re mentioning doesn’t show where it’s from or which text corpus it was built from. It apparently uses an older-style format and the ISO8859-1 character set (instead of UTF-8).

Unfortunately, I have already overwritten the (Linux system’s) hyph_de_DE.dic that LibreOffice uses nowadays, but I think it was from 2017.

I built mine from a text corpus of almost 600,000 words containing preferred (primary) and secondary hyphenation points, as well as common word beginnings and endings, current as of 2025-07-09. Because reading devices typically run really old software that doesn’t understand the Hyphen 2.7+ NOHYPHEN command, I also created large exception lists so that word boundaries are recognized correctly when characters like punctuation, apostrophes, brackets, invisible non-breaking spaces etc. are directly adjacent to what should be considered a "word". See screenshots above.
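As a rough illustration (not the tooling used to build this dictionary), the header of a Hyphen-style .dic file can be inspected to see which character set it declares and whether it carries a NOHYPHEN line at all. The sketch assumes the usual layout, with the encoding name on the first line followed by optional keyword lines before the patterns start; the function name inspect_header and the keyword list are assumptions made for this example.

```python
import codecs

# Header keywords as I understand the Hyphen dictionary format (assumption).
HEADER_KEYWORDS = (
    "LEFTHYPHENMIN",
    "RIGHTHYPHENMIN",
    "COMPOUNDLEFTHYPHENMIN",
    "COMPOUNDRIGHTHYPHENMIN",
    "NOHYPHEN",
)


def inspect_header(raw: bytes) -> dict:
    """Report the declared charset and header settings of a hyphenation dictionary."""
    lines = raw.splitlines()
    if not lines:
        raise ValueError("empty dictionary")

    # The first line is expected to name the encoding, e.g. UTF-8 or ISO8859-1.
    charset = lines[0].decode("ascii", errors="replace").strip()
    try:
        codec = codecs.lookup(charset).name
    except LookupError:
        codec = "latin-1"  # fallback that can decode any byte sequence

    settings = {}
    for line in lines[1:]:
        text = line.decode(codec, errors="replace").strip()
        if not text or text.startswith("%"):
            continue  # skip blank lines and comments
        keyword, _, value = text.partition(" ")
        if keyword not in HEADER_KEYWORDS:
            break  # first pattern line reached, header is over
        settings[keyword] = value.strip()

    return {
        "charset": charset,
        "settings": settings,
        "has_nohyphen": "NOHYPHEN" in settings,
    }
```

Calling it as inspect_header(open("hyph_de_DE.dic", "rb").read()) would show at a glance whether a given file is an older ISO8859-1 dictionary without NOHYPHEN or a newer UTF-8 one. It only reads the header and says nothing about the quality of the patterns themselves.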

So yes, I expect mine to result in much better hyphenation, and this has proved true on Linux (including LibreOffice, Sigil, et al.) and on the devices I own and could test, a Tolino Vision 5 and a Pocketbook Era.

Since I don’t own a Kobo, I need to know whether my "Kobo version" works correctly before I can release it officially; see the questions in the first post.

My goal is to provide single-sourced, top-quality, uniform hyphenation for most software and e-readers that can use it, all generated from the most current and extremely well-maintained corpora that the German Dante e.V. Trennmustermannschaft provides for use with LaTeX. Many thanks to them for laying such fantastic groundwork!

If it turns out that your dictionary is better on a Kobo than the one I made, I'll remove mine and post a link to yours in the hyphenation thread.