Old 01-14-2015, 06:16 PM   #211
jveth
Junior Member
 
Posts: 2
Karma: 10
Join Date: Jan 2015
Location: Hungary
Device: Kindle5
Incorrect non-standard hyphenation, e.g. Hungarian

Hi Saulus, thank you very much for this plugin. I am very glad I found this.
However, the plugin unfortunately does not support non-standard hyphenation, although the recent LibreOffice dictionaries do.
This results in incorrect soft hyphenation for me as well, as others have already reported in posts
#101 by mattheo
#141 by karakai
#207 by imaginer

I am a bit surprised that only Hungarians report such problems, since, according to the very nice documentation in tb87nemeth.pdf, many more languages are affected, including (although perhaps less frequently than in Hungarian) English and German.

Anyway, for this plugin there does not seem to have been any progress on non-standard hyphenation since #101 (May 2003).
In order to try to help you (and us Hungarians as well), I decided to dig into this a bit more.
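To make clear what "non-standard" means here, a tiny self-contained sketch (plain Python, no hyphenation module needed): in standard hyphenation only a hyphen is inserted, while in non-standard hyphenation the letters around the break change, so removing the hyphen does not restore the original word. The example word is 'rosszabb' from the reports below.

```python
# Standard hyphenation: letters unchanged, only a hyphen is inserted,
# so stripping the hyphen gives back the original word.
assert u'al-ma'.replace(u'-', u'') == u'alma'

# Non-standard hyphenation (Hungarian long digraph): 'rosszabb' is correctly
# hyphenated as 'rosz-szabb', i.e. the spelling changes at the break point.
# Stripping the hyphen does NOT restore the original word.
assert u'rosz-szabb'.replace(u'-', u'') == u'roszszabb'
assert u'rosz-szabb'.replace(u'-', u'') != u'rosszabb'
```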


1) While the pyhyphen-2.0.5 module works as expected, the hyphenator-0.5.1 module cannot handle the non-standard cases.

Although I do not know Python, I still succeeded in constructing the following simple test script:

Code:
from hyphen import Hyphenator as Hypy2				# pyhyphen-2.0.5
h2hu = Hypy2('hu_HU')
def hypins2hu(p): return '-'.join(h2hu.syllables(p))

from zzhyphenator import Hyphenator as Hyp0			# hyphenator-0.5.1 (Berendsen,2008)
h0hu = Hyp0('/usr/share/hyphen/hyph_hu_HU.dic')
hypins0hu = h0hu.inserted

def printithu(zz, txt):
    print '2.0.5: ' , hypins2hu(zz) , ' --- ' , '"' + zz + '", as reported by' , txt
    print '0.5.1: ' , hypins0hu(zz)
    return

printithu(u'valamennyit',	'mattheo #101')
printithu(u'valamennyi',	'mattheo #101')
printithu(u'rosszabb',		'mattheo #101')

printithu(u'poggyászomban',	'imaginer #207')
Running the test only for the reported cases, without adding my own findings:

Quote:
2.0.5: va-la-meny-nyit --- "valamennyit", as reported by mattheo #101
0.5.1: va-la-meny-nyt
2.0.5: va-la-meny-nyi --- "valamennyi", as reported by mattheo #101
0.5.1: va-la-meny-ny
2.0.5: rosz-szabb --- "rosszabb", as reported by mattheo #101
0.5.1: rosz-zabb
2.0.5: pogy-gyá-szom-ban --- "poggyászomban", as reported by imaginer #207
0.5.1: pogy-yá-szom-ban
clearly show that the soft-hyphenated words produced by 0.5.1 are corrupted. They differ both from the expected, correctly hyphenated form and from the unhyphenated original. Worse, after the soft hyphens are removed again, which happens at every point where the reflowed text needs no actual hyphen, the words are no longer identical to the original.

2) Unfortunately, even with a correct hyphenator, the non-standard break points cannot be used for SHY insertion. I hope this is obvious from the above examples.
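The point can be demonstrated without any hyphenation module at all; a minimal self-contained sketch (SHY = U+00AD, the soft hyphen): a SHY only hides a potential break, it cannot change the letters around it, so it is only safe at standard break points.

```python
SHY = u'\u00ad'

def strip_shy(s):
    # what effectively happens wherever the reflowed text does not break at the SHY
    return s.replace(SHY, u'')

word = u'valamennyit'

# Standard break point: only a SHY is inserted, so stripping it
# restores the original word exactly.
standard = u'va' + SHY + u'lamennyit'
assert strip_shy(standard) == word

# Non-standard break point: the correct hyphenated form is 'va-la-meny-nyit'
# ('nny' becomes 'ny-ny'), i.e. the spelling changes at the break.
# Written with SHYs, the word is corrupted as soon as the SHYs are stripped.
nonstandard = u'va' + SHY + u'la' + SHY + u'meny' + SHY + u'nyit'
assert strip_shy(nonstandard) == u'valamenynyit'
assert strip_shy(nonstandard) != word
```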

Still handicapped by my not knowing Python, I changed hjob.py in the plugin as follows:

Code:
                for w in wlist:
                    if len(w) >= min_len and u'-' not in w:
                        ww = w
                        for hh in h.iterate(ww):
                            #print 'Hyphenator hint: ' , hh[0] , '-' , hh[1]    # trace
                            if hh[0] + hh[1] == ww:
                                #w = hh[0] + '-' + w[len(hh[0]):]               # *** TEST *** see all possibilities
                                w = hh[0] + u'\u00AD' + w[len(hh[0]):]
                    newt += w
This now seems to do a much better job.
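The filtering idea can also be tried outside the plugin. A hedged, self-contained sketch: fake_iterate below is an invented stand-in for h.iterate(), returning (head, tail) pairs with the longest head first (the order the in-place SHY splicing relies on); the first pair models a non-standard break, where head + tail differs from the word, and is therefore skipped.

```python
SHY = u'\u00ad'

def fake_iterate(word):
    # Invented stand-in for h.iterate() on u'valamennyit': (head, tail)
    # pairs, longest head first.  The first pair models a non-standard
    # break point, where head + tail != word.
    return [(u'valameny', u'nyit'), (u'vala', u'mennyit'), (u'va', u'lamennyit')]

def soft_hyphenate(word, iterate):
    w = word
    for hh in iterate(word):
        if hh[0] + hh[1] == word:              # keep only standard break points
            w = hh[0] + SHY + w[len(hh[0]):]   # splice a SHY into the growing w
    return w

result = soft_hyphenate(u'valamennyit', fake_iterate)
# the non-standard pair was skipped, so stripping the SHYs is lossless
assert result == u'va' + SHY + u'la' + SHY + u'mennyit'
assert result.replace(SHY, u'') == u'valamennyit'
```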

3) My first thought was, of course, to use the current pyhyphen module itself. However, when I tried to bundle the pyhyphen module installed on my system into the plugin, I had to admit that, without sufficient Python knowledge, this was beyond me.

So the next thing for me was to check whether, ignoring the non-standard cases, the old hyphenator still works well with current dictionaries.
Conclusion: using the 100,000-word sample from the Hungarian gigaword corpus referred to in tb87nemeth.pdf, I found that the old module still seems OK at the moment. That is, all standard hyphenation points are found identically by both 0.5.1 and 2.0.5.

Last edited by jveth; 01-30-2015 at 07:20 PM.