Help with dictionary lookup feature for non-latin scripts (cyrillic)

helrasincke · 10-08-2021, 07:57 AM

Hi all,

I'm working on converting a large monolingual Russian dictionary for use on my old Kindle Touch using the old Mobipocket Creator. So far the html is happy, conversion runs smoothly and everything is displaying well; however, the index lookup function doesn't work, which obviously I'd like to fix so I can actually use the thing. Yes, there are many Russian dictionaries already out there for Kindle, but they seem to almost universally lack detailed stress and inflection information (I am less interested in working inflection tags than in actually having this information in a visible form).

From looking through older posts on this forum I can see that this lack of lookup functionality used to be a big issue for non-latin languages; however, I cannot find anything directly addressing my issue. I have actually been using several Russian dictionaries from the net without lookup problems (my KT runs the final version of the firmware available for that model). The searches seem to employ some sort of transliteration. Does anyone know if this would be coming from the device firmware or the compiler (most dics I've seen are compiled using dsl2mobi)? Unpacking with kindleunpack has so far not given me any new clues as to what I might be missing if it's a compiler-side feature; there are certainly no transliterations in any of the entries. I can't work out why it works for some dictionaries and not for others.

I'll share some of the tags I've used in case this is of relevance. I previously converted a monolingual Danish dictionary (using Kindle Previewer that time) with a fully working lookup. In that I used the following in the head section:

Code:

<reference title="Look Up Word" type="Find" onclick="index_search('', 'Alphabetical lookup', '', 'none')"/>

and in the entries themselves:

Code:

<idx:entry name="headword" scriptable="yes" spell="yes" id="1">

One Russian dictionary which I unpacked as an example used the following shorter variants (which did not work for me either):

Code:

<reference title="Look Up Word" type="Find" onclick="()"/>
[...]
<idx:entry name="headword" scriptable="yes" id="1">

Short of just manually transliterating my orth tag values (not my favoured solution since I don't learn anything, although it'd be a piece of cake with regex since Russian has no real digraphs to speak of), what other approaches might I try? Is there something really obvious here that I am missing or should I just give up and go back to using goldendict on my laptop?
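
To illustrate the transliteration route mentioned above, here is a minimal Python sketch. The mapping table and the function name `transliterate` are assumptions made for illustration only; whatever scheme dsl2mobi or the Kindle firmware actually uses may differ. It also assumes stress is marked with the combining acute accent (U+0301), which is stripped before mapping.

```python
# Hypothetical one-to-one(ish) mapping; NOT necessarily the scheme used by
# dsl2mobi or by the Kindle firmware.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}
_TABLE = str.maketrans(CYR_TO_LAT)


def transliterate(word: str) -> str:
    """Return a latin-keyboard-friendly form of a cyrillic headword."""
    # Drop combining acute stress marks, lowercase, then map letter by letter.
    plain = word.replace("\u0301", "").lower()
    return plain.translate(_TABLE)


print(transliterate("сня\u0301ться"))
print(transliterate("заниматься"))
```

Characters not in the table (hyphens, spaces, latin letters) pass through unchanged, so multi-word headwords would need a decision about how spaces should be indexed.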

I'd be most appreciative of any pointers.

Cheers

Doitsu · 10-08-2021, 10:08 AM

If you upgrade your Kindle Touch 4 to at least a Paperwhite 2 and register it, you can download the following Russian dictionaries for free:

ABBYY Lingvo Большой Русско-Английский Словарь (Russian->English)
ABBYY Lingvo Большой Англо-Русский Словарь (English->Russian)
ABBYY Lingvo Большой Толковый Словарь Русского Языка (Russian<->Russian)

If you really want to convert your file, have a look at the sample dictionary source file in the .zip file. (It only contains inflections for слово and книга.)

To generate the dictionary, use the following command line:

Code:

kindlegen.exe russian_dict.epub -dont_append_source

helrasincke · 10-08-2021, 05:44 PM

Thank you for your quick response, Doitsu. As I mentioned in my initial post, I was hoping to gain some insight into the problem itself.

Incidentally, I did actually buy a Paperwhite 4 at the start of the year, partly for the Russian dictionaries. Unfortunately these only indicate stress in the basic form (headword), as you can see. That is, they do not show the minimum information necessary to determine shifting stress or indeed ambiguous or unusual inflections for newly encountered words. It is true that nominal inflections are occasionally given in the illustrative examples; however, this is by no means systematic. Verbal inflections are rare in the examples and usually only give one of an aspectual pair (which are grouped together for the most part). Of all the dictionaries I have tested, the otherwise very good Smirnitsky Ru-En dictionary showed the most inflection information, but still no stress. One version I found online of the Ru-En Lingvo Universal dictionary showed decent stress information, but then only patchy inflection (obviously more targeted to Russian speakers learning English)... Unfortunately this renders these dictionaries less useful to my purpose than even Wiktionary (which in any case I do not have in a conversion-ready format).

Happily, I already found a file for the Малый академический словарь showing both pieces of information and covering a sufficiently wide range of vocabulary. It was a breeze to format that into conversion-ready html and everything works until the lookup problem on the KT (works fine on PW, see below why that doesn't help). To clarify, the lookup shows a list of words (in cyrillic as opposed to the expected latin transliteration), but they do not change with further input, although the list does vary in relation to the initial input letter (but it is not an obvious relation).

At the end of the day I actually vastly prefer the UI experience on the KT funnily enough (word lookup, highlighting, pop-up menus). Especially irritating is the PW lookup, which has a short delay after the initial keystroke in which the keyboard momentarily disappears, meaning a search "заниматься" will be entered "зниматься" or even "зиматься". If it wasn't for the higher DPI enabling me to now read fullscreen PDFs, I would have sold the device again and just kept the KT. As my KT does not handle PDFs well, it has become my dictionary device, which is why it also does not help me that my file works fine on my PW lookup.

For the reasons outlined above and in my initial post, I'd like to get the KT lookup to work on my own dictionary. I was hoping to shed some light on why the lookup fails for my user-generated dictionary when it clearly works for many other user-generated dictionaries on the same device. I know it can be done; I'd like to understand how. I was curious if anyone had any ideas for a more elegant solution than my transliteration idea, or if that is really how it's done?

Cheers,

Doitsu · 10-09-2021, 02:37 PM

Quote:

Originally Posted by helrasincke

I was curious if anyone had any ideas for a more elegant solution than my transliteration idea, or if that is really how it's done?

After I wrote my initial reply, I also attached the source code for a Russian-English dictionary with two entries with working inflections.
Have you already looked at it and tested it?

helrasincke · 10-14-2021, 07:02 AM

Quote:

Originally Posted by Doitsu

After I wrote my initial reply, I also attached the source code for a Russian-English dictionary with two entries with working inflections.
Have you already looked at it and tested it?

My apologies that I did not see this earlier. I'm really very grateful to you for putting that sample together; however, looking through the file, it seems there may have been a misunderstanding about what I was hoping to achieve. I don't have any issues generating the file or getting it recognised as a dictionary. I even agree the inflection tags can be handy, but I've spent a fair amount of time experimenting with various dictionaries, and have found that, at least for my usage, they're just not worth the extra effort to collate. Since I know or can guess at basic forms in the languages I read and have the luxury of a spare device anyway, I prefer to use the KT as a stand-alone dictionary and search via index lookup. All of the necessary inflection information (e.g. <b>сня́ться:</b> сниму́сь, сни́мешься, etc.) is displayed visually as part of the entry itself. This is enough for my purpose.

Having more or less solved the problem now at this point, I'll summarise the findings in case it is of interest to anyone out there. The following approaches showed varying levels of success:

• Compiling with only the cyrillic forms works great on PW with the Russian keyboard enabled, but obviously can't be accessed via keyboard lookup on stock KT.
• Compiling with fully transliterated index forms works on KT, but then I can't use the more convenient cyrillic keyboard on PW. It's an edge case for me admittedly, but after all the effort, I want it to be as forward compatible as possible.
• Finally, coming back to my idea of the dual-form index (trying cyrillic headword with transliterations in inflection tags): the index would just default to cyrillic even with latin key input, and I just couldn't get it working... well, until I decided to throw in the towel and use the dsl2mobi script to see what it spat out, since I could see they were essentially going for the same approach (plus, it generates the inflection tags for free!). It turns out the opposite order, using the transliterated headword, does work. And the key piece of the puzzle seems to be using the cyrillic headword as the unique entry id.

Here is my original approach with dual scripts, in which only the cyrillic index works:

Code:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns:idx="www.mobipocket.com" xmlns:mbp="www.mobipocket.com" xmlns:xlink="http://www.w3.org/1999/xlink">
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
  </head>
<body>
<mbp:frameset>

<idx:entry name="headword" scriptable="yes" id="66154">
  <idx:short><a id="66154"></a>
    <idx:orth value="сняться">
      <idx:infl>
        <idx:iform value="snyatsya"/>
      </idx:infl>
    </idx:orth>
      <b>сня́ться </b>
        <div>сниму́сь, сни́мешься; <i>прош.</i> сня́лся, -ла́сь, -ло́сь; <i>сов.</i> (<i>несов.</i> снима́ться). [...]</div>
  </idx:short>
</idx:entry>
<hr>

</mbp:frameset>
</body>
</html>

And here is the dsl2mobi output, which handles either index:

Code:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns:idx="www.mobipocket.com" xmlns:mbp="www.mobipocket.com" xmlns:xlink="http://www.w3.org/1999/xlink">
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
  </head>
<body>
<mbp:frameset>

<a name="#сняться"/>
  <idx:entry name="headword" scriptable="yes">
    <idx:orth>
      <b>сня́ться</b>
    </idx:orth>
      <idx:orth value="snyatsya"/>
        <div>сниму́сь, сни́мешься; <i>прош.</i> сня́лся, -ла́сь, -ло́сь; <i>сов.</i> (<i>несов.</i> снима́ться). [...]</div>
  </idx:entry>
<hr>

</mbp:frameset>
</body>
</html>
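
For anyone who would rather script this layout by hand, here is a minimal Python sketch that emits one entry in the same shape as the dsl2mobi output above. It is not dsl2mobi's own code; the function name `build_entry` and its parameters are hypothetical, and it assumes the transliteration has already been produced separately.

```python
from html import escape


def build_entry(headword: str, stressed: str, translit: str, body_html: str) -> str:
    """Return one entry laid out like the dsl2mobi output shown above.

    headword:  plain cyrillic form, used as the anchor name
    stressed:  display form with stress marks
    translit:  latin transliteration, used as the searchable orth value
    body_html: already-formatted entry body, inserted verbatim
    """
    return "\n".join([
        f'<a name="#{escape(headword)}"/>',
        '  <idx:entry name="headword" scriptable="yes">',
        '    <idx:orth>',
        f'      <b>{escape(stressed)}</b>',
        '    </idx:orth>',
        f'    <idx:orth value="{escape(translit)}"/>',
        f'    <div>{body_html}</div>',
        '  </idx:entry>',
        '<hr>',
    ])


entry = build_entry(
    "сняться",
    "сня\u0301ться",
    "snyatsya",
    "сниму\u0301сь, сни\u0301мешься [...]",
)
print(entry)
```

The two `idx:orth` elements are kept in the same order as in the output above (cyrillic display form first, transliterated value second), since that ordering is what worked on the KT in my testing.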

So there you have it, I managed to solve the problem of dual script lookup. And what's the lesson? Next time I'll just use the script...
