Old 09-25-2020, 12:00 AM   #47
geek1011
Wizard
Posts: 2,804
Karma: 7025947
Join Date: May 2016
Location: Ontario, Canada
Device: Kobo Mini, Aura Edition 2 v1, Clara HD
Quote:
Originally Posted by davidfor
Reading the above, I'm not actually sure what to expect. From this, I think you are saying that there are definitions for both "go" and "went" and a redirect from "went" to "go". I think that you are then saying if I look up "go" I should see the definitions of the words "go". And if I look up "went", I should see the redirect to "go" plus whatever is there for "went". Is that correct? But we are only seeing the definitions for "go" and not the other definitions for "went".

Is that correct? Do you have any other examples? Or maybe a test dictionary that just has a few examples of this.
That's almost correct. One correction: if you look up "went", you should see the "went" definitions, but not the "go" ones (since exact matches take priority over variants). Note that this is a test case I came up with, not a real example from the official English dictionary.

---

A simple example:

bug.df
Spoiler:
Code:
@ go
& went
go & went #1

@ go
& went
go & went #2

@ went
went #1

@ went
went #2

@ test1
& test2
test1 & test2

@ test3
& test2

test3 & test2


See here for the dictfile format. TLDR: @ starts a new entry, & adds a variant, and the lines after it are the body in markdown.
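
To make the TLDR concrete, here is a minimal sketch of a parser for just the three things mentioned above (@, &, and body lines). It is not dictutil's parser; the Entry class and parse_dictfile function are names made up for this illustration, and anything else the dictfile format supports is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    headword: str
    variants: list[str] = field(default_factory=list)
    body: str = ""


def parse_dictfile(text: str) -> list[Entry]:
    """Parse only '@ headword', '& variant' and body lines (simplified sketch)."""
    entries: list[Entry] = []
    body_lines: list[str] = []
    in_header = False  # True while still reading '&' lines right after '@'

    def flush() -> None:
        # Attach the collected body lines to the most recent entry.
        if entries:
            entries[-1].body = "\n".join(body_lines).strip()
        body_lines.clear()

    for line in text.splitlines():
        if line.startswith("@ "):
            flush()
            entries.append(Entry(headword=line[2:].strip()))
            in_header = True
        elif in_header and line.startswith("& "):
            entries[-1].variants.append(line[2:].strip())
        elif entries:
            # Anything else after a header belongs to the body.
            in_header = False
            body_lines.append(line)
    flush()
    return entries


SAMPLE = """\
@ go
& went
go & went #1

@ went
went #1

@ test3
& test2

test3 & test2
"""

for entry in parse_dictfile(SAMPLE):
    print(entry.headword, entry.variants, repr(entry.body))
```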

dictgen --v3-prefix-exceptions --output dicthtml-bug-v3.zip bug.df
- This will generate a v3 dictionary from the dictutil dictfile.
- Note that the --v3-prefix-exceptions option hasn't been pushed to GitHub yet, but I can give you the code if you want.
- dictutil will add a prefix exception for the latest occurrence of each variant back to the headword if the prefixes do not match.
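
The rule in the last bullet can be sketched roughly as follows. This is an illustration only, not the dictutil code: simple_prefix is a hypothetical stand-in that just takes the first two lowercased characters (the real prefix rules are more involved), and prefix_exceptions is a name made up for this example.

```python
def simple_prefix(word: str) -> str:
    # ASSUMPTION: simplified stand-in for the real dicthtml prefix rules.
    return word.strip().lower()[:2]


def prefix_exceptions(entries: list[tuple[str, list[str]]]) -> dict[str, str]:
    """Map each variant to its headword when their prefixes differ."""
    exceptions: dict[str, str] = {}
    for headword, variants in entries:
        for variant in variants:
            if simple_prefix(variant) != simple_prefix(headword):
                # A later entry with the same variant overwrites the earlier
                # one, so the latest occurrence wins.
                exceptions[variant] = headword
    return exceptions


# The entries from bug.df as (headword, variants) pairs.
ENTRIES = [
    ("go", ["went"]),
    ("go", ["went"]),
    ("went", []),
    ("went", []),
    ("test1", ["test2"]),
    ("test3", ["test2"]),
]

print(prefix_exceptions(ENTRIES))  # {'went': 'go'}
```

Under this simplified prefix, "test2" shares a prefix with both "test1" and "test3", so only "went" gets an exception.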

dictutil unpack dicthtml-bug-v3.zip && zip -r dicthtml-bug-v3.zip dicthtml-bug-v3

dictgen --output dicthtml-bug-v2.zip bug.df
- This will generate a v2 dictionary from the dictutil dictfile.
- This includes the workaround I implemented in dictutil where variants which do not match the headword prefix will be duplicated into the html file for that variant's prefix.

dictutil unpack dicthtml-bug-v2.zip && zip -r dicthtml-bug-v2.zip dicthtml-bug-v2

Copy dicthtml-bug-v{2,3}.zip to .kobo/custom-dict on 15672+, and try searching "go", "went", "test1", "test2", and "test3" for each dictionary and compare the output.

The output for "test1" and "test3" is equivalent and correct for both versions (each respective word will display on its own).

The output for "test2" is also equivalent and correct for both versions (both "test1" and "test3" will be displayed). Note that if a prefix_exception were added mapping "test2" to "test3", only "test3" would be displayed, even though the original word was "test2" and "test1" has a variant of "test2" (because the word from the exception is treated as if it were the original search query).

The output for "go" is equivalent and correct for both versions (the two entries for the word will display consecutively).

The output for "went" is correct on v2, but not on v3. On v2, it will display the two entries for "went", but not the "go" entries (which is expected since an exact match takes precedence over a variant). On v3, it will display the "go" entries only due to the redirect.
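
The results above can be reproduced with a small toy model. This is only a sketch of the behaviour as described in this post, not Kobo's or dictutil's actual code: simple_prefix (first two characters), build_files, search_file, lookup_v2 and lookup_v3 are all hypothetical names, and the v2 model includes the workaround of duplicating an entry into the file for each variant's prefix.

```python
Entry = tuple[str, tuple[str, ...], str]  # (headword, variants, body)

ENTRIES: list[Entry] = [
    ("go", ("went",), "go & went #1"),
    ("go", ("went",), "go & went #2"),
    ("went", (), "went #1"),
    ("went", (), "went #2"),
    ("test1", ("test2",), "test1 & test2"),
    ("test3", ("test2",), "test3 & test2"),
]


def simple_prefix(word: str) -> str:
    # ASSUMPTION: simplified stand-in for the real prefix rules.
    return word.strip().lower()[:2]


def build_files(entries: list[Entry], duplicate_variants: bool) -> dict[str, list[Entry]]:
    """Group entries into per-prefix 'html files'."""
    files: dict[str, list[Entry]] = {}
    for entry in entries:
        headword, variants, _ = entry
        prefixes = [simple_prefix(headword)]
        if duplicate_variants:  # the v2 workaround
            for variant in variants:
                prefix = simple_prefix(variant)
                if prefix not in prefixes:
                    prefixes.append(prefix)
        for prefix in prefixes:
            files.setdefault(prefix, []).append(entry)
    return files


def search_file(files: dict[str, list[Entry]], word: str) -> list[str]:
    """Exact headword matches take priority over variant matches."""
    candidates = files.get(simple_prefix(word), [])
    exact = [body for headword, _, body in candidates if headword == word]
    if exact:
        return exact
    return [body for _, variants, body in candidates if word in variants]


def lookup_v2(word: str) -> list[str]:
    return search_file(build_files(ENTRIES, duplicate_variants=True), word)


def lookup_v3(word: str, exceptions: dict[str, str]) -> list[str]:
    # The exception target replaces the query entirely (acts as a redirect).
    files = build_files(ENTRIES, duplicate_variants=False)
    return search_file(files, exceptions.get(word, word))


print(lookup_v2("went"))                    # ['went #1', 'went #2']
print(lookup_v3("went", {"went": "go"}))    # ['go & went #1', 'go & went #2']
```

In this model, passing {"test2": "test3"} as the exceptions makes a v3 lookup of "test2" return only the "test3" entry, which matches the note above.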

There are a few other things related to this, but I didn't include them for the sake of brevity.

Note that most of these issues could be considered the fault of the dictionary creator, as it's possible to work around them by not generating prefix_exceptions for cases like this (since they would be treated correctly without them). The remaining cases can be worked around by duplicating entries, but that basically eliminates any advantage of having prefix_exceptions.

Also note that this behaviour would make more sense if we consider prefix_exceptions to be a list of redirects rather than exceptions. Personally, I think this feature would have been more useful as a list of additional prefixes to search for a headword in, but it's too late for that to be changed (and has limited benefit for Kobo's own use case).

A possibility for changing the behaviour of this would be to query the original html file too. Then, the most specific match could be returned (an exact headword match in the original, else an exact variant match in the new one, etc). Alternatively, entries from both could be returned, but this would not be fully backwards-compatible with the v2 behaviour (specifically with entries duplicated into multiple html files as a workaround for the actual bug with variants in v2).
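
As a rough sketch of the first option (most specific match wins), under my own reading of the fallback order: an exact headword match in the original file, then a variant match for the original word in the redirected file, then a variant match in the original file. The names FILES, EXCEPTIONS, simple_prefix and lookup_most_specific are hypothetical, and the two-character prefix is a simplification.

```python
Entry = tuple[str, tuple[str, ...], str]  # (headword, variants, body)

# Per-prefix files without any duplicated entries (v3 style).
FILES: dict[str, list[Entry]] = {
    "go": [("go", ("went",), "go & went #1"), ("go", ("went",), "go & went #2")],
    "we": [("went", (), "went #1"), ("went", (), "went #2")],
    "te": [("test1", ("test2",), "test1 & test2"), ("test3", ("test2",), "test3 & test2")],
}
EXCEPTIONS = {"went": "go"}


def simple_prefix(word: str) -> str:
    # ASSUMPTION: simplified stand-in for the real prefix rules.
    return word.strip().lower()[:2]


def lookup_most_specific(
    word: str,
    files: dict[str, list[Entry]],
    exceptions: dict[str, str],
) -> list[str]:
    original = files.get(simple_prefix(word), [])

    # 1. Exact headword match in the file for the word's own prefix.
    exact = [body for headword, _, body in original if headword == word]
    if exact:
        return exact

    # 2. Variant match for the original word in the redirected file.
    target = exceptions.get(word)
    if target is not None:
        redirected = files.get(simple_prefix(target), [])
        via_redirect = [body for _, variants, body in redirected if word in variants]
        if via_redirect:
            return via_redirect

    # 3. Variant match in the original file.
    return [body for _, variants, body in original if word in variants]


print(lookup_most_specific("went", FILES, EXCEPTIONS))   # ['went #1', 'went #2']
print(lookup_most_specific("test2", FILES, EXCEPTIONS))  # ['test1 & test2', 'test3 & test2']
```

With this ordering, "went" resolves to its own entries when they exist, and still falls back to the "go" entries through the exception if the "we" file has no "went" headword.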

After some thought, I don't think Kobo's official dictionaries are likely to run into this issue unless they either:
- Make redirects for all variants, even if they are in the same file.
- Redirect to a variant which isn't a headword anywhere.
- A few other small possibilities.

The behaviour I'm describing here is more of an inconsistency between v2/v3 than a bug. I don't expect it to be fixed, and I wouldn't even recommend that at this point. I am mainly looking at this so I am aware of the handling of edge cases.

I will be able to take a more definite stance on this once I see the new official dictionaries in October and how they make use of prefix_exceptions. I won't be releasing the dictutil v3 dictionary support until I see the official dictionaries so I can match their behaviour as closely as possible.

Edit: According to @davidfor, the behaviour has been improved in 15676. I'll look at those changes later this week and post in the other thread.
Attached Files
File Type: zip dicthtml-bug-v2.unpacked.zip (1.2 KB, 391 views)
File Type: zip dicthtml-bug-v2.zip (1.1 KB, 387 views)
File Type: zip dicthtml-bug-v3.zip (1.4 KB, 393 views)
File Type: zip dicthtml-bug-v3.unpacked.zip (1.3 KB, 408 views)

Last edited by geek1011; 09-25-2020 at 01:39 AM.