I used some of my vacation time to dive further into this topic, now with the help of Copilot Pro and a Ghidra MCP server. (GPT 5.4 is my new favorite for reverse engineering.) These are the highlights:
HTML support is quite good. Dictionary definitions are rendered using Qt 6 text widgets. The supported tags and inline styles are documented here:
https://doc.qt.io/qt-6/richtext-html-subset.html. Even images can be rendered, it turns out. There's one caveat: the e-reader converts literal newlines (\n) to <br> for backward compatibility, so it's crucial to remove them when building the dictionary to avoid extra whitespace.
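To illustrate the newline caveat, here's a minimal sketch of how a definition could be cleaned up before it goes into the dictionary. The function name flatten_definition is made up for this example and is not part of any tool mentioned here; it simply collapses every whitespace run containing a newline into a single space, which assumes the definition doesn't rely on preformatted whitespace.

```python
import re


def flatten_definition(html: str) -> str:
    """Remove literal newlines so the reader has nothing to turn into <br>.

    Hypothetical helper for illustration only. Each whitespace run that
    contains a newline is collapsed into a single space, so words on
    adjacent lines stay separated.
    """
    return re.sub(r"\s*[\r\n]+\s*", " ", html).strip()


if __name__ == "__main__":
    raw = "<p>first sense</p>\n<p>second sense</p>\n"
    print(flatten_definition(raw))
```

Intentional line breaks then have to be written as explicit <br> tags in the source.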
The morphemes section (morphems.txt) is completely ignored. There are several stemming engines:
- Hunspell is the preferred one, provided the language package is installed
- if not,
https://github.com/Blake-Madden/OleanderStemmingLibrary is used for supported languages (see its readme for the list)
- there's also this obscure library, mostly for Eastern European languages:
https://github.com/izacus/SlovenianLemmatizer (see /ebrmain/config/lemmagen)
Now you might be wondering how the source language is determined. First, there are dictionaries that are downloaded from PocketBook servers. The metadata for them is stored in /mnt/ext1/system/pbdicts/pbdicts.db (SQLite). Second, there's an optional JSON metadata section, which is practically never used. Finally, there's a fallback which looks at the first line of keyboard.txt, finds the two letters before ':' and treats them as the locale, e.g. "EN: English" -> EN. So it's mighty important to bake in the correct keyboard.txt if you want morphology to work properly.
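The keyboard.txt fallback can be sketched roughly like this. This is my reading of the behavior described above, not the firmware's actual code: the function name is hypothetical, and the handling of edge cases (no colon, fewer than two characters before it, non-ASCII characters) is an assumption.

```python
from typing import Optional


def locale_from_keyboard(first_line: str) -> Optional[str]:
    """Guess the locale from the first line of keyboard.txt.

    Hypothetical re-implementation of the fallback: take the two
    characters right before the first ':' and treat them as the locale.
    """
    colon = first_line.find(":")
    if colon < 2:
        return None  # no colon, or fewer than two characters before it
    code = first_line[colon - 2:colon]
    # Assumption: anything other than two ASCII letters is rejected.
    if code.isascii() and code.isalpha():
        return code
    return None


if __name__ == "__main__":
    print(locale_from_keyboard("EN: English"))
```

With the first line "EN: English" this yields "EN"; a first line without a usable two-letter prefix yields None, in which case no source language would be found through this fallback.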
The optional JSON metadata section allows fancy rendering of dictionary info: it has fields "name", "localeFrom", "localeTo", "description", "provider" (aka publisher/issuer, e.g. Wiktionary), "category" (e.g. universal), and a few more fields that don't seem to be used anywhere ("version", "specialProject", "set").
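For reference, a metadata section using the fields above might be assembled like this. Only the field names come from what I found; all values are made up, and the exact locale format the firmware expects (case, two-letter vs. longer codes) is an assumption.

```python
import json

# Field names are the ones listed above; every value is illustrative.
metadata = {
    "name": "Example English-German Dictionary",
    "localeFrom": "en",  # assumed format: two-letter language code
    "localeTo": "de",
    "description": "Illustrative description shown in the dictionary info.",
    "provider": "Wiktionary",
    "category": "universal",
}

# Serialize to the JSON text that would go into the metadata section.
metadata_json = json.dumps(metadata, ensure_ascii=False)

if __name__ == "__main__":
    print(metadata_json)
```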
____________________________________
The CLI tool and the WASM UI (
https://datyoma.codeberg.page/pbdt/v1/) now support:
- merging multiple input dictionaries when converting to .dic
- reading and writing JSON metadata section
I also removed the legacy conversion of <i></i> and <b></b> tags to binary that converter.exe performs. Literal newlines are now stripped when writing and converted to <br> when reading (see above), so input XDXF must use <br> for line breaks.
Apart from that, PyGlossary now has a PocketBook output plugin (also HTML-native, so to say):
https://github.com/ilius/pyglossary/pull/708. PyGlossary supports lots of input formats, and is much easier to discover than this thread.