Pocketbook dictionary format revisited - Page 27

nhedgehog · 03-07-2026, 06:06 AM

Very cool, thanks for the effort and sharing it!

rkomar · 03-07-2026, 10:37 AM

Nice work, @datyoma! I'm a Linux guy, and it was a pain running the PocketBook dictionary converter in a Windows VM. It will be nice to have a native tool for that.

Markismus · 03-07-2026, 11:54 AM

Nice!

datyoma · 03-07-2026, 04:19 PM

@rkomar So am I. The only relatively sane way I found to run converter.exe was to build a Docker image with wine32 in it.

Code:

FROM debian:bookworm-slim

RUN dpkg --add-architecture i386 && \
    apt-get update && apt-get install -y --no-install-recommends wine wine32

WORKDIR /data
ENTRYPOINT ["wine", "converter.exe"]

So in case there's an issue, the debugging workflow is as follows:
- run the original converter (docker run -v "$(pwd)":/data pb-converter input.xdxf meta/) and rename the output file to input.orig.dic; converter.exe has to sit in the mounted directory, because the image itself doesn't contain it
- run the new converter (./pbdt convert input.xdxf --meta-dir=meta/)
- run ./pbdt show sdic index on both files
- run ./pbdt show sdic block <file.dic> <offset> on two similar blocks (it lists the keys) and feed the output into vimdiff or the like to compare block contents
- finally, there's the ./pbdt lookup --debug command for checking raw definition bytes

The converted files are very similar, but usually not byte-for-byte identical due to:
- non-stable sorting of words with the same collated key in the original converter
- minor differences in selecting block boundaries
- treatment of special XML characters (I didn't pay much attention to this)
- bugs not yet discovered
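
Since ties between equal collated keys can come out in a different order, a plain line diff of two key listings tends to be noisy. Here is a minimal sketch of an order-insensitive comparison. It assumes the key listings were saved as text files with one key per line, which is a guess about the output of ./pbdt show sdic block and not something stated above. The function names are made up for illustration.

```python
from collections import Counter


def key_differences(orig_keys, new_keys):
    """Return (only_in_orig, only_in_new) as sorted lists, ignoring order.

    Keys are counted, so a key listed twice on one side and once on the
    other is still reported once.
    """
    orig_count = Counter(orig_keys)
    new_count = Counter(new_keys)
    only_in_orig = sorted((orig_count - new_count).elements())
    only_in_new = sorted((new_count - orig_count).elements())
    return only_in_orig, only_in_new


def read_keys(path):
    """Read one key per line (assumed format), skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]
```

If both lists come back empty, the two files hold the same keys and any remaining diff is down to ordering or block boundaries.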

pLEX · 03-09-2026, 06:10 AM

Quote:

Originally Posted by Markismus

@pLEX i ran the 3 mobi-files through the script and none was correctly parsed.

Hi
Thank you for your efforts!

I'll try lurking in internal communities; maybe someone has a UA-UA dictionary for PocketBook.

datyoma · Today, 05:59 AM

I used some of my vacation time to dive further into this topic, now with the help of Copilot Pro and a Ghidra MCP server. (GPT 5.4 is my new favorite for reverse engineering.) These are the highlights:

HTML support is quite good. Dictionary definitions are rendered using Qt 6 text widgets. The supported tags and inline styles are documented here: https://doc.qt.io/qt-6/richtext-html-subset.html. Even images can be rendered, it turns out. There's one caveat: for backwards-compatibility reasons the e-reader converts literal newlines (\n) to <br>, so it's crucial to remove them when building the dictionary to avoid extra whitespace.
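
A minimal sketch of that clean-up step, for anyone preparing HTML definitions by hand. It replaces each whitespace run that contains a newline with a single space instead of deleting it outright, so words on adjacent lines don't get glued together. That choice is an assumption, not something taken from the firmware or from pbdt, and the function name is made up.

```python
import re

# A run of whitespace that contains at least one CR or LF.
_NEWLINE_RUN = re.compile(r"[ \t]*[\r\n]+\s*")


def strip_literal_newlines(html: str) -> str:
    """Collapse every whitespace run containing a newline into one space."""
    return _NEWLINE_RUN.sub(" ", html).strip()
```

Explicit <br> tags are left alone, so intended line breaks survive. Note that this would also flatten text inside <pre>, if a dictionary happens to use it.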

The morphems section (morphems.txt) is completely ignored. Instead, there are several stemming engines:
- Hunspell is the preferred one, in case the language package is installed
- if not, https://github.com/Blake-Madden/OleanderStemmingLibrary is used for supported languages (see its readme for the list)
- there's also this obscure library, mostly for Eastern European languages: https://github.com/izacus/SlovenianLemmatizer (see /ebrmain/config/lemmagen)

Now you might be wondering how the source language is determined. First, there are dictionaries that are downloaded from PB servers. The metadata for them is stored in /mnt/ext1/system/pbdicts/pbdicts.db (SQLite). Second, there's an optional JSON metadata section, which is practically never used. Finally, there's a fallback which looks at the first line of keyboard.txt, finds the two letters before ':' and treats them as locale, e.g. "EN: English" -> EN. So it's mighty important to bake in the correct keyboard.txt if you want morphology to work properly.
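
To check what the fallback would make of a given keyboard.txt before baking it in, something like the following can help. It is only a sketch of the rule as described (two letters directly before ':' on the first line); the firmware's exact parsing, including any case handling, is not known here, and the function name is hypothetical.

```python
import re
from typing import Optional

# Two ASCII letters immediately before a ':' (assumed reading of the rule).
_LOCALE_BEFORE_COLON = re.compile(r"([A-Za-z]{2}):")


def locale_from_keyboard_txt(text: str) -> Optional[str]:
    """Return the two-letter code before ':' on the first line, or None."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    match = _LOCALE_BEFORE_COLON.search(first_line)
    return match.group(1) if match else None
```

For the example above, locale_from_keyboard_txt("EN: English") gives "EN". A first line without a colon gives None, which is the case to watch out for.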

The optional JSON metadata section allows fancy rendering of dictionary info: it has fields "name", "localeFrom", "localeTo", "description", "provider" (aka publisher/issuer, e.g. Wiktionary), "category" (e.g. universal), and a few more fields that don't seem to be used anywhere ("version", "specialProject", "set").
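
As an illustration, the fields that appear to be used could be assembled like this. All values below are made up, and it is an assumption that every field is a plain string and that locales are written as lowercase two-letter codes. How the JSON is embedded in the .dic file is not shown here.

```python
import json


def build_metadata(name, locale_from, locale_to,
                   description="", provider="", category=""):
    """Serialize the metadata fields listed above into compact JSON."""
    meta = {
        "name": name,
        "localeFrom": locale_from,
        "localeTo": locale_to,
        "description": description,
        "provider": provider,
        "category": category,
    }
    # ensure_ascii=False keeps non-Latin dictionary names as readable UTF-8.
    return json.dumps(meta, ensure_ascii=False, separators=(",", ":"))


# Hypothetical example values, not taken from a real dictionary.
example = build_metadata(
    name="Example English-German",
    locale_from="en",
    locale_to="de",
    description="Sample dictionary",
    provider="Wiktionary",
    category="universal",
)
```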
____________________________________

The CLI tool and the WASM UI (https://datyoma.codeberg.page/pbdt/v1/) now support:
- merging multiple input dictionaries when converting to .dic
- reading and writing JSON metadata section

I also removed the legacy conversion of <i></i> and <b></b> tags to binary (which is what converter.exe does); literal newlines are also stripped when writing and converted to <br> when reading (see above), so input XDXF must use <br> for newlines.
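
If your source definitions use plain newlines as line breaks, they need converting to <br> before they go into the XDXF. A minimal sketch, assuming every newline in the source text is a meaningful line break (the function name is made up):

```python
import re


def newlines_to_br(text: str) -> str:
    """Turn each line break into <br>, dropping leading/trailing ones."""
    return re.sub(r"\r\n|\r|\n", "<br>", text.strip("\r\n"))
```

Blank lines become two consecutive <br> tags, which keeps paragraph gaps visible.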

Apart from that, PyGlossary now has a PocketBook output plugin (also HTML-native, so to say): https://github.com/ilius/pyglossary/pull/708. PyGlossary supports lots of input formats, and is much easier to discover than this thread.

rkomar · Today, 03:26 PM

What do you mean that morphems.txt is ignored? Ignored by the converter, ignored by the PB dictionary code on the device,...?

datyoma · Today, 05:44 PM

It's stored in the dictionary, but ignored by PB dictionary reading code on the device.

Code:

$ readelf -CsW cramfs/lib/libdictionary.so | grep comparatorForQueryWord
334: ... pocketbook::morphology::OlenderStemSearcher::comparatorForQueryWord(...) const
392: ... pocketbook::dictionary::ChainingSearcher::comparatorForQueryWord(...) const
487: ... pocketbook::dictionary::CombinedSearcher::comparatorForQueryWord(...) const
611: ... pocketbook::dictionary::ExactSearcher::comparatorForQueryWord(...) const
614: ... pocketbook::morphology::HunspellSearcher::comparatorForQueryWord(...) const
920: ... pocketbook::morphology::LemmaGenSearcher::comparatorForQueryWord(...) const

The chaining searcher tries Hunspell, OlenderStem, LemmaGen and exact searchers in that order.
The combined searcher seems to be dead code: it is supposed to dispatch to other searchers based on locale, but nothing calls its constructor and nothing ever creates that mapping.
There's no trace of morphems.txt being utilised anywhere, in modern firmware at least.
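
The chaining behaviour can be pictured roughly as below. This is a simplified illustration, not decompiled code: the real classes expose comparatorForQueryWord, while here each searcher is just a function returning a headword or None. That the chain stops at the first hit is also an assumption. The toy searchers are stand-ins for the real engines.

```python
from typing import Callable, Iterable, Optional

# A searcher maps a query word to a matching headword, or None if it has none.
Searcher = Callable[[str], Optional[str]]


def chain_lookup(word: str, searchers: Iterable[Searcher]) -> Optional[str]:
    """Try each searcher in order and return the first non-None result."""
    for search in searchers:
        hit = search(word)
        if hit is not None:
            return hit
    return None


# Toy stand-ins (hypothetical behaviour, for illustration only).
HEADWORDS = {"cat", "run"}


def toy_stemmer(word: str) -> Optional[str]:
    """Strip a trailing 's' and accept the stem if it is a headword."""
    stem = word[:-1] if word.endswith("s") else None
    return stem if stem in HEADWORDS else None


def exact(word: str) -> Optional[str]:
    """Match the word only if it is itself a headword."""
    return word if word in HEADWORDS else None
```

With the order [toy_stemmer, exact], "cats" resolves through the stemmer and "run" falls through to the exact match, mirroring how the exact searcher comes last in the chain.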

rkomar · Today, 05:49 PM

Okay, thanks for the explanation. Perhaps it is just on old devices that it is used.

Edit: I'm glad you found this information out. I had in the back of my mind a project to produce morphems.txt rules from hunspell rules, but I see now that that would be a waste of time on modern devices.
