MobileRead Forums - View Single Post - QuickDicBuilder: Custom dictionaries on the Tolino

Peripathetic · 02-09-2020, 04:47 PM

Dictionaries used by the Tolino app are stored under .tolino/dictionaries/ on the user data partition. The format used is that of QuickDic (*.quickdic).
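Copying a finished dictionary into that folder could be scripted roughly as follows. This is only a sketch: `install_dictionary` and `device_root` are made-up names, and it assumes the user data partition is reachable as an ordinary folder on your computer, which the post does not confirm.

```python
import shutil
from pathlib import Path


def install_dictionary(quickdic_file, device_root):
    """Copy a .quickdic file into <device_root>/.tolino/dictionaries/.

    device_root is an assumption: the folder where the Tolino's user data
    partition happens to be mounted on your machine.
    """
    source = Path(quickdic_file)
    if source.suffix != ".quickdic":
        raise ValueError(f"not a .quickdic file: {source}")

    target_dir = Path(device_root) / ".tolino" / "dictionaries"
    # Refuse to create the folder: a missing folder most likely means
    # device_root points at the wrong place.
    if not target_dir.is_dir():
        raise FileNotFoundError(f"dictionary folder not found: {target_dir}")

    # copy2 returns the path of the newly created file
    return Path(shutil.copy2(source, target_dir))
```

Usage would be something like `install_dictionary("RU-EN_DictCC.quickdic", "E:/")`, with the drive letter depending on your system.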

Existing Dictionaries

The original QuickDic was an Android app written by Thad Hughes and eventually open-sourced. Dictionary files were hosted on Google Code and available for download, but all of them were deleted and apparently lost when Google shut down the website. A Web Archive snapshot of the project repository is available, but the files cannot be downloaded from it.

The project was later resurrected as QuickDic Restored by Reimar Döffinger. The author's repository contains a lot of dictionaries generated from Wiktionary, a sister project of Wikipedia, which was also the source of the original QuickDic data. However, as part of his work on the app, the author improved the dictionary format, which means that newer dictionaries (v007 instead of v006) are no longer compatible with the Tolino.

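These Wiktionary-based dictionaries can be downloaded on GitHub:

https://github.com/rdoeffinger/Dicti...2-dictionaries
https://github.com/rdoeffinger/Dicti...onary_info.txt (list of links)

Make sure to download the files labeled v006 only.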

Creating Dictionaries: The Tool

DictionaryPC is the Java tool that accompanies the QuickDic app and generates QuickDic dictionaries:

https://github.com/rdoeffinger/DictionaryPC

GitHub user Gitsaibot authored shell scripts for generating QuickDic dictionaries specifically with the Tolino in mind (the .jar file here is exactly the same as in the original project):

https://github.com/Gitsaibot/Toligen

Since it is a Java application, it needs a JRE to run (a portable version will do). It also requires the following libraries: Commons Compress, Commons Lang3, International Components for Unicode, Xerces-J Impl.

For convenience, I packaged everything necessary to run it in a Windows environment into a single archive, which I named QuickDicBuilder. Here's how to use it:

Download and unpack: QuickDicBuilder.zip
Edit QuickDicBuilder.cmd and set JAVA_EXE to point to the Java binary on your system.
QuickDicBuilder can now be called just like any other command-line utility.

Note: Thad Hughes and Reimar Döffinger are the original authors; I am only redistributing this. For source code, please refer to the GitHub links above.

Creating Dictionaries: How to Use It

The dictionary generation tool is functional but not very well documented. Some extra information on how it is supposed to be used can be found in old, closed GitHub issues and in its source code.

The utility supports several input formats: "Wiktionary", "tab_separated", and "Chemnitz". The last of these follows the format of several German dictionaries available here. Tab-separated is the most straightforward format to use. Perhaps it's best to illustrate how to use it by example.

Case #1: Dict.cc

Dict.cc dictionaries can be downloaded (for personal use) from:
https://www1.dict.cc/translation_file_request.php

I downloaded their Russian-English dictionary, and converted it to QuickDic format with the following command:

QuickDicBuilder --dictInfo="Dict.cc Russian-English" --dictOut="RU-EN_DictCC.quickdic" --input1="dictcc.ru-en.txt" --input1Charset=UTF8 --input1Format=tab_separated --input1Name="dictcc" --lang1="RU" --lang1Stoplist="StopLists\xx.txt" --lang2="EN"

I did not have a Russian stoplist, so I used an empty one. Stoplists contain frequently appearing words that should be dropped from the index. It would probably be better to use a real one.
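If no ready-made stoplist is at hand, candidates could be collected from the dictionary file itself by counting which words show up most often in the headword column. The sketch below is only a starting point and rests on assumptions not stated in the post: that lines starting with "#" are comments, and that a stoplist file is simply one word per line. `frequent_words` is a made-up name, and the result needs manual review before use.

```python
from collections import Counter


def frequent_words(lines, top_n=50):
    """Return the top_n most frequent words from the first tab-separated column."""
    counts = Counter()
    for line in lines:
        # Assumption: '#' lines are comments; lines without a tab are not entries
        if line.startswith("#") or "\t" not in line:
            continue
        headword = line.split("\t", 1)[0]
        counts.update(word.lower() for word in headword.split())
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    with open("dictcc.ru-en.txt", encoding="utf-8") as source:
        candidates = frequent_words(source)
    # Assumption: one word per line is an acceptable stoplist layout
    with open("ru_candidates.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(candidates) + "\n")
```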

This conversion is relatively easy because the format of the downloaded file follows what the utility expects as its "tab_separated" input.
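For files from other sources, a quick sanity check before running the builder may save some time. The sketch below assumes that a usable "tab_separated" line has at least two non-empty tab-delimited fields and that "#" marks a comment line; both are guesses, since the tool does not document this. `find_malformed` is a made-up helper name.

```python
def find_malformed(lines):
    """Return 1-based numbers of lines that lack two non-empty tab-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        # Assumption: blank lines and '#' comment lines are harmless
        if not text or text.startswith("#"):
            continue
        fields = text.split("\t")
        if len(fields) < 2 or not fields[0].strip() or not fields[1].strip():
            bad.append(number)
    return bad


if __name__ == "__main__":
    with open("dictcc.ru-en.txt", encoding="utf-8") as source:
        problems = find_malformed(source)
    print(f"{len(problems)} suspicious lines", problems[:20])
```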

Case #2: CC-CEDICT

CC-CEDICT is a Chinese-English dictionary that can be downloaded from:
https://www.mdbg.net/chinese/dictionary?page=cc-cedict

Here, the conversion command was:

QuickDicBuilder --dictInfo="CC-CEDICT Chinese-English" --dictOut="CC-CEDICT.quickdic" --input1="cedict_ts.txt" --input1Charset=UTF8 --input1Format=tab_separated --input1Name="cc-cedict" --lang1="ZH" --lang1Stoplist="StopLists\xx.txt" --lang2="EN" --lang2Stoplist="StopLists\en.txt"

However, the input data needed to be rearranged first from:
TraditionalHeadword SimplifiedHeadword [Pronunciation] /Definition/
to:
TraditionalHeadword SimplifiedHeadword<Tab>Definition /Pronunciation/

For this purpose I used the following regular expression with sed:

sed -e "s/^ *\([^ ]*\) \([^ ]*\) *\[ *\(.*\) *\] *\/ *\(.*\) *\/.*$/\1 \2\t\4 \/\3\//g" cedict_ts.u8 > cedict_ts.txt
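For anyone without sed (e.g. on plain Windows), the same rearrangement can be sketched in Python. This is not the author's method, just an approximate equivalent. It deviates from the sed expression in two places: the pronunciation group is non-greedy, and lines that do not match (such as "#" comment lines) are dropped instead of being passed through unchanged. `rearrange` and `convert` are made-up names.

```python
import re

# Two headwords, [pronunciation], then /definitions/ up to the last slash
CEDICT_LINE = re.compile(r"^ *([^ ]*) ([^ ]*) *\[ *(.*?) *\] *\/ *(.*) *\/.*$")


def rearrange(line):
    """Return 'head1 head2<Tab>definition /pronunciation/', or None if no match."""
    match = CEDICT_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    head1, head2, pronunciation, definition = match.groups()
    return f"{head1} {head2}\t{definition} /{pronunciation}/"


def convert(lines):
    """Yield rearranged lines, skipping anything that does not look like an entry."""
    for line in lines:
        result = rearrange(line)
        if result is not None:
            yield result


if __name__ == "__main__":
    with open("cedict_ts.u8", encoding="utf-8") as source, \
            open("cedict_ts.txt", "w", encoding="utf-8") as target:
        for entry in convert(source):
            target.write(entry + "\n")
```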

Results

This was done quickly just to check if it works but if you want to, you can download the dictionary files I generated.
