Thread: Bilingual books
08-01-2016, 04:55 PM   #8
slex
A step-by-step guide to making bilingual books

I know I'm bumping an old thread, but I think some people might find this useful. It's a fairly quick process once you have the right tools set up. I use Linux, but you should be able to make it work on other operating systems.

You need Perl, Python, a program to split a text into sentences, a program to align the text, and Calibre. All the programs are, as far as I know, free software.


The program to split the sentences (Perl script) is here (go to Download -> Tools):
http://www.statmt.org/europarl/

Splitting is not strictly necessary, but it makes the job easier for the text aligning program and improves the results.

The program to align the text is here:
http://mokk.bme.hu/resources/hunalign/

You can do without a dictionary, but the results are better with one. The dictionary is just a text file with one entry per line, the destination-language word first and the source-language word after the " @ " separator. Assuming German as the source and English as the destination language, it looks like this:

englishword1 @ germanword1
englishword2 @ germanword2
...


It might seem like a lot of work, but there is a shortcut. Google "most frequently used words in English" and you will get a list that will do the job. Copy and paste it into Google Translate with German as the destination language. Use a spreadsheet program (or other tools) to line the two word lists up with an "@" column between them, then save as plain text with spaces as separators, or just paste the result into a text file again. Note that you should always use Unicode (for example UTF-8). Save the file as "en-de.dic" and you are ready to go.
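If you would rather skip the spreadsheet, the same pairing can be done with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes you have saved the English list and its German translation as two text files with one word per line, in the same order.

```python
from pathlib import Path


def build_dictionary(english_words, german_words):
    """Pair two word lists line by line into 'english @ german' entries."""
    if len(english_words) != len(german_words):
        raise ValueError("the two word lists must have the same length")
    entries = []
    for en, de in zip(english_words, german_words):
        en, de = en.strip(), de.strip()
        if en and de:  # skip pairs where either side is blank
            entries.append(f"{en} @ {de}")
    return entries


def write_dictionary(entries, path="en-de.dic"):
    # Write UTF-8 explicitly so umlauts survive on any platform.
    Path(path).write_text("\n".join(entries) + "\n", encoding="utf-8")


def read_words(path):
    # One word per line, as pasted from the frequency list or the translation.
    return Path(path).read_text(encoding="utf-8").splitlines()
```

With the hypothetical file names english_words.txt and german_words.txt, calling write_dictionary(build_dictionary(read_words("english_words.txt"), read_words("german_words.txt"))) writes en-de.dic into the current directory.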

Get the sources of the texts and convert them to text format. I used "The Sign of the Four" from the MobileRead Library in German and in English. Clean the files so that they have only the main text, as there are differences in the table of contents.

Below I assume that all your scripts and files are in the same directory.

You can convert using Calibre:

Code:
ebook-convert engtext.epub engtext.txt

ebook-convert deutext.epub deutext.txt

Use the sentence splitter:

Code:
perl split-sentences.perl -l en < engtext.txt > en.txt

perl split-sentences.perl -l de < deutext.txt > de.txt

Run the text aligner, with the source text first and the destination text second:

Code:
hunalign -text en-de.dic de.txt en.txt > book.txt

You now have a text file that needs further processing. I found a Python script here that produces an HTML file from the result, laid out as a table. It gives good results for the HTML file, but a table is not a very good option on a 6-inch reader. I modified the script to make the destination-language text smaller, gray and italic (find the script as an attachment).

Code:
python hun2htmlgray14.py book.txt > book.html
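For anyone curious what such a conversion involves, here is a minimal sketch. It is not the attached hun2htmlgray14.py script. It assumes that the -text output of hunalign has one alignment per line, with tab-separated source text, destination text and a confidence score, and that merged sentences are joined by " ~~~ ". The function names and the CSS class name "tr" are made up for illustration.

```python
import html
import sys


def parse_alignment(lines):
    """Yield (source, destination) pairs from hunalign -text output lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip empty or malformed lines
        # Assumption: " ~~~ " separates sentences merged into one segment.
        source = fields[0].replace(" ~~~ ", " ").strip()
        target = fields[1].replace(" ~~~ ", " ").strip()
        if source or target:
            yield source, target


def to_html(pairs, title="Bilingual book"):
    """Render the pairs as a two-column table, destination text in gray italic."""
    rows = "\n".join(
        '<tr><td>{}</td><td class="tr">{}</td></tr>'.format(
            html.escape(source), html.escape(target))
        for source, target in pairs)
    return (
        '<html><head><meta charset="utf-8"><title>{}</title>\n'
        "<style>td.tr {{ color: gray; font-style: italic; font-size: smaller; }}</style>\n"
        "</head><body><table>\n{}\n</table></body></html>\n"
    ).format(html.escape(title), rows)


# Only runs when exactly one file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as handle:
        sys.stdout.write(to_html(parse_alignment(handle)))
```

Saved under a name of your choice and run with book.txt as its only argument, it prints the HTML to standard output, so it can be redirected to book.html in the same way as the command above.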

And then again Calibre, with the option to linearize tables:

Code:
ebook-convert book.html book.epub --linearize-tables --authors "Arthur Conan Doyle" --title "The Sign of the Four"

I've attached the final result. It looks good in Calibre and in the stock reader on the Nook Simple Touch. It may look different in other readers, but it works for me.
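
If you make more than one book, the commands from this post can be chained in a small driver script. The following is a sketch, not something I have shipped in the attachment: the function names are made up, the file names are the ones used in this post, and it assumes that ebook-convert, perl, hunalign and python are on your PATH and that all scripts and files are in the current directory.

```python
import subprocess
from contextlib import ExitStack


def build_steps(author, title):
    """Return (argv, stdin_file, stdout_file) tuples, one per command in the guide."""
    return [
        (["ebook-convert", "engtext.epub", "engtext.txt"], None, None),
        (["ebook-convert", "deutext.epub", "deutext.txt"], None, None),
        (["perl", "split-sentences.perl", "-l", "en"], "engtext.txt", "en.txt"),
        (["perl", "split-sentences.perl", "-l", "de"], "deutext.txt", "de.txt"),
        (["hunalign", "-text", "en-de.dic", "de.txt", "en.txt"], None, "book.txt"),
        (["python", "hun2htmlgray14.py", "book.txt"], None, "book.html"),
        (["ebook-convert", "book.html", "book.epub", "--linearize-tables",
          "--authors", author, "--title", title], None, None),
    ]


def run_steps(steps):
    """Run each command in order, wiring up the < and > redirections."""
    for argv, stdin_file, stdout_file in steps:
        with ExitStack() as stack:
            stdin = stack.enter_context(open(stdin_file, "rb")) if stdin_file else None
            stdout = stack.enter_context(open(stdout_file, "wb")) if stdout_file else None
            # check=True stops the pipeline at the first failing command.
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
```

Calling run_steps(build_steps("Arthur Conan Doyle", "The Sign of the Four")) reproduces the sequence used for the attached book. Keeping the command list separate from the runner makes it easy to print or edit the steps before anything is executed.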
Attached Files
File Type: epub The Sign of the Four - DE-EN.epub (261.2 KB, 355 views)
File Type: zip hun2htmlscripts.zip (2.2 KB, 430 views)