05-17-2016, 07:33 PM | #1 |
Member
Posts: 15
Karma: 1000
Join Date: May 2016
Device: None
|
Hyphenate your ebook files from the command line
Hi everyone! Long time reader, first time poster.
I'm working on a free open-source ebook project called Standard Ebooks. Its goal is to bring classics that are free of copyright restrictions (i.e. public domain books) up to modern technological and editorial standards--in other words, to produce commercial-quality liberated ebooks for true book lovers.

Part of the "modern technological standards" bit is making an effort at supporting auto hyphenation in our ebooks. Ideally, since ebooks are basically just web pages, ereading software would simply use CSS's `hyphens` property to do that automatically. In reality, almost no ereading software has that ability right now. But lately a lot of ereading software has gained the ability to understand soft hyphens, and that's a small step in the right direction. Of the major ereading platforms, at least Google Play Books and Kindle support soft hyphens. (This is Kindle's much-vaunted "enhanced typography" for the azw3 file type... ugh.)

I searched around for programs that could add soft hyphens automatically, and came across an excellent Calibre plugin that can do that via a GUI. But I needed to automate the process from the command line, so that we could automatically build compatible ebooks from our untainted epub3 sources. Browsing through that thread suggested that a few people were looking for a similar solution.

So I went ahead and created a Python script that will automatically add soft hyphens to the text of any xhtml file, and thus to ebooks. I thought I'd share it with you all in case someone found it helpful.

To install on Ubuntu 16.04: Code:
#Make sure you have pip3 installed
sudo apt install python3-pip
#Install some dependencies
sudo pip3 install pyhyphen beautifulsoup4
#Download the script and make it executable
wget https://raw.githubusercontent.com/standardebooks/tools/master/hyphenate
chmod +x hyphenate

The script operates on single xhtml files, but since epub files are just zip files filled with xhtml files, you can hyphenate a whole ebook by unzipping the epub, running hyphenate on all of the xhtml files within, and re-zipping it: Code:
#Blow up our epub file
unzip mybook.epub -d mybook-extracted
#Hyphenate all (x)html files (./hyphenate is the script downloaded into the current directory)
find mybook-extracted -iname "*htm*" -exec ./hyphenate "{}" \;
#Rebuild our epub file from inside the extracted folder, so paths are stored relative to the epub root
#(you may have to tweak these lines a little, e.g. if your content folder isn't named OEBPS)
cd mybook-extracted
#The mimetype file has to be the first entry in the archive and stored uncompressed
zip -0 -X ../mybook-hyphenated.epub mimetype
zip -9 --no-dir-entries -X --recurse-paths ../mybook-hyphenated.epub META-INF OEBPS
cd ..

Adding soft hyphens to Kindle ebook files

For those of you using Kindle devices or software, from my limited experiments it appears that only azw3 files support soft hyphens right now. KFX files apparently hyphenate automatically and so don't need soft hyphens. To hyphenate an azw3 file, you can use Calibre to convert it to epub first, perform the hyphenation, then convert it back to azw3: Code:
#Use Calibre's command-line tools to convert your Kindle book to epub
ebook-convert mybook.azw3 mybook.epub
#Perform the steps for epub as listed above
#After you've done that, use Calibre to convert back to azw3
ebook-convert mybook-hyphenated.epub mybook-hyphenated.azw3

pyhyphen requires that you install dictionaries for each language you want to process. I believe it downloads a dictionary for your system's default language when it's installed, but there are instructions on downloading additional dictionaries in the pyhyphen documentation.

The script tries to guess the xhtml file's language by looking for a `lang` attribute on the `<html>` element. If your files don't have one, you can force the script to use a specific language like so: Code:
./hyphenate --language="en-US" myfile.xhtml

Last edited by acabal; 05-18-2016 at 10:08 PM. |
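For anyone curious what the language guessing described above might look like, here is a minimal sketch using only the Python standard library. It is not the actual code from the hyphenate script: the function name `guess_language` and its `forced` parameter are made up for illustration, and it assumes the file parses as well-formed XML without undefined named entities. It checks both `xml:lang` and plain `lang` on the root `<html>` element, and lets a forced value (like the `--language` option) win.

```python
import xml.etree.ElementTree as ET

# Namespace that the predefined "xml:" prefix maps to
XML_NS = "http://www.w3.org/XML/1998/namespace"


def guess_language(xhtml_text, forced=None):
    """Return a language tag for an XHTML document, or None if none is found.

    Hypothetical helper, not part of the hyphenate script.
    `forced` plays the role of a --language override.
    """
    if forced:
        return forced

    # The root element of an XHTML file is <html>
    root = ET.fromstring(xhtml_text)

    # Prefer xml:lang, then fall back to a plain lang attribute
    return root.get("{%s}lang" % XML_NS) or root.get("lang")


if __name__ == "__main__":
    sample = (
        '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB">'
        "<head><title>Sample</title></head><body/></html>"
    )
    print(guess_language(sample))
    print(guess_language("<html><body/></html>", forced="en-US"))
```

If the function returns None, that is the case where forcing a language on the command line would be needed.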
05-17-2016, 08:36 PM | #2 | |
Grand Sorcerer
Posts: 6,496
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
|
Sorry for the off-topic post, but I want to correct some misinformation.
Quote:
Amazon's enhanced typesetting relies on a proprietary e-book format: KFX. Soft hyphens are not present in books using this format. Instead language-specific hyphenation dictionaries are used to add hyphens when text is rendered. The KFX renderer has these dictionaries: dicts/hyph_de.bin, dicts/hyph_en.bin, dicts/hyph_es.bin, dicts/hyph_fr.bin, dicts/hyph_it.bin, dicts/hyph_nl.bin, dicts/hyph_pt.bin, dicts/hyph_ru.bin. |
|
05-18-2016, 04:06 PM | #3 | |
Member
Posts: 15
Karma: 1000
Join Date: May 2016
Device: None
|
Quote:
Adding soft hyphens to the same file with this script and converting with Calibre to the same format makes Kindle hyphenate. Curious to see the discussion on this--is there a thread you can point me to? |
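To make "adding soft hyphens" concrete: it amounts to inserting the invisible character U+00AD at the allowed break points inside each word. Below is a minimal sketch of that idea on plain text, not the script's real implementation. The names `add_soft_hyphens` and `toy_syllables` are hypothetical, the syllable splitter is a stand-in for a real dictionary-backed one (such as the dictionaries pyhyphen loads), and markup handling is left out entirely.

```python
import re

SOFT_HYPHEN = "\u00ad"


def add_soft_hyphens(text, syllables):
    """Insert U+00AD between the syllables of every word in `text`.

    `syllables` is a caller-supplied function mapping a word to a list of
    its syllables. It is a placeholder for a dictionary-backed splitter.
    """
    def hyphenate_word(match):
        word = match.group(0)
        parts = syllables(word)
        # Leave the word alone if the splitter gives nothing usable
        if not parts or "".join(parts) != word:
            return word
        return SOFT_HYPHEN.join(parts)

    # Match runs of letters only (no digits, underscores or punctuation)
    return re.sub(r"[^\W\d_]+", hyphenate_word, text)


def toy_syllables(word):
    """Toy splitter with a single hard-coded entry, for demonstration only."""
    table = {"hyphenation": ["hy", "phen", "ation"]}
    return table.get(word, [word])


if __name__ == "__main__":
    result = add_soft_hyphens("automatic hyphenation", toy_syllables)
    # repr() makes the otherwise invisible soft hyphens visible as \xad
    print(repr(result))
```

A reader that understands soft hyphens only shows a hyphen at one of these points when the word actually has to break at the end of a line.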
|
05-18-2016, 07:46 PM | #4 | ||
Grand Sorcerer
Posts: 6,496
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
|
Quote:
Quote:
ETA: I have also written a calibre plugin that can create KFX files in conjunction with Amazon's Kindle Previewer software. Last edited by jhowell; 05-18-2016 at 07:55 PM. |
||
05-18-2016, 10:06 PM | #5 | |
Member
Posts: 15
Karma: 1000
Join Date: May 2016
Device: None
|
Quote:
|
|
10-04-2016, 09:00 AM | #6 |
Groupie
Posts: 195
Karma: 42216
Join Date: Oct 2013
Location: Poland
Device: Kindles: KOA1, KV
|
|
10-04-2016, 02:05 PM | #7 |
Grand Sorcerer
Posts: 6,496
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
|
They are packed into a KFX container file. The name and location vary by platform. For example, in Kindle firmware 5.8.2.1 the file is "/usr/share/yellowjersey/res-eink.dat" and in the Kindle Previewer 3.5 for Windows the file is "%localappdata%\Amazon\Kindle Previewer 3\res-win.dat".
They contain the following packed resources (among other things):
dicts/bin/hyph_en.bin
dicts/bin/hyph_de.bin
dicts/bin/hyph_it.bin
dicts/bin/hyph_pt.bin
dicts/bin/hyph_es.bin
dicts/bin/hyph_fr.bin
dicts/bin/hyph_ru.bin
dicts/bin/hyph_nl.bin
dicts/bin/hyph_deva.bin
dicts/bin/hyph_gujr.bin
dicts/bin/hyph_taml.bin
dicts/bin/hyph_mlym.bin |
10-04-2016, 03:09 PM | #8 | |
Groupie
Posts: 195
Karma: 42216
Join Date: Oct 2013
Location: Poland
Device: Kindles: KOA1, KV
|
On Mac OS, Kindle Previewer 3 has a res-mac.dat file. Is it possible to unpack, repack, or tinker with the resource files? The hyphenation dictionaries are based on the hyph dic files from OpenOffice:
From AttributionMac.txt: Quote:
Last edited by quiris; 10-04-2016 at 05:51 PM. |
|
10-04-2016, 05:34 PM | #9 |
Grand Sorcerer
Posts: 6,496
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
|
You could substitute the data, but unfortunately there isn't (yet) any software that can re-package it.
Last edited by jhowell; 01-18-2017 at 08:37 AM. Reason: Remove reference |
10-05-2016, 03:13 AM | #10 |
Groupie
Posts: 195
Karma: 42216
Join Date: Oct 2013
Location: Poland
Device: Kindles: KOA1, KV
|
|
|