Old 05-17-2016, 07:33 PM   #1
acabal
Member
Posts: 15
Karma: 1000
Join Date: May 2016
Device: None
Hyphenate your ebook files from the command line

Hi everyone! Long time reader, first time poster.

I'm working on a free open-source ebook project called Standard Ebooks. Its goal is to bring classics that are free of copyright restrictions (i.e. public domain books) up to modern technological and editorial standards--in other words, to produce commercial-quality liberated ebooks for true book lovers.

Part of the "modern technological standards" bit is making an effort at supporting auto hyphenation in our ebooks. Ideally, since ebooks are basically just web pages, ereading software would simply use CSS's `hyphens` property to do that automatically. In reality, almost no ereading software has that ability right now. But lately a lot of ereading software has gained the ability to understand soft hyphens, and that's a small step in the right direction.
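
For anyone unfamiliar with soft hyphens, here is a minimal Python sketch of the idea. A soft hyphen is the invisible character U+00AD (`&shy;` in HTML) that marks a place where a renderer is allowed to break a word. The syllable split below is typed in by hand purely for illustration; it does not come from any dictionary.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, written &shy; in HTML


def add_soft_hyphens(syllables):
    """Join syllables with soft hyphens, marking the allowed break points."""
    return SOFT_HYPHEN.join(syllables)


def strip_soft_hyphens(text):
    """Recover the original text by removing every soft hyphen."""
    return text.replace(SOFT_HYPHEN, "")


# Hand-picked split, for illustration only.
word = add_soft_hyphens(["ty", "pog", "ra", "phy"])
print(repr(word))
print(strip_soft_hyphens(word) == "typography")
```

Because the character stays in the text, search and copy/paste can be affected in some software, which is one reason render-time hyphenation via CSS is the nicer long-term solution.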

Of the major ereading platforms, at least Google Play Books and Kindle support soft hyphens. (This is Kindle's much-vaunted "enhanced typography" for the azw3 file type... ugh.)

I searched around for programs that could add soft hyphens automatically, and came across an excellent Calibre plugin that can do that via a GUI. But I needed to automate the process from the command line, so that we could automatically build compatible ebooks from our untainted epub3 sources. Browsing through that thread suggested that a few people were looking for a similar solution.

So I went ahead and created a Python script that will automatically add soft hyphens to the text of any xhtml file, and thus to ebooks. I thought I'd share it with you all in case someone found it helpful.
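
To give a rough idea of what a tool like this has to do, here is a standard-library-only sketch. It is not the actual hyphenate script (which, per the install steps, depends on pyhyphen and Beautiful Soup). The `toy_breaks` function is a hypothetical stand-in for a real dictionary lookup and only knows one word.

```python
import re
import xml.etree.ElementTree as ET

SOFT_HYPHEN = "\u00ad"
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters only
SKIP_TAGS = {"script", "style", "title"}  # text we should not touch


def toy_breaks(word):
    """Hypothetical stand-in for a hyphenation dictionary: break offsets."""
    known = {"hyphenation": [2, 6]}
    return known.get(word.lower(), [])


def soften(text, breaks_for=toy_breaks):
    """Insert soft hyphens into every word of a plain-text string."""
    def replace(match):
        word = match.group(0)
        pieces, start = [], 0
        for position in breaks_for(word):
            pieces.append(word[start:position])
            start = position
        pieces.append(word[start:])
        return SOFT_HYPHEN.join(pieces)
    return WORD_RE.sub(replace, text)


def hyphenate_xhtml(source, breaks_for=toy_breaks):
    """Soften text nodes only, leaving tags and attributes alone."""
    root = ET.fromstring(source)
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if element.text and local_name not in SKIP_TAGS:
            element.text = soften(element.text, breaks_for)
        if element.tail:
            element.tail = soften(element.tail, breaks_for)
    return ET.tostring(root, encoding="unicode")


result = hyphenate_xhtml("<p>Automatic <em>hyphenation</em> helps.</p>")
print(result.replace(SOFT_HYPHEN, "-"))
```

The important part is that only text nodes are changed, never markup. Note that ElementTree rewrites namespace prefixes on output unless you register them, so a real tool would want a more faithful serializer.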

To install on Ubuntu 16.04:

Code:
#Make sure you have pip3 installed
sudo apt install python3-pip
	
#Install some dependencies
sudo pip3 install pyhyphen beautifulsoup4
	
#Download the script and make it executable
wget https://raw.githubusercontent.com/standardebooks/tools/master/hyphenate
chmod +x hyphenate

Adding soft hyphens to an epub file

The script operates on single xhtml files, but since epub files are just zip files filled with xhtml files, you can hyphenate a whole ebook by unzipping the epub, running hyphenate on all of the xhtml files within, and re-zipping it:

Code:
#Blow up our epub file
unzip mybook.epub -d mybook-extracted

#Hyphenate all (x)html files
find mybook-extracted -iname "*htm*" -exec ./hyphenate "{}" \;

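#Rebuild our epub file from inside the extracted directory, so archive paths are relative to the epub root
#(you may have to tweak these lines a little, e.g. if your content directory isn't named OEBPS)
cd mybook-extracted
zip -X -0 ../mybook-hyphenated.epub mimetype
zip -X -9 --no-dir-entries --recurse-paths ../mybook-hyphenated.epub META-INF OEBPS
cd ..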
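
If you'd rather do the re-zipping step from Python, something like the following sketch should work. It assumes the extracted directory has a `mimetype` file at its top level, and it follows the EPUB container rule that `mimetype` must be the first entry in the archive and stored uncompressed. The function name `build_epub` is my own, not part of any tool mentioned here.

```python
import zipfile
from pathlib import Path


def build_epub(source_dir, epub_path):
    """Zip an extracted epub directory back into a single epub file."""
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # "mimetype" goes first and is stored without compression.
        archive.write(source / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for path in sorted(source.rglob("*")):
            relative = path.relative_to(source).as_posix()
            # Skip directories (no directory entries) and the mimetype file.
            if path.is_file() and relative != "mimetype":
                archive.write(path, relative,
                              compress_type=zipfile.ZIP_DEFLATED)
```

Usage would be `build_epub("mybook-extracted", "mybook-hyphenated.epub")`. Write the output file outside the extracted directory, otherwise the archive would try to include itself.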

Adding soft hyphens to Kindle ebook files

For those of you using Kindle devices or software, my limited experiments suggest that azw3 is currently the only Kindle format in which soft hyphens have any effect. KFX files apparently hyphenate automatically and so don't need soft hyphens. To hyphenate an azw3 file, you can use Calibre to convert it to epub first, perform the hyphenation, then convert it back to azw3:

Code:
#Use Calibre's command-line tools to convert your Kindle book to epub
ebook-convert mybook.azw3 mybook.epub

#Perform the steps for epub as listed above

#After you've done that, use Calibre to convert back to azw3
ebook-convert mybook-hyphenated.epub mybook-hyphenated.azw3
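
If you are scripting the round trip, the two conversions can be wrapped in a small Python helper. This is only a sketch: it assumes Calibre's `ebook-convert` is on your PATH, and the `convert` function and its `dry_run` flag are names I made up for illustration.

```python
import subprocess


def convert(source, target, dry_run=False):
    """Run (or just return) an ebook-convert command for source -> target."""
    command = ["ebook-convert", source, target]
    if not dry_run:
        # check=True raises CalledProcessError if the conversion fails.
        subprocess.run(command, check=True)
    return command


# Dry run: show the two commands that bracket the hyphenation step.
print(convert("mybook.azw3", "mybook.epub", dry_run=True))
print(convert("mybook-hyphenated.epub", "mybook-hyphenated.azw3", dry_run=True))
```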

A note on languages

pyhyphen requires that you install dictionaries for each language you want to process. I believe it downloads a dictionary for your system's default language when it's installed, but there are instructions on downloading additional dictionaries in the pyhyphen documentation.

The script tries to guess the xhtml file's language by looking for a `lang` attribute on the `<html>` element. If your files don't have one, you can force the script to use a specific language like so:

Code:
./hyphenate --language="en-US" myfile.xhtml
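
For the curious, the language guess could look roughly like the sketch below. This is an illustration using only the standard library, not the script's actual code. It checks `xml:lang` as well as `lang`, and the conversion from "en-US" to "en_US" is an assumption on my part, based on hyphenation dictionaries commonly being named with underscores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def detect_language(xhtml, fallback=None):
    """Return the root element's language as e.g. 'en_US', or None."""
    root = ET.fromstring(xhtml)
    tag = root.get(XML_LANG) or root.get("lang") or fallback
    # Assumption: dictionaries are keyed with underscores, not hyphens.
    return tag.replace("-", "_") if tag else None


page = '<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB"><body/></html>'
print(detect_language(page))
print(detect_language("<html><body/></html>", fallback="en-US"))
```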

I hope someone finds this useful. We also have a few more command-line tools for processing ebooks that some of you might find helpful in our complete tools repository. This and all our tools are GPLv3 and contributions via GitHub are welcome. And if you'd like to volunteer at the Standard Ebooks project and bring a liberated classic up to our high standards, drop me a line!

Last edited by acabal; 05-18-2016 at 10:08 PM.
Old 05-17-2016, 08:36 PM   #2
jhowell
Grand Sorcerer
Posts: 6,496
Karma: 84420419
Join Date: Nov 2011
Location: Tampa Bay, Florida
Device: Kindles
Sorry for the off-topic post, but I want to correct some misinformation.

Quote:
Originally Posted by acabal View Post
This is Kindle's much-vaunted "enhanced typography": instead of doing it the smart way with CSS hyphens on the firmware level, it seems they're instead planning on somehow adding soft hyphens to their entire ebook catalog... ugh.
There was speculation that soft hyphens were being used for enhanced typesetting, but this turned out to be false.

Amazon's enhanced typesetting relies on a proprietary e-book format: KFX. Soft hyphens are not present in books using this format. Instead language-specific hyphenation dictionaries are used to add hyphens when text is rendered.

The KFX renderer has these dictionaries: dicts/hyph_de.bin, dicts/hyph_en.bin, dicts/hyph_es.bin, dicts/hyph_fr.bin, dicts/hyph_it.bin, dicts/hyph_nl.bin, dicts/hyph_pt.bin, dicts/hyph_ru.bin.
Old 05-18-2016, 04:06 PM   #3
acabal
Quote:
Originally Posted by jhowell View Post
Sorry for the off-topic post, but I want to correct some misinformation.



There was speculation that soft hyphens were being used for enhanced typesetting, but this turned out to be false.

Amazon's enhanced typesetting relies on a proprietary e-book format: KFX. Soft hyphens are not present in books using this format. Instead language-specific hyphenation dictionaries are used to add hyphens when text is rendered.

The KFX renderer has these dictionaries: dicts/hyph_de.bin, dicts/hyph_en.bin, dicts/hyph_es.bin, dicts/hyph_fr.bin, dicts/hyph_it.bin, dicts/hyph_nl.bin, dicts/hyph_pt.bin, dicts/hyph_ru.bin.
Are you certain? I just ran a test on a Kindle Voyage with 5.6.5: an epub file without soft hyphens converted to azw3 with Calibre does not hyphenate at all.

Adding soft hyphens to the same file with this script and converting with Calibre to the same format makes Kindle hyphenate.

Curious to see the discussion on this--is there a thread you can point me to?
Old 05-18-2016, 07:46 PM   #4
jhowell
Quote:
Originally Posted by acabal View Post
Are you certain? I just ran a test on a Kindle Voyage with 5.6.5: an epub file without soft hyphens converted to azw3 with Calibre does not hyphenate at all.

Adding soft hyphens to the same file with this script and converting with Calibre to the same format makes Kindle hyphenate.
You can make AZW3 files render with hyphenation by adding soft hyphens, but this isn't how Amazon does it. Enhanced typesetting uses the KFX file format, not AZW3. It uses a different rendering engine that performs hyphenation (and adds kerning and ligatures) without needing soft hyphens.

Quote:
Originally Posted by acabal View Post
Curious to see the discussion on this--is there a thread you can point me to?
This thread has information learned about KFX and enhanced typesetting since last summer.

ETA: I have also written a calibre plugin that can create KFX files in conjunction with Amazon's Kindle Previewer software.

Last edited by jhowell; 05-18-2016 at 07:55 PM.
Old 05-18-2016, 10:06 PM   #5
acabal
Quote:
Originally Posted by jhowell View Post
You can make AZW3 files render with hyphenation by adding soft hyphens, but this isn't how Amazon does it. Enhanced typesetting uses the KFX file format, not AZW3. It uses a different rendering engine that performs hyphenation (and adds kerning and ligatures) without needing soft hyphens.



This thread has information learned about KFX and enhanced typesetting since last summer.

ETA: I have also written a calibre plugin that can create KFX files in conjunction with Amazon's Kindle Previewer software.
Ah, I see--I've updated the post to reflect that. Thanks for clearing that up.
Old 10-04-2016, 09:00 AM   #6
quiris
Groupie
Posts: 195
Karma: 42216
Join Date: Oct 2013
Location: Poland
Device: Kindles: KOA1, KV
Quote:
Originally Posted by jhowell View Post
The KFX renderer has these dictionaries: dicts/hyph_de.bin, dicts/hyph_en.bin, dicts/hyph_es.bin, dicts/hyph_fr.bin, dicts/hyph_it.bin, dicts/hyph_nl.bin, dicts/hyph_pt.bin, dicts/hyph_ru.bin.
Where are the files located?
Old 10-04-2016, 02:05 PM   #7
jhowell
Quote:
Originally Posted by quiris View Post
Where are the files located?
They are packed into a KFX container file. The name and location vary by platform. For example, in Kindle firmware 5.8.2.1 the file is "/usr/share/yellowjersey/res-eink.dat" and in the Kindle Previewer 3.5 for Windows the file is "%localappdata%\Amazon\Kindle Previewer 3\res-win.dat".

They contain the following packed resources (among other things):

dicts/bin/hyph_en.bin
dicts/bin/hyph_de.bin
dicts/bin/hyph_it.bin
dicts/bin/hyph_pt.bin
dicts/bin/hyph_es.bin
dicts/bin/hyph_fr.bin
dicts/bin/hyph_ru.bin
dicts/bin/hyph_nl.bin
dicts/bin/hyph_deva.bin
dicts/bin/hyph_gujr.bin
dicts/bin/hyph_taml.bin
dicts/bin/hyph_mlym.bin
Old 10-04-2016, 03:09 PM   #8
quiris
Quote:
Originally Posted by jhowell View Post
res-win.dat
On Mac OS, Kindle Previewer 3 has a res-mac.dat file. Is it possible to unpack/repack/tinker with the resource files? The hyphenation dictionaries are based on the hyph .dic files from OpenOffice:
From AttributionMac.txt:
Quote:
--------------------
a. [English - US]
--------------------

hyph_en_US.dic - American English hyphenation patterns for OpenOffice.org version 2010-02-23
The same dictionaries are used by Adobe Digital Editions. I wonder whether it's possible to replace, for example, the French dictionary with a Polish hyphenation dictionary, so that Polish ebooks with their language set to fr would get hyphenated...
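
For context on why swapping dictionaries might work: OpenOffice-style hyph .dic files hold TeX-style (Liang) patterns, where digits between letters score possible break points and an odd final score allows a break. Below is a minimal Python sketch of that technique with a handful of illustrative patterns. Real dictionaries contain thousands of patterns plus header lines, and I am assuming nothing here about how the Kindle renderer itself applies them.

```python
SOFT_HYPHEN = "\u00ad"

# A few illustrative patterns; "." marks a word boundary in real files.
RAW_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at",
                "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split 'hy3ph' into letters 'hyph' and scores [0, 0, 3, 0, 0]."""
    letters, points = "", [0]
    for char in pattern:
        if char.isdigit():
            points[-1] = int(char)
        else:
            letters += char
            points.append(0)
    return letters, points


PATTERNS = dict(parse_pattern(p) for p in RAW_PATTERNS)


def hyphenate(word, patterns=PATTERNS, left_min=2, right_min=2):
    """Insert soft hyphens wherever the highest pattern score is odd."""
    work = "." + word.lower() + "."
    scores = [0] * (len(work) + 1)
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            points = patterns.get(work[start:end])
            if points:
                for offset, value in enumerate(points):
                    scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    # scores[index + 1] is the score of a break before word[index].
    for index in range(left_min, len(word) - right_min + 1):
        if scores[index + 1] % 2 == 1:
            pieces.append(word[last:index])
            last = index
    pieces.append(word[last:])
    return SOFT_HYPHEN.join(pieces)


print(hyphenate("hyphenation").replace(SOFT_HYPHEN, "-"))  # hy-phen-ation
```

Since the patterns are just language-specific data fed to a generic algorithm, a renderer that looks up its dictionary by language code would, in principle, not care which language's patterns the file actually holds.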

Last edited by quiris; 10-04-2016 at 05:51 PM.
Old 10-04-2016, 05:34 PM   #9
jhowell
Quote:
Originally Posted by quiris View Post
On Mac OS, Kindle Previewer 3 has a res-mac.dat file. Is it possible to unpack/repack/tinker with the resource files?
You could substitute the data, but unfortunately there isn't (yet) any software that can re-package it.

Last edited by jhowell; 01-18-2017 at 08:37 AM. Reason: Remove reference
Old 10-05-2016, 03:13 AM   #10
quiris
Quote:
Originally Posted by jhowell View Post
"value" field has the binary content expressed in base-64.
Unfortunately, the decoded hyph_*.bin file isn't the same plain-text file as hyph_*.dic from OpenOffice. They must have mangled it into some binary form.
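
A quick way to run this kind of check in Python is sketched below: decode the base-64 string and apply a crude "does this look like text" heuristic. The sample data is made up for illustration, and the 5% threshold is an arbitrary choice of mine.

```python
import base64


def looks_like_text(data):
    """Crude heuristic: no NUL bytes and very few control characters."""
    if not data:
        return True
    if b"\x00" in data:
        return False
    control = sum(1 for byte in data if byte < 32 and byte not in (9, 10, 13))
    return control / len(data) < 0.05


# Made-up sample: a plain-text pattern file encoded as base-64.
encoded = base64.b64encode(b"UTF-8\nhy3ph\nhe2n\n").decode("ascii")
decoded = base64.b64decode(encoded)
print(looks_like_text(decoded))
print(looks_like_text(bytes([0, 1, 2, 255, 128, 7])))
```

A plain-text .dic file would pass this test, while a compiled or packed binary form would fail it.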