05-17-2016, 07:33 PM   #1
acabal
Hyphenate your ebook files from the command line

Hi everyone! Long time reader, first time poster.

I'm working on a free open-source ebook project called Standard Ebooks. Its goal is to bring classics that are free of copyright restrictions (i.e. public domain books) up to modern technological and editorial standards--in other words, to produce commercial-quality liberated ebooks for true book lovers.

Part of the "modern technological standards" bit is making an effort at supporting auto hyphenation in our ebooks. Ideally, since ebooks are basically just web pages, ereading software would simply use CSS's `hyphens` property to do that automatically. In reality, almost no ereading software has that ability right now. But lately a lot of ereading software has gained the ability to understand soft hyphens, and that's a small step in the right direction.

Of the major ereading platforms, at least Google Play Books and Kindle support soft hyphens. (This is Kindle's much-vaunted "enhanced typography" for the azw3 file type... ugh.)

I searched around for programs that could add soft hyphens automatically, and came across an excellent Calibre plugin that can do that via a GUI. But I needed to automate the process from the command line, so that we could automatically build compatible ebooks from our untainted epub3 sources. Browsing through that thread suggested that a few people were looking for a similar solution.

So I went ahead and created a Python script that will automatically add soft hyphens to the text of any xhtml file, and thus to ebooks. I thought I'd share it with you all in case someone found it helpful.
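
For anyone curious what "adding soft hyphens" amounts to, here is a minimal sketch of the idea, not the actual script. It assumes a tiny hand-written syllable table (`TOY_SYLLABLES`, purely hypothetical) in place of a real hyphenation dictionary, and it works on plain text only; a real tool has to touch only the text nodes of the xhtml and leave the markup alone.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN: invisible unless a line breaks there
WORD = re.compile(r"[^\W\d_]+")  # runs of letters (no digits, no underscores)

# Hand-written stand-in for a real hyphenation dictionary (illustration only).
TOY_SYLLABLES = {
    "hyphenation": ["hy", "phen", "ation"],
    "typography": ["ty", "pog", "ra", "phy"],
}


def toy_syllabify(word):
    """Split a word into syllables using the toy table, keeping its original case."""
    parts = TOY_SYLLABLES.get(word.lower())
    if parts is None:
        return [word]  # unknown words are left unsplit
    pieces, start = [], 0
    for part in parts:
        pieces.append(word[start:start + len(part)])
        start += len(part)
    return pieces


def hyphenate_text(text, syllabify=toy_syllabify):
    """Join the syllables of every word in text with soft hyphens."""
    return WORD.sub(lambda m: SOFT_HYPHEN.join(syllabify(m.group(0))), text)


result = hyphenate_text("Typography and hyphenation")
# Show the otherwise invisible soft hyphens as "|".
print(result.replace(SOFT_HYPHEN, "|"))  # Ty|pog|ra|phy and hy|phen|ation
```

The `syllabify` argument is there so a dictionary-based hyphenator can be dropped in where the toy table is.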

To install on Ubuntu 16.04:

Code:
#Make sure you have pip3 installed
sudo apt install python3-pip
	
#Install some dependencies
sudo pip3 install pyhyphen beautifulsoup4
	
#Download the script and make it executable
wget https://raw.githubusercontent.com/standardebooks/tools/master/hyphenate
chmod +x hyphenate

Adding soft hyphens to an epub file

The script operates on single xhtml files, but since epub files are just zip files filled with xhtml files, you can hyphenate a whole ebook by unzipping the epub, running hyphenate on all of the xhtml files within, and re-zipping it:

Code:
#Blow up our epub file
unzip mybook.epub -d mybook-extracted

#Hyphenate all (x)html files
find mybook-extracted -iname "*htm*" -exec ./hyphenate "{}" \;

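#Rebuild our epub file from inside the extracted folder, so that paths are relative to the book root
#(adjust OEBPS if your content folder has a different name)
cd mybook-extracted
#mimetype must be the first entry in the archive and stored uncompressed
zip -0 -X ../mybook-hyphenated.epub mimetype
zip -9 -X --no-dir-entries --recurse-paths ../mybook-hyphenated.epub META-INF OEBPS
cd ..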
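
If you would rather do the re-zipping step from Python, here is a rough sketch using only the standard library. The function name `build_epub` is my own invention for illustration. It assumes the extracted folder has a `mimetype` file at its top level, and that the output path lies outside the folder being zipped.

```python
import os
import zipfile


def build_epub(source_dir, epub_path):
    """Zip an extracted epub folder, writing mimetype first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The epub container format expects "mimetype" to be the first entry,
        # stored without compression.
        archive.write(os.path.join(source_dir, "mimetype"), "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                relative_path = os.path.relpath(full_path, source_dir)
                if relative_path == "mimetype":
                    continue  # already written above
                # Archive names are relative to the book root, with forward slashes.
                archive.write(full_path, relative_path.replace(os.sep, "/"),
                              compress_type=zipfile.ZIP_DEFLATED)
```

You would call it as `build_epub("mybook-extracted", "mybook-hyphenated.epub")`. Unlike the zip command, it picks up every file in the folder, so there is no need to name the content folder.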

Adding soft hyphens to Kindle ebook files

For those of you using Kindle devices or software, from my limited experiments it appears that only azw3 files honor soft hyphens right now. KFX files apparently hyphenate automatically and so don't need them. To hyphenate an azw3 file, you can use Calibre to convert it to epub first, perform the hyphenation, then convert it back to azw3:

Code:
#Use Calibre's command-line tools to convert your Kindle book to epub
ebook-convert mybook.azw3 mybook.epub

#Perform the steps for epub as listed above

#After you've done that, use Calibre to convert back to azw3
ebook-convert mybook-hyphenated.epub mybook-hyphenated.azw3

A note on languages

pyhyphen requires that you install dictionaries for each language you want to process. I believe it downloads a dictionary for your system's default language when it's installed, but there are instructions on downloading additional dictionaries in the pyhyphen documentation.

The script tries to guess the xhtml file's language by looking for a `lang` attribute on the `<html>` element. If your files don't have one, you can force the script to use a specific language like so:

Code:
./hyphenate --language="en-US" myfile.xhtml
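
To check up front which files will need the `--language` flag, something like the following sketch can report the language a file declares. It is not the script's own detection code. The helper name `detect_language` is hypothetical, and it assumes the file parses as well-formed XML; a file that uses undeclared entities such as `&nbsp;` would make the standard-library parser raise an error.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def detect_language(xhtml_text, fallback=None):
    """Return the language declared on the root element, or fallback if none."""
    root = ET.fromstring(xhtml_text)
    return root.get(XML_LANG) or root.get("lang") or fallback


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB" lang="en-GB">'
    "<head><title>Sample</title></head><body><p>Hello</p></body></html>"
)
print(detect_language(sample))  # en-GB
print(detect_language("<html><body/></html>", fallback="en-US"))  # en-US
```

Any file for which this returns the fallback is one where you would want to pass the language explicitly.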

I hope someone finds this useful. We also have a few more command-line tools for processing ebooks in our complete tools repository, which some of you might find helpful. This and all our tools are GPLv3, and contributions via GitHub are welcome. And if you'd like to volunteer at the Standard Ebooks project and bring a liberated classic up to our high standards, drop me a line!

Last edited by acabal; 05-18-2016 at 10:08 PM.