Looking for a tool to help strip fonts of unnecessary characters


eriĉjo
06-28-2010, 01:56 AM
I am hand-converting a PDF book to ePub and have run into a problem. I don't want to use the PDF's fonts because they are under a proprietary license, so I'm using the closest open source equivalents I can find. The trouble is that I am embedding each entire font in the ePub file, which makes it much larger than it needs to be. What I'd like to do is strip all unnecessary characters from the font so that it is as small as possible. I know how to do this with FontForge; what I don't know is how to determine exactly which Unicode characters are used in a given work. The author likes to use various characters here and there beyond the normal Esperanto ones (most of the English alphabet, plus ĉĈĝĜĥĤĵĴŝŜŭŬ). I'm worried about missing some of them: if I guessed, I would have to check the whole document by hand. Is it possible to trick Acrobat Pro into doing it for me (by converting the text to PDF and then having Acrobat do it)? Are there scripts for detecting which characters are used in a Unicode text file? Any ideas for solving this problem are appreciated.

JvdW
06-28-2010, 06:18 AM

The following link might help you get started:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=UnicodeCharacterCount

If you're handy with Perl, you might be able to use the counts to write a FontForge script that generates the stripped font automatically.
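For readers who would rather not set up the linked script, here is a minimal Python sketch of the same idea: count how often each character occurs in a piece of Unicode text. It is not the SIL tool, only an illustration. The function names `count_characters` and `report` and the sample string are made up for this example, and control characters such as newlines are skipped on the assumption that they need no glyph.

```python
import unicodedata
from collections import Counter


def count_characters(text):
    """Count every character except control characters (newline, tab, ...).

    Spaces are kept on purpose: a subsetted font still needs a space glyph.
    """
    return Counter(ch for ch in text if unicodedata.category(ch) != "Cc")


def report(counts):
    """Return one tab-separated line per character, sorted by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        # Some code points have no name; fall back to a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X}\t{n}\t{ch}\t{name}")
    return lines


# Hypothetical sample text. For a real book, read the file instead, e.g.
#   text = open("book.txt", encoding="utf-8").read()
sample = "E\u0125o\u015dan\u011do \u0109iu\u0135a\u016dde\n"
for line in report(count_characters(sample)):
    print(line)
```

The file is assumed to be UTF-8. If the text mixes precomposed and decomposed accents, normalizing it first with `unicodedata.normalize("NFC", text)` would probably give cleaner counts.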

Regards,

Joop

eriĉjo
06-28-2010, 09:44 PM
Yes, thank you, this script helped a lot. A script to process the font automatically might be nice, but FontForge already lets you select glyphs by Unicode number, so I've decided to do this one by hand. If I end up having to do this often, I'll write a script and share it.
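Selecting glyphs by Unicode number is less tedious when the code points are grouped into runs first. The following is only a sketch of that preparation step, not a FontForge script: the helper names `codepoint_ranges` and `format_ranges` are invented here, and the output is plain text to read off while selecting glyphs by hand.

```python
import unicodedata


def codepoint_ranges(text):
    """Collapse the code points used in text into sorted (first, last) runs."""
    # Control characters (newline, tab, ...) are ignored; spaces are kept.
    points = sorted({ord(ch) for ch in text if unicodedata.category(ch) != "Cc"})
    ranges = []
    for cp in points:
        if ranges and cp == ranges[-1][1] + 1:
            # Extends the current run by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # Starts a new run.
            ranges.append((cp, cp))
    return ranges


def format_ranges(ranges):
    """Render runs as 'U+0061-U+0063' and single code points as 'U+0109'."""
    return [
        f"U+{first:04X}" if first == last else f"U+{first:04X}-U+{last:04X}"
        for first, last in ranges
    ]


# The Esperanto letters mentioned earlier in the thread, as escapes.
esperanto = ("\u0109\u0108\u011d\u011c\u0125\u0124"
             "\u0135\u0134\u015d\u015c\u016d\u016c")
print(", ".join(format_ranges(codepoint_ranges(esperanto))))
```

Everything outside the printed runs is a candidate for deletion from the font, though it may be wise to keep basic punctuation and digits even if the current text happens not to use them.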

Keeping a record for myself and for those who come after me doing the same thing:

I downloaded just about every ePub reader for the Mac, trying to cut and paste the entire book so that I could feed it into the above script. Only Stanza could do this adequately; ADE, Calibre, and FBReader were all completely unsuitable for the task. Various word processors also failed to preserve the Esperanto characters. The Mac Terminal with vi worked fine, though.
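An alternative to copying text out of a reader is to read it straight from the ePub, which is a ZIP archive of XHTML files. The sketch below uses only the Python standard library. It assumes the content files are UTF-8 and end in .xhtml, .html, or .htm, which is common but not guaranteed. The names `epub_text` and `_TextExtractor` and the tiny in-memory sample book are hypothetical and exist only to show the idea.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def epub_text(path_or_file):
    """Return the concatenated text of every (X)HTML file in the archive."""
    parts = []
    with zipfile.ZipFile(path_or_file) as book:
        for name in book.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextExtractor()
                # Assumption: content documents are encoded as UTF-8.
                parser.feed(book.read(name).decode("utf-8"))
                parser.close()
                parts.extend(parser.chunks)
    return "".join(parts)


# Build a tiny stand-in for an ePub in memory so the sketch runs as-is.
page = ("<html><head><style>p{margin:0}</style></head>"
        "<body><p>\u0109apo &amp; \u016dato</p></body></html>")
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("OEBPS/ch1.xhtml", page.encode("utf-8"))
    z.writestr("OEBPS/style.css", "body{}")
buf.seek(0)

sample_text = epub_text(buf)
print(sorted(set(sample_text)))
```

For a real book, `epub_text("book.epub")` would take the place of the in-memory sample, and the resulting string can be fed to whatever character-counting script is being used.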

eriĉjo
06-29-2010, 10:44 PM
It's done! For anyone who likes short stories in Esperanto which are readable on a Sony in the ePub format (no doubt there are millions of you out there), here you go:

http://timwestover.com/marvirinstrato/?page_id=7

And thank you JvdW for your help.

JvdW
06-30-2010, 04:40 AM
You're welcome. Part of the credit goes to Google, which helped me find it. I have to admit I didn't come up with that link on the first try, but being a bit creative with the search terms and reading between the lines got me there.

The ePub looks nice in ADE, but in Sigil (0.23) it looks like it's using a bitmap font instead of the included TrueType fonts.

Regards,

Joop