FontShrinker - tool to subset a font - Page 2

Hitch · 01-23-2013, 07:59 PM

Tox:

Stupid question: NEVER MIND. DUH.

H

Toxaris · 01-24-2013, 02:29 AM

Now I am curious what the question was...

Hitch · 01-24-2013, 03:10 AM

Quote:

Originally Posted by Toxaris

Now I am curious what the question was...

In short, it was: for those of us who won't know the final character set until after the ePUB is basically created, and have 20-30 XHTML files... is there an easier way to obtain all the text than copy-and-pasting each of the XHTML files into the box? And, is there any way that you can see to incorporate this with, say, ePUBtweak.exe, in that vein? So that Font Shrinker could scour the exploded files when you have ePUBtweak open, and obtain the character sets that way?

H.

Toxaris · 01-24-2013, 04:06 AM

For now, the only method is copy/pasting. I know that is not always handy, but it was the easiest to do and I needed the program now. I see the added value in having it read ePUB and/or XHTML, but that will be some work. The main problem would be in identifying only the required characters in a class.
I will take a look at ePUBtweak to see if I can use the output from it. It might be a good idea.

meme · 01-24-2013, 04:16 AM

The next version of Sigil will have a report listing all the characters visible in Book View. It's not by class though, so to limit it to sections of text would still require you to do some work. It may be it needs to be changed to use Code View - this might allow seeing what is in a class but it would be guessing what is actually visible in Book View (e.g. if a style hides the text using display:none or similar).

Toxaris · 01-24-2013, 05:05 AM

I have thought about this for a while. I think I will start working on the following over the weekend (depends on a lot of personal stuff...):
- ability to select an ePUB
- parse XHTML to find all characters in use by a certain CSS class
- open the fonts used in the ePUB and shrink each one according to the characters used for that font
- replace the fonts in the ePUB with the shrunk ones.

Don't expect it to be ready soon though, it needs quite some testing and the most difficult part will probably be the parsing of the stylesheet to find the classes where a font is defined/used. It might be that an intermediate version will be created where the style class names have to be entered manually.

As a special service to JSWolf I will automatically add the ligatures to the unique characters used.
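
A minimal sketch of the "find all characters in use by a certain CSS class" step from the list above, in Python with only the standard library. This is not FontShrinker's code; the names `ClassCharCollector` and `chars_for_class` are hypothetical. It assumes the class is set directly in a `class` attribute and that text in nested elements counts too. It does not resolve the stylesheet, so it cannot tell which font a class actually ends up using.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassCharCollector(HTMLParser):
    """Collect characters found inside elements carrying a given class."""

    def __init__(self, class_name):
        super().__init__(convert_charrefs=True)  # entities become characters
        self.class_name = class_name
        self.stack = []     # one bool per open element: inside the class?
        self.chars = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        inherited = bool(self.stack) and self.stack[-1]
        self.stack.append(inherited or self.class_name in classes)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1]:
            self.chars.update(c for c in data if not c.isspace())


def chars_for_class(xhtml, class_name):
    """Return the set of non-whitespace characters styled by class_name."""
    parser = ClassCharCollector(class_name)
    parser.feed(xhtml)
    parser.close()
    return parser.chars


sample = ('<html><body><h1 class="chapter">Chapter One</h1>'
          '<p>It was a dark night.<br/></p>'
          '<p class="chapter"><i>fin</i></p></body></html>')
print(sorted(chars_for_class(sample, "chapter")))
```

The manual intermediate version mentioned above would fit this shape: the user types the class name, and the tool only needs the XHTML, not the stylesheet.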

Hitch · 01-24-2013, 11:26 AM

Quote:

Originally Posted by Toxaris

I have thought about this for a while. I think I will start working on the following over the weekend (depends on a lot of personal stuff...):
- ability to select an ePUB
- parse XHTML to find all characters in use by a certain CSS class
- open the fonts used in the ePUB and shrink each one according to the characters used for that font
- replace the fonts in the ePUB with the shrunk ones.

Don't expect it to be ready soon though, it needs quite some testing and the most difficult part will probably be the parsing of the stylesheet to find the classes where a font is defined/used. It might be that an intermediate version will be created where the style class names have to be entered manually.

As a special service to JSWolf I will automatically add the ligatures to the unique characters used.

Well, that's a hell of a wishlist, and it would rock, but I'd be thrilled if it could simply peruse an ePUB for all the characters used in that ePUB, even if the classes are not discovered. By which I mean: let's say I have two fonts. One for the body; one for the chapter heads. By definition, the font for the body will have more characters, in all likelihood. However, I wouldn't care, at this point in time, if I had to feed the Shrinker all the chars in the ePUB, to shrink the Chapter head font.

To have it perfect, later, would be, as I said, amazing, but right this second, what I'd love is if it could just open the ePUB and say, "VOILA!" I don't even care if I have to manually replace the fonts, that's not a big deal.

Not that I'd turn DOWN Shrinker with all the extra goodies...just thinking aloud about what I, personally, need most. I realize my needs are probably different than almost everyone else's.

OH, also: a way to direct the location of the output of the created subsetted font would be super. While I'm wish-listing.

And if I didn't say it loudly enough, before: seriously, you are fabulous.

H
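
For the simpler request above (open the ePUB and list every character it uses, ignoring classes), a rough Python sketch could look like this. It relies only on the fact that an ePUB is a zip archive containing XHTML files. The names `TextCollector` and `chars_in_epub` are made up for illustration and are not part of FontShrinker. It assumes UTF-8 content and skips `<head>`, `<style>` and `<script>`. It does not know about text hidden by CSS.

```python
import zipfile
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather non-whitespace characters of body text from one (X)HTML file."""

    SKIP = {"head", "style", "script"}  # content that is never shown as text

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &eacute; etc. become chars
        self.skip_depth = 0
        self.chars = set()

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chars.update(c for c in data if not c.isspace())


def chars_in_epub(epub):
    """Return all characters used in the (X)HTML files of an ePUB.

    `epub` may be a file path or a binary file-like object.
    """
    chars = set()
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = TextCollector()  # fresh parser state per file
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chars |= parser.chars
    return chars
```

Something like `"".join(sorted(chars_in_epub("book.epub")))` (with `book.epub` as a placeholder path) would then give one string to paste into the box, instead of pasting 20-30 files one by one.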

Freeshadow · 01-24-2013, 12:31 PM

I'm really enthusiastic that this particular idea of epub tweaking found that much positive resonance - really no joking here.

Tox: while you work on manipulation of the font files you should consider auto-renaming them: both the filename & the font name stored inside the font file. AFAIR it's often required even in licences of free fonts when they are changed. While it's relatively meaningless for personal use, it's crucial as soon as your tool matures to become a part of the toolchain used by professional producers. (And aren't more optimized professional books a goal we all wish for?)

JSWolf · 01-24-2013, 12:47 PM

Quote:

Originally Posted by grannyGrumpy

Super! Would you like to be adopted?

After hearing about the problem with ligatures from Calibre subsetted fonts, I'm curious. Has anybody checked on whether this handles ligatures ok?

Thank you Toxaris, you get ten gold stars.

I did report the problem with ligatures and that has been fixed in Calibre.

JSWolf · 01-24-2013, 12:49 PM

Quote:

Originally Posted by Toxaris

As a special service to JSWolf I will automatically add the ligatures to the unique characters used.

Thank you very much! This will help for sure.

I've figured out what would work with the current version. Take the ePub, convert it to HTMLZ and run the HTML file through the subsetter and there you go.

meme · 01-24-2013, 12:55 PM

I'm not sure. Do you have an example epub (a link or just a small file is fine) that contains ligatures? The code literally just reports each unicode character that appears in the text (and if it has an entity name).

JSWolf · 01-24-2013, 01:01 PM

Quote:

Originally Posted by meme

I'm not sure. Do you have an example epub (a link or just a small file is fine) that contains ligatures? The code literally just reports each unicode character that appears in the text (and if it has an entity name).

The ePub does not contain the ligatures. ADE 2.0 and Calibre (and maybe other reading software) convert to using ligatures. So for example, if your text has a word such as flight, the fl will be converted to the ligature and displayed that way. Your code would have to handle fl both as separate f and l and as the ligature, for reading software that does and does not convert to ligatures.

Oh and would it be possible to display each character for a given font for embedded fonts?

Jellby · 01-24-2013, 01:20 PM

I'll try to explain. A text typically has no explicit ligatures, it could have some, but it should not, and I've only seen some very old text files with them. What a text has is just normal unicode characters, let's say a text consists of the single word "office", that's only 5 different letters: c, e, f, i, o.

Now, a font could have ligatures defined, and a reading software may use them (although many do not, I'm afraid). Let's say that the font we are dealing with has the ligatures "fi", "ffi" and "fj" defined. Defining a ligature means that the font has a glyph (a character shape) for the combination "fi" and some instructions saying that whenever there's an f and an i in the text, they should be rendered as the "fi" ligature and not as the separate characters (ditto for "ffi" and "fj").

OK, then our text will ideally be displayed as 4 glyphs: "o", "ffi", "c", "e". There are different things a font subsetter could do:

1) Remove everything but "o", "f", "i", "c", "e", including ligatures and their definition. This is not ideal, but it's probably the simplest.

2) Same as 1, but do not remove ligatures or their definition. That's much better, but it leaves unused glyphs, such as "fi" or "fj".

3) Detect ligatures, find out that "i" and "f" are never used alone, and remove everything but "o", "ffi", "c", "e". This is not a good idea, as renderers that do not support ligatures will not be able to display "f" and "i".

4) Remove all unused single characters, and related ligatures. This would remove "fj", since "j" is not in the source text, but leave "fi" since both "f" and "i" are, although the "fi" ligature is never used (because we have "ffi" already). I think this is the perfect combination of subsetting and not too demanding.

5) Remove some or all ligatures (the glyphs), but do not remove their definitions. This is not a good idea either, and I think this was the bug in Calibre. It means a renderer supporting ligatures would believe there is a ligature to use for "ffi", but it wouldn't find it.

So, if you can, go for #4. But things may be significantly harder. A font (particularly an OTF one) may contain other alternate shapes for glyphs (final forms, swash forms, older variants, small-caps, etc.). Those are currently unused by practically all renderers, but there's still hope that some day we'll be able to enjoy some more advanced typesetting options...
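
Option 4 can be sketched in a few lines of Python. This only illustrates the selection rule, not how any real subsetter works: the function name `plan_subset` is hypothetical, and the font is modelled as a plain set of characters plus a list of ligature component strings rather than real font tables.

```python
def plan_subset(text, font_chars, font_ligatures):
    """Decide what to keep under option 4.

    text           -- the book text
    font_chars     -- characters the font has single glyphs for (assumed input)
    font_ligatures -- component strings of the font's ligatures, e.g. "ffi"
    """
    used = {c for c in text if not c.isspace()}
    keep_chars = used & set(font_chars)
    # Keep a ligature only if every one of its components survives.
    keep_ligatures = {lig for lig in font_ligatures if set(lig) <= keep_chars}
    return keep_chars, keep_ligatures


chars, ligs = plan_subset("office", "abcdefghijklmno", ["fi", "ffi", "fj"])
print(sorted(chars), sorted(ligs))
```

With the "office" example this keeps c, e, f, i, o plus the "fi" and "ffi" ligatures and drops "fj", which matches the description of #4 above. The single f and i glyphs stay available for renderers that ignore ligatures.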

Freeshadow · 01-24-2013, 01:21 PM

Just what jellby said.
I was slower and less detailed at it.

Toxaris · 01-24-2013, 01:33 PM

For now I will probably just add the few ligature glyphs. There aren't that many, so the impact on the size is limited. I should think about the smallcaps, but that one will be at the bottom of the list.

Let me first work on the list and look for pink bidets later...
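
One way to read "just add the few ligature glyphs" is to extend the character list with the Latin ligature code points from the Unicode Alphabetic Presentation Forms block (U+FB00 to U+FB06). The Python sketch below does that, but only for ligatures whose component letters are already in the list, which is an assumption and a slight variation on adding them all. The name `with_ligatures` is hypothetical. Note that many fonts expose their ligatures only through substitution rules, not through these code points, so this would not catch every ligature glyph.

```python
import unicodedata

# Latin ligature characters: ff, fi, fl, ffi, ffl, long-s-t, st
LATIN_LIGATURES = [chr(cp) for cp in range(0xFB00, 0xFB07)]


def with_ligatures(chars):
    """Return chars plus every Latin ligature whose letters are all present."""
    result = set(chars)
    for lig in LATIN_LIGATURES:
        # NFKD splits a ligature character into its component letters.
        parts = unicodedata.normalize("NFKD", lig)
        if set(parts) <= result:
            result.add(lig)
    return result


extra = with_ligatures(set("office")) - set("office")
print([f"U+{ord(c):04X}" for c in sorted(extra)])
```

Because the input is a set of characters, the check cannot tell whether "ff" ever occurs as a pair in the text, only that "f" occurs at all.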
