Unique characters used

Toxaris · 03-14-2012, 05:00 AM

In another thread about including fonts, font size came up, of course. Fonts are usually big, so they add to the ePUB size.
However, it is possible to reduce a font to only the characters you actually use (headers, notes, foreign text). This reduces the size tremendously.
The usual methods are FontSquirrel or FontForge. The biggest problem is usually how to determine which characters you actually need.

Therefore I created a Word macro to get the unique characters in a document, either for all fonts or for a specific font. With that list the font can be reduced.

Please take care: some (most?) fonts are not allowed to be distributed. Since this only produces a subset, it might be allowed, but to be safe, only use it on free fonts.

* Update: fixed a small error.

frostschutz · 03-14-2012, 05:27 AM

Well, if the subset is complete enough to cover an entire book, chances are it still won't be allowed.

I wasn't too successful with FontForge when it came to removing characters from a font - somehow what FontForge saved was several times the size of the original font.

Toxaris · 03-14-2012, 07:57 AM

As long as you actually delete the glyphs rather than just clearing them, the result should definitely be smaller. You can also use FontSquirrel and input the glyphs you need.

Jellby · 03-14-2012, 10:44 AM

I guess you can create a single XHTML document, apply "display:none" to everything except the parts that use the font you want, open the file in a browser, copy the text somewhere else and identify the unique characters there.

Trouhel · 03-14-2012, 12:41 PM

@frostschutz :

Don't know if this may help, but here is how I subset fonts with FontForge, using a script which:

1) opens the font with Open()
2) unselects all characters with SelectNone()
3) selects all needed characters with SelectMore()
4) inverts the selection with SelectInvert()
5) deletes the selected characters with Clear()
6) creates a new font with Generate()

If some needed characters are references, you need to select both the characters and the referenced characters.

As an example: a 138 KB TrueType font with 1252 characters, subset to 145 characters, then weighs 32 KB.
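
The six steps above can be sketched as a small shell function that writes out a FontForge script. This is only a sketch under assumptions, not Trouhel's actual script: the function name `make_subset_script`, the file names, and the use of the `0u` code point notation in SelectMore() are all assumed here, and referenced glyphs are not added automatically.

```shell
#!/bin/bash
# Hypothetical helper: print a FontForge script that keeps only the given characters.
# $1 = input font, $2 = output font, $3 = string of characters to keep.
make_subset_script() {
    local src=$1 dst=$2 chars=$3
    printf 'Open("%s")\n' "$src"
    printf 'SelectNone()\n'
    # one character per line; needs a UTF-8 locale for non-ASCII input
    printf '%s' "$chars" | grep -o . | while IFS= read -r ch; do
        # a leading quote makes printf print the code point of the character
        printf 'SelectMore(0u%04x)\n' "'$ch"
    done
    printf 'SelectInvert()\n'
    printf 'Clear()\n'
    printf 'Generate("%s")\n' "$dst"
}
```

Typical use would be to redirect the output to a file, for example `make_subset_script Font.ttf Font-subset.ttf "$chars" > subset.pe`, and then run `fontforge -script subset.pe`.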

SBT · 03-14-2012, 04:05 PM

Here's how you can find the characters used in an XHTML file (tags are excluded) in a Unix bash shell:

Code:

cat file.xhtml|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g'  |sort -u |tr "\n" " "

If you want to just find the characters in headers, you can try:

Code:

grep "<h[1-4]" OEBPS/vol1/12.xhtml|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g'  |sort -u |tr "\n" " "

Toxaris · 03-14-2012, 04:15 PM

That will work, but I usually want to know it for a specific font, and since Word is already part of my process...

BTW, the macro is updated. Stupid VBA is not always case-sensitive.

Jellby · 03-15-2012, 04:01 PM

Those one-liners don't decode entities, I'm afraid (although they can be converted beforehand with recode).

SBT · 03-16-2012, 03:50 AM

@Jellby: *Sigh* isn't there just always something... thanks for pointing it out. I'll try again:

Code:

cat file.xhtml|xml2asc|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g'  |sort -u |tr "\n" " "

How many lines do you need to do this properly, I wonder; find which tags use special fonts, extract their content etc.?

Jellby · 03-16-2012, 12:36 PM

Well, xml2asc on my box doesn't seem to do what you apparently use it for:

Reads an UTF-8 encoded text from standard input and writes to standard output, converting all non-ASCII characters to &#nnn; entities, so that the result is ASCII-encoded.

Also, consider what to do with an input such as:

Code:

Let a &lt; &epsilon; &gt; b

Do you get <, ε, and > in the character list?
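
One way to approach the example above in the shell is to strip the tags first and only then decode entities, so that a decoded `<` can never be mistaken for the start of a tag. A minimal sketch, assuming GNU sed and grep: the function name `unique_chars` is made up here, only the five predefined XML entities are decoded (named HTML entities such as &epsilon; would still need something like recode), and tags that span several lines are not handled.

```shell
#!/bin/bash
# Hypothetical helper: read XHTML on stdin, print each distinct character once.
unique_chars() {
    # 1. strip tags, 2. decode entities, with &amp; last so that
    #    "&amp;lt;" becomes the text "&lt;" and not "<"
    sed -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g' |
    grep -o . |            # one character per line
    LC_ALL=C sort -u |     # byte order, so the result is reproducible
    tr -d '\n'
    echo
}
```

For a file this would be called as `unique_chars < file.xhtml`.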

SBT · 03-16-2012, 05:28 PM

@Jellby: Right.... Not exactly a one-liner anymore, and the script has all the clarity of white noise. I couldn't be bothered to double-check @font-face in the css, and a tag with a font nested inside another tag with a font is not handled correctly (the outer tag will include characters from the inner tag). Typically used as:
./script.sh <xhtml-file> <css-file>
If the css is inline, it's just
./script.sh <xhtml-file>
It tries to detect extra fonts in the css file and which classes/ids use them, and lists which characters are used for which fonts.

Code:

#!/bin/bash
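file=$1
# last argument: the css file, or the xhtml file itself when the css is inline
css=${!#}
xmlns=$(grep -o "xmlns=.[^\"']\+" "${file}"|cut -c8-180)
# remove comments
awk 'BEGIN{RS="\(<.\-\-\|\-\->\)"} {if ((NR % 2)==1) print;}' "$file" > tmp
#replace html entities
for x in $(sed 's/&[a-zA-Z0-9]\+;/&\n/g' tmp|grep -o "&[a-zA-Z0-9]\+;")
do sed -i "s/${x}/$(echo $x|recode HTML..UTF-8)/g" tmp
done
# extract inline css
# (grep -q prints nothing, so test its exit status, not its output;
#  the sed expression is quoted so the shell does not treat < as a redirect)
(if grep -q '</style' "$css"
 then sed -n '/<style/,/<\/style/p' "$css"
 else cat "$css"
fi)|\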
tr "\n" " " |\
sed 's/[>}]/&\n/g' |\
grep -v "@font-face" |\
sed -n "/font-family: *[\"']/{s/^ *\(.*\) *{.*font-family: *[\"']\([^\"']\+\).*/\1 \2/;p}" |\
sed -e '/^\./s/^/\*#/' -e 's#\(.*\)\.\([^ \]\+\)#a:\1[@class="\2"]#' -e 's/.*#\([^ ]*\)/*[@id="\1"]/'|\
while read line
do
echo
echo  "${line#* }: "
 echo -e "setns a=${xmlns}\ncat //${line%% *}//text()" |\
xmllint --shell tmp |\
sed -e 1d -e '/^\(\/ >\| -\{7\}$\)/d' -e 's/./&\n/g' |\
sort -u |\
sed '/[ \t]/d' |\
sed -n 'H;${x;s/\n//g;p}'
done
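
The font-detection step on its own can be illustrated with a much smaller sketch that only lists the family names declared in @font-face blocks of a css file. This is not part of the script above; the function name `list_font_families` is made up here, and it assumes GNU grep and sed and a stylesheet without nested braces inside @font-face.

```shell
#!/bin/bash
# Hypothetical helper: list the font-family names declared in @font-face blocks.
# $1 = css file
list_font_families() {
    tr '\n' ' ' < "$1" |                 # join lines so each block is on one line
    grep -o '@font-face[^}]*}' |         # one @font-face block per output line
    sed -n "s/.*font-family *: *[\"']\{0,1\}\([^\"';}]*[^\"';} ]\).*/\1/p" |
    sort -u
}
```

Called as `list_font_families OEBPS/style.css` (the path is only an example), it prints one family name per line.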

Toxaris · 03-16-2012, 05:47 PM

Looking at the scripts, I'd rather use my Word macro...

SBT · 03-16-2012, 06:17 PM

I have to admit it is perhaps not the most aesthetically pleasing program to rest one's weary eyes upon... However, it might come in handy when refurbishing existing epubs, since it operates on xhtml files.

Toxaris · 03-18-2012, 03:52 AM

Hmm, someone mentioned smallcaps. It might be an idea to support smallcaps as well. I think I will enhance the macro further later this week.

Jellby · 03-18-2012, 06:02 AM

The problem with smallcaps is you'd like them to match the normal text. If you embed a font for smallcaps and not for the rest, you'll create the possibility of having, for instance, a sans-serif text with serif smallcaps (ugh).



