03-14-2012, 05:00 AM | #1 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Unique characters used
In another thread about including fonts, font file size came up, of course. Fonts are usually big, so they add to the size of the ePUB.
However, it is possible to reduce a font to only the characters you actually use (headers, notes, foreign text). This reduces the size tremendously. The usual tools for this are FontSquirrel and FontForge. The biggest problem is usually determining which characters you actually need. Therefore I created a Word macro that lists the unique characters in a document, either for all fonts or for one specific font. With that list the font can be reduced.
Please take care: some (most?) fonts may not be distributed. Since this only produces a subset, it might be allowed, but to be safe, only use it on free fonts.
* Update: fixed a small error.
Last edited by Toxaris; 03-14-2012 at 04:14 PM. |
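For readers who do not have Word in their workflow, the idea of listing the unique characters of a text can be sketched in a few lines of Python. This is not the macro from this post; the function names `unique_characters` and `as_code_points`, and the choice to drop whitespace, are assumptions made purely for illustration.

```python
def unique_characters(text: str) -> str:
    """Return each distinct non-whitespace character once, sorted by code point."""
    return "".join(sorted({ch for ch in text if not ch.isspace()}))


def as_code_points(chars: str) -> str:
    """Format characters as a comma-separated list of U+XXXX code points."""
    return ",".join(f"U+{ord(ch):04X}" for ch in chars)


if __name__ == "__main__":
    sample = "Chapter One: Café"  # hypothetical heading text
    chars = unique_characters(sample)
    print(chars)
    print(as_code_points(chars))
```

A code point list can be easier to check by eye than the raw characters, especially for accented or look-alike glyphs.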
03-14-2012, 05:27 AM | #2 |
Linux User
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
|
Well, if the subset is complete enough to cover an entire book, chances are it still won't be allowed.
I wasn't too successful with FontForge when it came to removing characters from a font: somehow what FontForge saved was several times the size of the original font. |
03-14-2012, 07:57 AM | #3 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
As long as you actually delete the glyphs and not clear them, it should definitely be smaller. You can also use FontSquirrel and input the glyphs you need.
|
03-14-2012, 10:44 AM | #4 |
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I guess you can create a single XHTML document, apply "display:none" to everything except the parts that use the font you want, open the file in a browser, copy the text somewhere else and identify the unique characters there.
|
03-14-2012, 12:41 PM | #5 |
Enthusiast
Posts: 25
Karma: 10
Join Date: Oct 2011
Device: none
|
@frostschutz :
Don't know if this may help, but here is how I subset fonts with FontForge, using a script which:
1) opens the font with Open()
2) unselects all characters with SelectNone()
3) selects all needed characters with SelectMore()
4) inverts the selection with SelectInvert()
5) deletes the selected characters with Clear()
6) creates a new font with Generate()
If some needed characters are references, you need to select both the characters and the referenced characters.
As an example: a TrueType font with 1252 characters (138KB), subsetted to 145 characters, then weighs 32KB. |
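The six steps above can be turned into a FontForge script automatically once the list of needed characters is known. The Python sketch below only writes the script text; it is not the poster's actual script. The helper name `build_subset_script` is hypothetical, and it assumes FontForge's native scripting accepts `0uXXXX` as a Unicode code point and `$1`/`$2` as the script's arguments. It does not add referenced glyphs for you, so the caveat about references still applies.

```python
def build_subset_script(keep: str) -> str:
    """Build the text of a FontForge native script that keeps only `keep`."""
    lines = ["Open($1)", "SelectNone()"]
    for code_point in sorted({ord(ch) for ch in keep}):
        # Assumption: 0uXXXX is FontForge's notation for a Unicode code point
        lines.append(f"SelectMore(0u{code_point:04X})")
    # Invert the selection so that everything NOT needed is selected, then clear it
    lines += ["SelectInvert()", "Clear()", "Generate($2)"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a script that keeps only "A", "b" and "c"
    print(build_subset_script("Abc"), end="")
```

Saved to a file such as subset.pe (a made-up name), the result would presumably be run as `fontforge -script subset.pe input.ttf output.ttf`.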
03-14-2012, 04:05 PM | #6 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
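Here's how you can find the characters used in an XHTML file (tags are excluded) in a Unix bash shell:
Code:
# strip tags, put each character on its own line, keep the unique ones (needs GNU sed)
cat file.xhtml | sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' | sort -u | tr "\n" " "
Code:
# the same, but only for lines that contain an h1-h4 heading tag
grep "<h[1-4]" OEBPS/vol1/12.xhtml | sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' | sort -u | tr "\n" " "
|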
03-14-2012, 04:15 PM | #7 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
That will work, but usually I want to know it for a specific font and since Word is already in my process...
BTW, the macro is updated. Stupid VBA is not always case-sensitive. |
03-15-2012, 04:01 PM | #8 |
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Those one-liners don't decode entities, I'm afraid (although they can be converted beforehand with recode).
|
03-16-2012, 03:50 AM | #9 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
@Jellby: *Sigh* isn't there just always something... Thanks for pointing it out. I'll try again:
Code:
cat file.xhtml|xml2asc|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' |sort -u |tr "\n" " " |
03-16-2012, 12:36 PM | #10 |
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Well, xml2asc on my box doesn't seem to do what you apparently use it for:
"Reads an UTF-8 encoded text from standard input and writes to standard output, converting all non-ASCII characters to &#nnn; entities, so that the result is ASCII-encoded."
Also, consider what to do with an input such as:
Code:
Let a < ε > b
|
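One way around both problems (entities, and text that looks like a tag once entities are decoded) is to let a real parser separate markup from text and decode entities only inside the text. A minimal sketch with Python's standard `html.parser` follows; the names `TextCharCollector` and `text_characters` are made up for this example, and skipping `script` and `style` content is an assumption about what should count as text.

```python
from html.parser import HTMLParser


class TextCharCollector(HTMLParser):
    """Collect the distinct characters found in text nodes, ignoring markup."""

    SKIPPED = {"script", "style"}  # their content is not rendered text

    def __init__(self):
        # convert_charrefs=True decodes &lt;, &epsilon;, &#949; ... in text nodes
        super().__init__(convert_charrefs=True)
        self.chars = set()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars.update(ch for ch in data if not ch.isspace())


def text_characters(markup: str) -> str:
    """Return the unique non-whitespace text characters, sorted by code point."""
    collector = TextCharCollector()
    collector.feed(markup)
    collector.close()
    return "".join(sorted(collector.chars))


if __name__ == "__main__":
    print(text_characters("<p>Let a &lt; &epsilon; &gt; b</p>"))
```

Because tags are recognised before entities are decoded, the decoded `<` and `>` end up in the character list instead of being stripped as a tag.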
03-16-2012, 05:28 PM | #11 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
@Jellby: Right.... Not exactly a one-liner anymore, and the script has all the clarity of white noise. I couldn't be bothered to double-check @font-face in the CSS, and a tag with a font inside another tag with a font is not handled correctly (the outer tag will include characters from the inner tag).
Typically used as:
./script.sh <xhtml-file> <css-file>
If the CSS is inline, it's just:
./script.sh <xhtml-file>
It tries to detect extra fonts in the CSS file and which classes/ids use them, and lists which characters are used for which font.
Code:
#!/bin/bash
file=$1
css=${!#}   # last argument; the same as $file when the CSS is inline
xmlns=$(grep -o "xmlns=.[^\"']\+" "${file}" | cut -c8-180)
# remove comments
awk 'BEGIN{RS="\(<.\-\-\|\-\->\)"} {if ((NR % 2)==1) print;}' "$file" > tmp
# replace html entities
for x in $(sed 's/&[a-zA-Z0-9]\+;/&\n/g' tmp | grep -o "&[a-zA-Z0-9]\+;")
do
  sed -i "s/${x}/$(echo $x | recode HTML..UTF-8)/g" tmp
done
# extract inline css
(if grep -q '</style' "$css"
then
  sed -n '/<style/,/<\/style/p' "$css"
else
  cat "$css"
fi) |\
tr "\n" " " |\
sed 's/[>}]/&\n/g' |\
grep -v "@font-face" |\
sed -n "/font-family: *[\"']/{s/^ *\(.*\) *{.*font-family: *[\"']\([^\"']\+\).*/\1 \2/;p}" |\
sed -e '/^\./s/^/\*#/' -e 's#\(.*\)\.\([^ \]\+\)#a:\1[@class="\2"]#' -e 's/.*#\([^ ]*\)/*[@id="\1"]/' |\
while read line
do
  echo
  echo "${line#* }: "
  # list the text nodes below the matching elements, one character per line
  echo -e "setns a=${xmlns}\ncat //${line%% *}//text()" |\
  xmllint --shell tmp |\
  sed -e 1d -e '/^\(\/ >\| -\{7\}$\)/d' -e 's/./&\n/g' |\
  sort -u |\
  sed '/[ \t]/d' |\
  sed -n 'H;${x;s/\n//g;p}'
done
Last edited by SBT; 03-16-2012 at 06:12 PM. |
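The nesting limitation mentioned above (the outer tag also collecting the inner tag's characters) can be avoided by tracking, for each open element, which font is in effect. The Python sketch below illustrates that idea and is not a port of the script. It assumes well-formed XHTML and that the caller already knows which classes map to which fonts (the `class_to_font` dictionary is hypothetical); it does not parse CSS, and it ignores id and element selectors.

```python
from html.parser import HTMLParser


class FontCharCollector(HTMLParser):
    """Group text characters by the font of the innermost enclosing element
    whose class is listed in class_to_font."""

    def __init__(self, class_to_font):
        super().__init__(convert_charrefs=True)  # entities in text are decoded
        self.class_to_font = class_to_font
        self.used = {}    # font name -> set of characters
        self._stack = []  # (tag, font in effect inside that element)

    def handle_starttag(self, tag, attrs):
        # Start from the parent's font, then let this element's classes override it
        font = self._stack[-1][1] if self._stack else None
        for cls in (dict(attrs).get("class") or "").split():
            if cls in self.class_to_font:
                font = self.class_to_font[cls]
        self._stack.append((tag, font))

    def handle_endtag(self, tag):
        # Assumes well-formed XHTML: the closing tag matches the top of the stack
        if self._stack and self._stack[-1][0] == tag:
            self._stack.pop()

    def handle_data(self, data):
        font = self._stack[-1][1] if self._stack else None
        if font is not None:
            self.used.setdefault(font, set()).update(
                ch for ch in data if not ch.isspace())


def characters_per_font(markup, class_to_font):
    """Return {font name: sorted unique characters} for the given markup."""
    collector = FontCharCollector(class_to_font)
    collector.feed(markup)
    collector.close()
    return {font: "".join(sorted(chars)) for font, chars in collector.used.items()}


if __name__ == "__main__":
    page = '<div class="fancy">AB<span class="mono">xy</span>C</div><p>zz</p>'
    print(characters_per_font(page, {"fancy": "Fancy", "mono": "Mono"}))
```

Text inside the inner `span` is counted only for its own font, while text in elements without a listed class is ignored.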
03-16-2012, 05:47 PM | #12 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Looking at those scripts, I'd rather use my Word macro...
|
03-16-2012, 06:17 PM | #13 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
I have to admit it is perhaps not the most aesthetically pleasing program to rest one's weary eyes upon... However, it might come in handy when refurbishing existing ePUBs, since it operates on XHTML files.
|
03-18-2012, 03:52 AM | #14 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Hmm, someone mentioned smallcaps. It might be an idea to also make it possible for smallcaps. I think I will enhance it further later this week.
|
03-18-2012, 06:02 AM | #15 |
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
The problem with smallcaps is you'd like them to match the normal text. If you embed a font for smallcaps and not for the rest, you'll create the possibility of having, for instance, a sans-serif text with serif smallcaps (ugh).
|