MobileRead Forums > E-Book Formats > Workshop
Old 10-06-2012, 09:22 AM   #1
Freeshadow
temp. out of service
Posts: 2,786
Karma: 24285242
Join Date: May 2010
Location: Duisburg (DE)
Device: PB 623
Working on way to subset fonts for ePub/KF3

Quote:
Originally Posted by Man Eating Duck View Post
(I already have a script somewhere which creates a list of used characters)
The legal and practical need for such a script is currently being discussed in this thread:
https://www.mobileread.com/forums/sho...49#post2240749
Old 10-06-2012, 06:56 PM   #2
JSWolf
Resident Curmudgeon
Posts: 73,510
Karma: 126422064
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by Man Eating Duck View Post
(I already have a script somewhere which creates a list of used characters)
What language does this script use? Is it something that can easily be used with Windows 7? If so, can you please post it?
Old 10-07-2012, 09:42 AM   #3
Man Eating Duck
Addict
Posts: 254
Karma: 69786
Join Date: May 2006
Location: Oslo, Norway
Device: Kobo Aura, Sony PRS-650
Quote:
Originally Posted by JSWolf View Post
What language does this script use? Is it something that can easily be used with Windows 7? If so, can you please post it?
I believe I wrote it in PHP, which can be installed on Win7. It just accepts a UTF-8 input text file and writes a character list to stdout; it's not very sophisticated. I'll see if I can find it when I go to work tomorrow. If not, I can probably rewrite it in a few minutes if you're still interested.
Old 10-08-2012, 04:29 PM   #4
Man Eating Duck
@JSWolf: It seems it was Python; it works in Python 2.7. I've probably done horrible things to the Python language, but here it is, with no guarantees about anything:
Code:
import argparse, codecs
parser = argparse.ArgumentParser(description='''This script will accept utf-8 text files and write a list of unique characters to stdout or an output file''')
parser.add_argument("file", nargs='+', help="input (utf-8) file(s) for character counting")
parser.add_argument("-o", "--outfile", help="outputfile")
args = parser.parse_args()
disallowed = set('')  # characters listed here are excluded from the result
s = set()
for name in args.file:
    # The with block closes each input file once it has been read.
    with codecs.open(name, encoding="UTF-8") as infile:
        s |= set(char for line in infile for char in line
                 if char not in disallowed)
if args.outfile:
    print 'Writing to file: ' + args.outfile
    # The with block closes the output file; no explicit close() is needed.
    with codecs.open(args.outfile, "w", "utf-8") as f:
        f.write(u''.join(s))
else:
    print u''.join(s).encode('utf-8')
usage: uniquechars.py [-h] [-o OUTFILE] file [file ...]

Use the output file option for Unicode files, as many glyphs won't show in a console.

If you have questions about the code I can try to answer, but I heard there are some guys in the calibre forum who probably have a bit more experience with Python
Old 10-08-2012, 05:46 PM   #5
Freeshadow
Python means FontForge could be fed with it, which is just what I was thinking about.
Old 10-09-2012, 08:58 AM   #6
DiapDealer
Grand Sorcerer
Posts: 27,441
Karma: 192992430
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
It'd be nice if the script ignored all characters that occur inside HTML tags. Those wouldn't necessarily need to be part of any embedded font, since they won't be rendered.
Old 10-09-2012, 09:18 AM   #7
Man Eating Duck
Quote:
Originally Posted by DiapDealer View Post
It'd be nice if the script ignored all characters that occur inside HTML tags. Those wouldn't necessarily need to be part of any embedded font, since they won't be rendered.
My (rather stupid) script expects plain UTF-8 text files. You can get those by converting an EPUB to TXT in calibre (remember to specify UTF-8 as the output encoding). Most authoring software can probably save to TXT as well. Formatting doesn't really matter as long as every character is included. This could perhaps have been more convenient, but parsing HTML is beyond my abilities, and I want those results before making an EPUB as well.

Since you might be interested only in special characters, you could add the regular characters you don't care about to disallowed = set('') in line 6, i.e.:
Code:
disallowed = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
This should exclude them from the results. Add as many as you feel like.
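Rather than typing the alphabet by hand, the same set can be built from the constants in Python's standard string module. This is only a minimal sketch, written for Python 3; the helper name special_chars is made up for illustration and is not part of the script above.

```python
import string

# Plain ASCII letters and digits: the same 62 characters as the
# hand-typed string, taken from the standard library instead.
disallowed = set(string.ascii_letters + string.digits)

def special_chars(text, excluded=disallowed):
    """Return the unique characters of text that are not in excluded."""
    return set(text) - excluded

# Print code points rather than glyphs, since a console may not show them.
found = special_chars(u"Caf\u00e9 no. 42")
print(["U+%04X" % ord(c) for c in sorted(found)])
```

Punctuation could be dropped the same way by adding string.punctuation to the set, although typographic quotes and dashes are usually exactly the characters worth checking.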

The script wasn't really intended for publication, so it's unfortunately pretty rough, and I don't really have enough experience to improve it. It works for my needs, though
Old 10-09-2012, 10:16 AM   #8
DiapDealer
Code:
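#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
# Python 2.7: list the unique characters in the text content of HTML files.

import sys, argparse, codecs
from HTMLParser import HTMLParser
from htmlentitydefs import name2codepoint
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    """Collects only the text between tags; tag names and attributes are ignored."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            # Collapse runs of whitespace into a single space.
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_entityref(self, name):
        # Named entities such as &eacute; are rendered, so include them too.
        if name in name2codepoint:
            self.__text.append(unichr(name2codepoint[name]))

    def handle_charref(self, name):
        # Numeric references such as &#233; or &#xE9;.
        try:
            code = int(name[1:], 16) if name[0] in 'xX' else int(name)
            self.__text.append(unichr(code))
        except ValueError:
            pass

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()

def main():
    parser = argparse.ArgumentParser(description='''This script will accept utf-8 text files and write a list of unique characters to stdout or an output file''')
    parser.add_argument("file", nargs='+', help="input (utf-8) file(s) for character counting")
    parser.add_argument("-o", "--outfile", help="outputfile")
    parser.add_argument("-c", "--codec", help="input char encoding")
    args = parser.parse_args()
    disallowed = set('')
    s = set()
    # Default to utf-8 when no codec is given on the command line.
    file_codec = args.codec or 'utf-8'
    for f in args.file:
        try:
            html_parser = _DeHTMLParser()
            # codecs.open decodes the input; the with block closes the file.
            with codecs.open(f, 'r', file_codec) as infile:
                html_parser.feed(infile.read())
            html_parser.close()
            s |= set(char for char in html_parser.text()
                     if char not in disallowed)
        except Exception:
            # Report the failure and carry on with the remaining files.
            print_exc(file=stderr)
    if args.outfile:
        print 'Writing to file: ' + args.outfile
        # The with block closes the output file; no explicit close() is needed.
        with codecs.open(args.outfile, 'w', file_codec) as f:
            f.write(u''.join(s))
    else:
        print u''.join(s).encode(file_codec)

if __name__ == '__main__':
    sys.exit(main())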


usage: uniquechars.py [-h] [-c CODEC] [-o OUTFILE] file [file ...]

An attempt to modify the script so that only the text of an HTML document is processed, and to allow input/output in other character encodings. The default is utf-8 if no codec is specified on the command line. I got it to work with either utf-8 or windows-1252.
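For anyone on Python 3, where the HTMLParser module has become html.parser, the same idea can be sketched as follows. This is an illustration under stated assumptions rather than a port of the script: the names TextOnly and unique_chars are invented here, and skipping the contents of script and style elements is an extra assumption about what counts as rendered text.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the characters of text nodes, skipping <script> and <style>."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True hands &eacute; or &#233; to handle_data
        # as the real character instead of as a separate reference.
        super().__init__(convert_charrefs=True)
        self.chars = set()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style is counted; tag names and
        # attribute values never reach this method.
        if not self._skip_depth:
            self.chars.update(data)

def unique_chars(html_text):
    parser = TextOnly()
    parser.feed(html_text)
    parser.close()
    return parser.chars

sample = '<p class="zzz">Caf&eacute;</p><style>q{}</style>'
print(["U+%04X" % ord(c) for c in sorted(unique_chars(sample))])
```

The sample yields only the four characters of the paragraph text: the attribute value and the style rule are left out.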
Old 10-10-2012, 04:05 AM   #9
Jellby
frumious Bandersnatch
Posts: 7,514
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
I would render the HTML in a browser, copy and paste in a text file, and extract the unique chars from there.
Old 10-10-2012, 06:25 AM   #10
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,367
Karma: 305065800
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Quote:
Originally Posted by Jellby View Post
I would render the HTML in a browser, copy and paste in a text file, and extract the unique chars from there.
That's hard to automate and is overkill. If we accidentally include one or two characters that aren't in the rendered text, that's not a problem.

Now we need something that can process a TTF or OTF file and create a sub-set of the font.
Old 10-10-2012, 07:17 AM   #11
Doitsu
Grand Sorcerer
Posts: 5,582
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
BTW, Python-challenged ebook designers could simply compile an epub with KindlePreviewer/KindleGen and have a look at the detected Unicode ranges in the log file. For example, if you compile the book mentioned in roger64's post you'll see the following output:

Code:
Info(prcgen):I1045: Computing UNICODE ranges used in the book
Info(prcgen):I1046: Found UNICODE range: Basic Latin [20..7E]
Info(prcgen):I1046: Found UNICODE range: General Punctuation - Windows 1252 [2018..201A]
Info(prcgen):I1046: Found UNICODE range: Latin-1 Supplement [A0..FF]
Info(prcgen):I1046: Found UNICODE range: General Punctuation - other than Windows 1252 [2015..2017]
Info(prcgen):I1046: Found UNICODE range: Latin Extended-A [100..17F]
Info(prcgen):I1046: Found UNICODE range: Basic Greek [370..3FF]
Info(prcgen):I1046: Found UNICODE range: Greek Extended [1F00..1FFF]
Windows users could then use the MS Font Properties Extension to check the Unicode coverage of their fonts.
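If those log lines need to be processed further, they are regular enough to pick apart with a regular expression. The sketch below assumes the log format is exactly as in the excerpt above (a range name followed by two hexadecimal code points in square brackets); the function name parse_ranges is made up for illustration.

```python
import re

# Matches lines such as:
# Info(prcgen):I1046: Found UNICODE range: Basic Latin [20..7E]
RANGE_LINE = re.compile(
    r"Found UNICODE range: (.+) \[([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\]")

def parse_ranges(log_text):
    """Return (name, first, last) tuples with the code points as integers."""
    return [(m.group(1), int(m.group(2), 16), int(m.group(3), 16))
            for m in RANGE_LINE.finditer(log_text)]

log = """Info(prcgen):I1045: Computing UNICODE ranges used in the book
Info(prcgen):I1046: Found UNICODE range: Basic Latin [20..7E]
Info(prcgen):I1046: Found UNICODE range: Basic Greek [370..3FF]"""

for name, first, last in parse_ranges(log):
    print("%s: U+%04X to U+%04X" % (name, first, last))
```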
Old 10-10-2012, 07:24 AM   #12
Jellby
Quote:
Originally Posted by pdurrant View Post
That's hard to automate and is overkill.
I agree, but it could be easier if Sigil made the list. I've just created a suggestion for this.
Old 10-10-2012, 07:49 AM   #13
pdurrant
Quote:
Originally Posted by Doitsu View Post
BTW, Python-challenged ebook designers could simply compile an epub with KindlePreviewer/KindleGen and have a look at the detected Unicode ranges in the log file.
I think that getting the ranges isn't fine-grained enough. We don't want to check that our fonts cover the characters used, but to trim the fonts to cover only the characters used. Of course, making sure that all the needed characters are in the font will be part of this.
Old 10-10-2012, 08:02 AM   #14
Man Eating Duck
Quote:
Originally Posted by pdurrant View Post
I think that getting the ranges isn't fine-grained enough. We don't want to check that our fonts cover the characters used, but to trim the fonts to cover only the characters used. Of course, making sure that all the needed characters are in the font will be part of this.
Actually, I am very interested in finding out if there are missing glyphs... Small squares in place of characters are a far larger problem than a few tens of kilobytes in file size, IMO.

I suspect that most methods of subsetting would also give you a "free" coverage check in the bargain.
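The coverage check itself is just a set difference once both sides are known. In the sketch below, the set of code points the font covers is assumed to come from whatever font tool is at hand, and is faked here with a printable-ASCII range; missing_glyphs is a hypothetical name.

```python
def missing_glyphs(used_chars, covered_codepoints):
    """Return the used characters that the font has no glyph for, sorted."""
    covered = set(covered_codepoints)
    return sorted(c for c in set(used_chars) if ord(c) not in covered)

# Hypothetical coverage: a font that only contains printable ASCII.
ascii_only = range(0x20, 0x7F)

# Print code points rather than glyphs, since a console may not show them.
for ch in missing_glyphs(u"na\u00efve \u03b1", ascii_only):
    print("missing U+%04X" % ord(ch))
```

Anything reported here would show up as a small square on the reader, so it is worth running before any subsetting is done.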
Old 10-10-2012, 08:17 AM   #15
Freeshadow
Quote:
Originally Posted by pdurrant View Post
Now we need something that can process a TTF or OTF file and create a sub-set of the font.
FontForge can, AFAIR, be controlled by Python scripts.
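A rough sketch of how the character list could be fed to FontForge's Python interface follows. It is untested and written from memory of that interface (fontforge.open, font.selection, font.clear, font.generate), so every call should be checked against the FontForge scripting documentation before use. The function names codepoints and subset_font are invented for this example, and the fontforge module is only importable where FontForge was built with Python support.

```python
def codepoints(chars):
    """Sorted, de-duplicated Unicode code points for a string of characters."""
    return sorted(set(ord(c) for c in chars))

def subset_font(src_path, dst_path, chars):
    """Write a copy of the font at src_path keeping only glyphs for chars.

    Assumption: glyphs that kept glyphs reference (for example the base
    letter and accent of a composite) are NOT preserved by this sketch.
    """
    import fontforge  # imported here so the rest runs without FontForge

    font = fontforge.open(src_path)
    font.selection.none()
    for cp in codepoints(chars):
        # "more" adds to the selection; "unicode" treats cp as a code point.
        font.selection.select(("more", "unicode"), cp)
    font.selection.invert()  # everything that is not needed is now selected
    font.clear()             # empty the selected glyphs
    font.generate(dst_path)
    font.close()

# The list produced by the uniquechars script is the natural input here.
print(codepoints(u"abca"))
```

Whether a given font's licence permits subsetting at all is a separate question from whether the script works.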
All times are GMT -4.