01-22-2011, 05:48 PM   #3
KevinH
Sigil Developer
Hi,

If you look in the _XML.zip you can actually see what the Topaz internal format looks like. The file format is xml based, and even the glyphs themselves are described by xml files. To save space, these xml files are encoded as binary data with offsets into a dictionary that is stored as the dict0000.dat file and stored as records in the Topaz file. As for reflowing, all but "fixed" regions can be reflowed.
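To make the dictionary idea concrete, here is a minimal python sketch of that kind of encoding in general. It is an illustration only and does not reproduce the real Topaz record layout: the token names, the use of 16-bit indices instead of byte offsets, and the function names are all assumptions made for the example.

```python
# Hypothetical illustration only: this is NOT the real Topaz record layout.
import struct

def build_dictionary(tokens):
    """Assign each distinct token an index, in first-seen order."""
    table = []
    index = {}
    for tok in tokens:
        if tok not in index:
            index[tok] = len(table)
            table.append(tok)
    return table, index

def encode(tokens, index):
    """Pack tokens as little-endian 16-bit dictionary indices (assumed width)."""
    return struct.pack("<%dH" % len(tokens), *(index[t] for t in tokens))

def decode(blob, table):
    """Turn the packed indices back into the original token sequence."""
    count = len(blob) // 2
    return [table[i] for i in struct.unpack("<%dH" % count, blob)]

# Made-up xml-ish tokens; repeated names cost only two bytes each once encoded.
tokens = ["page", "region", "type", "fixed", "region", "type", "text", "page"]
table, index = build_dictionary(tokens)
blob = encode(tokens, index)
assert decode(blob, table) == tokens
```

The saving comes from repeated tag and attribute names being stored once in the dictionary and referenced by a small number everywhere else.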

What I would like to do is combine the ocrText field info with the svg image of the page to create an image-based pdf file with text searching capabilities. Unfortunately, I know of no open source pdf creation software that will generate an image-based pdf with the ocrText information embedded inside it for searching.

If anyone knows of perl, python, or even C/C++ based code that can convert a sequence of svg images along with ocrText information into a searchable image-based pdf, I would love to hear about it.
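For reference, the usual way a searchable image pdf is put together is to draw the page image and then write the text over it in text rendering mode 3 (invisible), so that viewers can find and select the words without showing them. Below is a rough python sketch of that technique using only the standard library. It is not a finished converter and it makes several assumptions: the svg page has already been rasterized elsewhere into raw 8-bit grayscale pixels, one pixel is treated as one point, the text is Latin-1 only, and the (x, y, size, text) tuples stand in for whatever position information the ocrText data really provides. The function names are made up for the example.

```python
def pdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def build_searchable_page(gray_pixels, width, height, words):
    """Return PDF bytes for one page: a raw grayscale image with invisible
    text on top. `words` is a list of (x, y, size, text) tuples in points."""
    assert len(gray_pixels) == width * height
    # Scale the unit image square to the page, then start invisible text (3 Tr).
    ops = ["q %d 0 0 %d 0 0 cm /Im0 Do Q" % (width, height), "BT 3 Tr"]
    for x, y, size, text in words:
        # Tm sets an absolute position for each word.
        ops.append("/F1 %d Tf 1 0 0 1 %d %d Tm (%s) Tj"
                   % (size, x, y, pdf_escape(text)))
    ops.append("ET")
    content = "\n".join(ops).encode("latin-1")  # assumption: Latin-1 text only

    def stream(dict_body, data):
        head = b"<< " + dict_body + b" /Length %d >>\nstream\n" % len(data)
        return head + data + b"\nendstream"

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] /Contents 4 0 R "
        b"/Resources << /Font << /F1 5 0 R >> /XObject << /Im0 6 0 R >> >> >>"
        % (width, height),
        stream(b"", content),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        stream(b"/Type /XObject /Subtype /Image /Width %d /Height %d "
               b"/ColorSpace /DeviceGray /BitsPerComponent 8" % (width, height),
               bytes(gray_pixels)),
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))          # byte offset of each object for xref
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_at = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # each xref entry is exactly 20 bytes
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_at))
    return bytes(out)
```

A multi-page book would need one page object, content stream, and image per page, and real page images would want compression (for example a /FlateDecode filter), both of which are left out here to keep the sketch short.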

I do believe someone could take a regular ebook/xhtml file, use svg fonts to encode each letter (using a one-to-one mapping), create an xml file describing each page of the book plus an xml file for each glyph in the fonts, and combine them to create a Topaz book of some sort.

It would take some code to do it, but I do think it could be done. The nice thing is that the actual text of the xhtml file can be used in place of ocrText, so there should be no OCR errors.
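As a very rough python sketch of the first half of that idea (the one-to-one mapping and the per-page xml), something like the following could be a starting point. Everything about the output is hypothetical: the element and attribute names are invented for illustration and are not the real Topaz page or glyph schema, and the fixed five characters per page is just a stand-in for real pagination.

```python
import xml.etree.ElementTree as ET

def build_glyph_map(text):
    """Give every distinct character its own glyph id (one-to-one mapping)."""
    glyph_ids = {}
    for ch in text:
        glyph_ids.setdefault(ch, len(glyph_ids))
    return glyph_ids

def page_to_xml(page_number, page_text, glyph_ids):
    """Describe one page as xml. Element and attribute names are made up."""
    page = ET.Element("page", number=str(page_number))
    for position, ch in enumerate(page_text):
        ET.SubElement(page, "glyph", id=str(glyph_ids[ch]), pos=str(position))
    # The source text doubles as the search text, so no OCR step is involved.
    ET.SubElement(page, "text").text = page_text
    return ET.tostring(page, encoding="unicode")

book_text = "Topaz test"
glyph_ids = build_glyph_map(book_text)
# Stand-in pagination: five characters per page.
pages = [book_text[i:i + 5] for i in range(0, len(book_text), 5)]
for number, page_text in enumerate(pages):
    print(page_to_xml(number, page_text, glyph_ids))
```

Generating the per-glyph xml from svg font outlines and packing everything into Topaz records would be the larger part of the work and is not attempted here.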