MobileRead Forums - View Single Post

KevinH · 01-22-2011, 05:48 PM

Hi,

If you look in the _XML.zip, you can see what the Topaz internal format looks like. The file format is XML-based, and even the glyphs themselves are described by XML files. To save space, these XML files are encoded as binary data with offsets into a dictionary (stored as the dict0000.dat file) and stored as records in the Topaz file. As for reflowing, all but "fixed" regions can be reflowed.
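
To make the dictionary idea concrete, here is a toy Python sketch of encoding repeated XML names as offsets into a shared string dictionary. This is only an illustration of the general technique: the length-prefixed entries, the 2-byte offsets, and all the function names are my own assumptions and are not the real Topaz or dict0000.dat layout.

```python
# Toy illustration only: this is NOT the real Topaz record layout.
import struct

def build_dictionary(strings):
    """Pack unique strings into one blob; return the blob and each string's offset."""
    blob = bytearray()
    offsets = {}
    for s in strings:
        if s not in offsets:
            offsets[s] = len(blob)
            data = s.encode("utf-8")
            # Assumed layout: 1-byte length prefix followed by the string bytes.
            blob += struct.pack("B", len(data)) + data
    return bytes(blob), offsets

def encode_tokens(tokens, offsets):
    """Replace each token with a 2-byte big-endian offset into the dictionary."""
    return b"".join(struct.pack(">H", offsets[t]) for t in tokens)

def decode_tokens(encoded, blob):
    """Reverse encode_tokens by looking each offset up in the dictionary blob."""
    out = []
    for (off,) in struct.iter_unpack(">H", encoded):
        length = blob[off]
        out.append(blob[off + 1 : off + 1 + length].decode("utf-8"))
    return out

# Hypothetical stream of tag names as they might repeat across a page description.
tokens = ["page", "region", "word", "word", "region", "word", "page"]
blob, offsets = build_dictionary(tokens)
encoded = encode_tokens(tokens, offsets)
decoded = decode_tokens(encoded, blob)
```

The saving comes from each repeated name costing only two bytes in the encoded stream, while the name itself is stored once in the dictionary.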

What I would like to do is combine the ocrText field info with the SVG image of each page to create an image-based PDF file with text searching capabilities. Unfortunately, I know of no open source PDF creation software that will generate an image-based PDF with the ocrText information embedded inside it for searching.

If anyone knows of Perl, Python, or even C/C++ code that can convert a sequence of SVG images along with ocrText information into a searchable image-based PDF, I would love to hear about it.

I do believe someone could take a regular ebook/XHTML file, use SVG fonts to encode each letter (using a one-to-one mapping), create an XML file describing each page of the book plus an XML file for each glyph in the fonts, and combine them to create a Topaz book of some sort.

It would take some code to do it, but I do think it could be done. The nice thing is that the actual text of the XHTML file can be used in place of the ocrText, so there should be no errors.
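
A rough Python sketch of the one-to-one mapping idea, just to show the shape of it. The element and attribute names here (page, word, glyph, text, id) are made up for illustration and are not the real Topaz page schema, and the step of generating the per-glyph SVG/XML files is left out entirely.

```python
# Sketch under assumptions: element names are hypothetical, not the Topaz schema.
import xml.etree.ElementTree as ET

def build_glyph_map(text):
    """Assign one glyph id per distinct non-whitespace character (one-to-one)."""
    glyph_ids = {}
    for ch in text:
        if not ch.isspace() and ch not in glyph_ids:
            glyph_ids[ch] = len(glyph_ids)
    return glyph_ids

def page_to_xml(page_text, glyph_ids):
    """Describe a page as words, each carrying its source text and glyph ids."""
    page = ET.Element("page")
    for word in page_text.split():
        # The source text is stored directly, standing in for the ocrText field.
        w = ET.SubElement(page, "word", text=word)
        for ch in word:
            ET.SubElement(w, "glyph", id=str(glyph_ids[ch]))
    return page

text = "the cat sat"
glyph_ids = build_glyph_map(text)
page = page_to_xml(text, glyph_ids)
xml_string = ET.tostring(page, encoding="unicode")
```

Because the text attribute comes straight from the source file rather than from OCR, the searchable text matches the glyphs exactly.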
