08-15-2014, 07:49 AM   #56
shotsky
Quote:
Originally Posted by DiapDealer
But it doesn't directly edit the markup in an azw3. Sure, it's not doing full-blown conversion, but the raw markup in a binary azw3 file is not standard (x)HTML. There are no filenames, for instance, and the links are offsets, and there are proprietary elements/attributes. Those file names you see in the editor are invented by calibre when it extracts the raw markup and massages the proprietary portions into something that can be easily edited by the user. Then it compiles that back into the kindlebook's proprietary binary database format. You are very much editing an intermediate format.
In my case, I receive ebooks in epub, azw3, mobi, and a few other formats. My goal is to open the ebook, extract the html, merge it into a single file, and eventually convert that into a text file, with links to photos, that goes into another program. I don't MAKE ebooks; I receive ebooks and convert them to text, all the while handling the Unicode characters by converting them to the ANSI range when necessary. (The single-character fraction ⅓ is converted to the three characters 1, /, and 3. And those games people play with sub and super to invent fractions are also converted to normal three-character fractions.) It is easy to handle characters, but in addition to that, I also need to identify all the text items on each page.
The problem with using Calibre to make an htmlz file, which DOES put everything into a single html file, is that it renames classes and elements in an unexpected manner, which causes the original meaning of the text that follows to be lost. For example, the original html may give one number the class 'numerator' and another the class 'denominator'. After Calibre converts it, they no longer say 'numerator' and 'denominator', but calibre59 and calibre60. It is no longer possible to know that it is a fraction.
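To make the fraction handling described above concrete, here is a rough Python sketch; it is not my actual code, just an illustration. It assumes the invented fractions are written with plain `<sup>` and `<sub>` tags around digits (books that use spans with classes such as 'numerator' would need a different pattern), and the function names are made up for the example.

```python
import re
import unicodedata

FRACTION_SLASH = "\u2044"  # the slash that NFKC produces for characters like U+2153

# Assumed markup shape: <sup>1</sup>/<sub>3</sub>, with "/" or U+2044 between the tags.
SUP_SUB_FRACTION = re.compile(
    r"<sup\b[^>]*>\s*(\d+)\s*</sup>\s*[/\u2044]\s*<sub\b[^>]*>\s*(\d+)\s*</sub>",
    re.IGNORECASE,
)


def expand_vulgar_fractions(text: str) -> str:
    """Replace single-character fractions with digit, slash, digit."""
    out = []
    for ch in text:
        if unicodedata.decomposition(ch).startswith("<fraction>"):
            plain = unicodedata.normalize("NFKC", ch).replace(FRACTION_SLASH, "/")
            # Insert a space so a mixed number like "2½" does not become "21/2".
            if out and out[-1][-1:].isdigit():
                out.append(" ")
            out.append(plain)
        else:
            out.append(ch)
    return "".join(out)


def flatten_markup_fractions(html: str) -> str:
    """Collapse sup/sub fraction markup into a plain three-character fraction."""
    return SUP_SUB_FRACTION.sub(r"\1/\2", html)


def normalize_fractions(html: str) -> str:
    """Apply both steps: markup fractions first, then single characters."""
    return expand_vulgar_fractions(flatten_markup_fractions(html))


print(normalize_fractions("add \u2153 cup"))  # add 1/3 cup
```

Checking the Unicode decomposition tag, rather than keeping a hand-written table, means every single-character fraction is picked up without listing them one by one.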
My current solution is to run Calibre in debug mode and use the resulting Input folder for all of my work. That is all original html, untouched by calibre's renaming of classes. For epubs, that means all the pages are separate files, but the mobis are all in one file. I don't know why there is a difference, but my code rebuilds the epub pages into a single html file, stripping all headers and replacing all the html code with html5 code. So my goal of having a single html file without renamed classes is achieved, and my software proceeds to dissect the content using whatever classes and/or other tags were originally provided. It is an AWFUL lot of work to do all that, when all I wanted in the first place was the original html in a single html file, such as the htmlz convert creates, only without all the renaming. If the htmlz convert worked off the files in the input folder resulting from debug, it would be perfect. But it works off the 'processed' folder, which is substantially different and, in my case, unusable.
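For anyone wanting to try the same thing, the merge step might look roughly like the Python sketch below. It is a simplification under stated assumptions, not my real code: it pulls the body out of each page with a regular expression instead of a real parser, it leaves every class name alone, and `merge_folder` simply sorts by file name, which is only a guess at reading order (in an epub the true order comes from the spine in the OPF file). The function names, the `*.html` pattern, and the default title are all invented for the example.

```python
import html
import re
from pathlib import Path

# Everything between <body ...> and </body>; pages without a body are used whole.
BODY = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def body_of(markup: str) -> str:
    """Return the inner body of one page, dropping its head section."""
    match = BODY.search(markup)
    return match.group(1).strip() if match else markup.strip()


def merge_pages(pages, title="Merged book"):
    """Join page bodies, in the order given, under one html5 skeleton."""
    parts = [body_of(page) for page in pages]
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        + "\n".join(parts)
        + "\n</body>\n</html>\n"
    )


def merge_folder(folder, pattern="*.html"):
    """Merge every matching file under folder, sorted by path (an assumption)."""
    files = sorted(Path(folder).rglob(pattern))
    return merge_pages(f.read_text(encoding="utf-8") for f in files)
```

Because nothing inside the body is rewritten, a class such as 'numerator' comes out of the merge exactly as it went in, which is the whole point.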