|12-05-2010, 08:03 PM||#1|
HTML2Mobi and Windows 1252 encoding
Is there a way to make HTML2Mobi work correctly with source files in Windows 1252 (aka Extended ASCII or CP-1252)?
With characters from 128 on up, its output is quite wrong for Mobi reader on platforms that do not support Unicode but can handle Extended ASCII. It replaces the original character with a different one and adds more undesired characters next to it.
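That symptom (one wrong character plus extra junk beside it) is what you would see if the output held UTF-8 bytes and the reader decoded them as Windows 1252. That cause is an assumption, not something confirmed for HTML2Mobi. A minimal Python sketch of the effect, with show_as_cp1252 as a made-up helper name:

```python
# Assumption: the converter emits UTF-8 bytes and the reader decodes them as Windows-1252.
def show_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then decode the bytes the way a CP-1252-only reader would."""
    # errors="replace" covers the few byte values CP-1252 leaves undefined (e.g. 0x9D).
    return text.encode("utf-8").decode("cp1252", errors="replace")

for ch in ["\u00e9", "\u201c", "\u2019"]:  # e-acute, left double quote, right single quote
    print(repr(ch), "->", repr(show_as_cp1252(ch)))
```

Each single character comes back as two or three characters, which matches the "different character plus extras" pattern.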
See the attached text file for a list of UTF-8 codes and their corresponding ASCII characters. Some characters have two UTF-8 codes for the same character. I don't have all of those in the list yet; I'll add them later, along with copies of the UTF-8 codes with leading zeros to pad the 2 and 3 digit codes to 4 digits.
I've left a few non-printable ones out of the list and used the nbsp code for the non-breaking space.
I presume HTML2Mobi can handle UTF-8 codes both with and without leading zeros? In the HTML files I've converted I've rarely run into leading zeros used with two or three digit codes. (So much for being "standard": the rule ought to be that the zeros must be used or must not be used, not eh, whatever you feel like using, or not using.)
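As a point of comparison, and assuming the "UTF-8 codes" in the list are HTML numeric character references such as &#8220;, here is how Python's standard html module treats padded and unpadded forms. It says nothing definite about what HTML2Mobi itself does.

```python
import html

# The same reference written with and without zero padding.
refs = ["&#233;", "&#0233;", "&#00233;"]
decoded = [html.unescape(r) for r in refs]
print(decoded)

# Two different references that decode to the same left double quote:
# 147 is the Windows-1252 position, 8220 is the Unicode code point.
same_quote = html.unescape("&#147;") == html.unescape("&#8220;")
print(same_quote)
```

In this parser the padding makes no difference, and the 128 to 159 range is mapped to the Windows 1252 characters, which would explain why some characters have two codes.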
Mobipocket Creator for Windows does handle Extended ASCII perfectly. What it does NOT do is convert UTF-8 codes to ASCII. It converts them to single Unicode characters, which of course makes for a big fat mess on Palm OS or any other non-Unicode platform. Since Mobi seems to have abandoned further development on their software, it's highly unlikely a function to convert from UTF-8 to ASCII in the output .prc file will ever get added. Heck, they haven't updated it to use the .mobi file name extension. Nor has it been updated to support the longer file names (over 32 characters) and the additional characters, like spaces, allowed in file names on Palm OS 5.
What would be extremely nice to have is an HTMLtoMobi converter with an option to replace any UTF-8 codes in the source with the equivalent ASCII character.
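A rough sketch of what such an option could do, again assuming the codes are HTML character references. The names refs_to_cp1252, REF and KEEP are made up for illustration; only html.unescape, re and the cp1252 codec come from Python's standard library, and none of this is HTML2Mobi's actual behavior.

```python
import html
import re

# Matches numeric references (&#8220; or &#x201C;) and named ones (&rdquo;).
REF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Assumption: markup-significant characters and nbsp are better left as references.
KEEP = {"<", ">", "&", '"', "'", "\u00a0"}

def refs_to_cp1252(text: str) -> str:
    """Replace each character reference with its single CP-1252 character when one exists."""
    def swap(match: re.Match) -> str:
        ref = match.group(0)
        char = html.unescape(ref)
        if char == ref or len(char) != 1 or char in KEEP:
            return ref  # unknown reference, multi-character result, or deliberately kept
        try:
            char.encode("cp1252")
        except UnicodeEncodeError:
            return ref  # no CP-1252 equivalent, so leave the reference alone
        return char
    return REF.sub(swap, text)

sample = "&#8220;Caf&#233;&#8221; &amp; &#0151; &#945;"
print(refs_to_cp1252(sample))
```

The result would then be written out with encoding="cp1252" so each replaced character takes a single byte. References with no Windows 1252 equivalent (the Greek alpha in the sample) stay as they were.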
A person in the UK used Visual Studio 2010 to put together a text file string replacer, which does work for replacing UTF-8 codes (or other text strings) with any other text. That's what I made the attached list for. Unfortunately it still has a bit of a bug where it can't use too long a list. It writes to the replacement list on exit, and if the list is too long it replaces characters from 128 on up with some unprintable character and truncates the number of pairs of lines. I do have the source for the program but don't have Visual Studio 2010. Once that's fixed, an "all bases covered" replacement list will make it an easy one step process to replace UTF-8 codes with ASCII characters, in batches of files, saving heaping piles of time and tedium.
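For anyone without Visual Studio, the same list-driven approach can be sketched in a few lines of Python. This is not the UK author's program. The list layout (search text on one line, replacement on the next) is a guess based on the "pairs of lines" above, and load_pairs, apply_pairs and convert_folder are made-up names.

```python
from pathlib import Path

def load_pairs(list_text: str) -> list[tuple[str, str]]:
    """Turn alternating search/replacement lines into (search, replacement) pairs."""
    lines = list_text.splitlines()
    if len(lines) % 2:
        raise ValueError("replacement list must have an even number of lines")
    return list(zip(lines[0::2], lines[1::2]))

def apply_pairs(text: str, pairs: list[tuple[str, str]]) -> str:
    """Apply every replacement in list order; there is no limit on list length."""
    for old, new in pairs:
        if old:  # skip blank search lines, which would otherwise match everywhere
            text = text.replace(old, new)
    return text

def convert_folder(src_dir: str, out_dir: str, list_file: str) -> None:
    """Batch-convert every .html file in src_dir, writing results to out_dir."""
    # Naming the encoding explicitly keeps characters from 128 on up intact.
    pairs = load_pairs(Path(list_file).read_text(encoding="cp1252"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.html"):
        source = path.read_text(encoding="cp1252")
        (out / path.name).write_text(apply_pairs(source, pairs), encoding="cp1252")
```

The list is only read, never rewritten, so the truncation on exit can't happen here.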
With a limited length list of UTF-8 codes and characters to swap, it works perfectly to prepare HTML files for conversion for non-Unicode platforms for Mobi reader. What I'm going to take a stab at, after getting VS2010, is fixing the bugs with the list so I can put in all the UTF-8 codes that have corresponding Windows 1252 codepage characters. Then it'll be able to do the pre-conversion on any English language UTF-8 HTML file.
That's what I'll have to do unless/until someone makes a program that can go straight from UTF-8 HTML to a Mobi file with no Unicode characters.
P.S. Before I switched from TealDoc to Mobi, I had to do the same process, but was using Wordpad to do replacements with search and replace. Notepad chokes when there's more than a hundred or so instances to replace. Some books can have 1,000 or more uses of quotation marks. Replacing UTF-8 codes with single characters can cut the file's size down considerably.
Last edited by bizzybody; 12-05-2010 at 08:07 PM.