#1
Junior Member
Posts: 8
Karma: 10
Join Date: Feb 2006
Bug: UTF-16 (Output Encoding) error?
To whom it may concern,
If I use UTF-16, UTF-16BE, or UTF-16LE as the "Output Encoding", the resulting PDB file is generated successfully, but it is unusable. Is this a bug?
FZ
#2
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
No, AFAIK Plucker only handles 8-bit and some variable-length encodings. The list in the Sunrise interface shows all encodings supported by the Java installation, but not necessarily by Plucker.
#3
Junior Member
Posts: 8
Karma: 10
Join Date: Feb 2006
How to unpluck UTF-16-encoded PDB files?
Thanks,
OK, if I use UTF-16 as the "Output Encoding", how can I unpluck the generated PDB file? For example, if ASCII is used, the unpluck code (in C) looks like this:

    if( buffer[ *position ] != '\0' ) // to test the end of the text string
    {
        item_type = IT_CHAR;
        the_next_char = buffer[ *position ];
        *position += 1;
    }
    else
    {
        *position += 1;
        code = buffer[ *position ];
        size = ( buffer[ *position ] & 0x07 );
        *position += 1;
        ............
    }

If UTF-16 is used, should I use

    if( buffer[ *position ] != '\0' && buffer[ (*position)+1 ] != '\0' )

(I think not, but I don't know how to test for the end of the text string.) Any suggestions?
Thanks again,
FZ
#4
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
I'm not familiar with unpluck, so I can't help you there. The Plucker spec itself says the encoding should be 8-bit, however, so you shouldn't use 16-bit encodings anyway.
Why not use UTF-8 encoding? A single NUL byte really does mark the end of a text stream in that case.
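
To illustrate the point about NUL bytes, here is a minimal C sketch (not taken from unpluck or any Plucker tool; the function and array names are made up for this example). It applies the same byte-wise zero test as the snippet quoted earlier to the text "AB" stored once as UTF-8 and once as UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset of the first 0x00 byte in buf, or len if there is none.
   This mirrors a byte-wise end-of-text test. (Hypothetical helper.) */
size_t first_nul_offset(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        if (buf[i] == 0x00)
            return i;
    }
    return len;
}

/* The text "AB" plus a terminator, in two encodings. */
const unsigned char text_utf8[]    = { 0x41, 0x42, 0x00 };
const unsigned char text_utf16le[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };
```

With UTF-8, the scan stops at offset 2, the real terminator, because every byte of a multi-byte UTF-8 sequence is 0x80 or higher. With UTF-16LE, it stops at offset 1, in the middle of the letter "A", which is why a parser that treats a zero byte as special cannot read UTF-16 text without changes.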
#5
Junior Member
Posts: 8
Karma: 10
Join Date: Feb 2006
Multibyte charset
Yes, UTF-8 is OK. But when encoding GB2312/JIS/KOR text, UTF-16 results in a smaller (approx. 30%) PDB file than UTF-8.
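
The arithmetic behind that figure can be sketched in C as follows. The helper names are invented for this example, and the numbers describe raw, uncompressed text only; the size of an actual compressed PDB file may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Bytes needed to encode one Unicode code point in UTF-16. */
size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return (cp < 0x10000) ? 2 : 4;
}
```

A common CJK character such as U+4E2D takes 3 bytes in UTF-8 and 2 bytes in UTF-16, so pure CJK text is about one third smaller in UTF-16. For ASCII characters the relation is reversed (1 byte versus 2), so markup-heavy or mixed-language documents save less.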
#6
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Why do you want to use Unicode anyway? AFAIK, there is no Plucker viewer that can handle UTF-8 or UTF-16. Nor does Sunrise add any encoding information to the Plucker document itself. (There should be an MIBenum in the document's meta record, but this field is never actually written.)
#7
Junior Member
Posts: 8
Karma: 10
Join Date: Feb 2006
Unicode is good for Chinese/Japanese/Korean
The reason I like Unicode is that, AFAIK, the Windows Mobile API only supports wide characters. So for Chinese/Japanese/Korean, after uncompressing the text in the PDB file (whose charset may be GB2312/GBK/BIG5/JIS...), I must first convert the text strings into Unicode. If the HTML file is very large, the conversion (MultiByteToWideChar) may take many seconds, which is too slow. If the PDB file is encoded in UTF-16 instead, the resulting file is as small as with GB/BIG5..., while reading it is almost as fast as ASCII. Perfect!
Yes, no Plucker viewer supports this feature. But I think it is not difficult to modify an open source viewer such as Vade-Mecum.
YOU SAID: (There should be an MIBenum in the document's meta record, but this field is never actually written.)
Plucker Desktop writes the MIBenum in the document's meta record if the original HTML file has the meta charset information. I tested this. Sunrise should write it too, because it is very important for non-English users, especially Asian users.
Last edited by trfzwsc; 02-15-2006 at 11:29 PM.
#8
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
I wish Plucker supported Unicode! That would solve the whole character encoding issue. At any rate, UTF-8 would be the way to go, since it works well with standard string functions that assume NUL-terminated sequences and, more importantly, does not require a change to the Plucker specification. IIRC, MultiByteToWideChar() does not support UTF-8 under Windows Mobile, so you'd have to write something yourself. Very easy to do, however.
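
As a rough idea of what "write something yourself" could look like, here is a minimal C sketch of a UTF-8 to UTF-16 converter. It is not code from Sunrise, Plucker, or any viewer; the function name and the buffer-size convention are assumptions made for this example, and it deliberately skips some validation (overlong forms and encoded surrogates are not rejected).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes a NUL-terminated UTF-8 string into UTF-16 code units.
   Writes at most dst_cap units, including the terminating 0, and returns
   the number of units written, excluding the terminator.
   Malformed or truncated sequences become U+FFFD. (Hypothetical helper.) */
size_t utf8_to_utf16(const unsigned char *src, uint16_t *dst, size_t dst_cap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_cap > 0);

    while (*src != 0) {
        uint32_t cp;
        int extra;                       /* continuation bytes expected */
        unsigned char b = *src++;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }

        while (extra > 0) {
            /* A non-continuation byte (including the terminator) ends the
               sequence early; it is left in place for the outer loop. */
            if ((*src & 0xC0) != 0x80) { cp = 0xFFFD; break; }
            cp = (cp << 6) | (uint32_t)(*src++ & 0x3F);
            extra--;
        }
        if (cp > 0x10FFFF)
            cp = 0xFFFD;

        if (cp >= 0x10000) {
            if (n + 2 >= dst_cap)        /* need room for pair + terminator */
                break;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= dst_cap)        /* need room for unit + terminator */
                break;
            dst[n++] = (uint16_t)cp;
        }
    }
    dst[n] = 0;
    return n;
}
```

A caller would pass the uncompressed record text and an output buffer with one slot reserved for the terminator. Characters outside the Basic Multilingual Plane come out as surrogate pairs, which is what wide-character APIs on Windows expect.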
#9
Vade-Mecum developer
Posts: 6
Karma: 10
Join Date: Apr 2006
Location: Saint-Petersburg, Russia
Device: LOOX 420, Palm m500
WM does support UTF-8, actually
Plucker DB supports any 8-bit and 7-bit encoding. Unfortunately it, too, falls into the pit of confusing encodings with charsets, but that doesn't stop it from supporting any "8-bit" charset and UTF-8. Furthermore, a Plucker DB can have different charsets for different (sequences of) records. This is achieved by using the ExceptionalCharset metadata fields.
I've added support for charsets to the recent Vade-Mecum (0.6.6). It understands any charset for which WinCE on the target device has a corresponding code page. This includes UTF-8 and UTF-7, which *are* supported at least in WM2003. Different charsets for different records are supported too.
What's missing is support for charsets in sequences of "linked" records (those having "Click here for the next/previous part" links), since the reference (Python) distiller attaches an ExceptionalCharset metadata block only to the first record of such sequences. It does the same for sequences of continued records, but that's easily handled. Handling this behaviour for linked records is hell, but I'm working on it.
If you wish, I can help you handle charset issues in Sunrise XP, since I have some understanding of how this is implemented in the reference distiller. And at the least you can grab the source of VM and read this: vim-charsets.html
I think we should move to e-mail to speed up the effort. You can mail me at flatworm{}users.sourceforge.net and khomoutov{}gmail.com
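
For readers wondering how a viewer might get from a charset identifier to a Windows code page, here is a small C sketch. It is not Vade-Mecum's code: the struct, table, and function names are hypothetical, and the table is only a short illustrative subset. It assumes the document identifies its charset by IANA MIBenum, as mentioned earlier in the thread, and maps that number to a Windows code page identifier.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the lookup table (hypothetical type for this example). */
struct charset_map {
    unsigned mibenum;    /* IANA MIBenum of the charset */
    unsigned codepage;   /* Windows code page identifier */
};

/* Illustrative subset only; a real viewer would need many more rows. */
static const struct charset_map CHARSET_TABLE[] = {
    {    3, 20127 },  /* US-ASCII */
    {    4, 28591 },  /* ISO-8859-1 */
    {   17,   932 },  /* Shift_JIS */
    {  106, 65001 },  /* UTF-8 */
    {  113,   936 },  /* GBK */
    { 1012, 65000 },  /* UTF-7 */
    { 2026,   950 },  /* Big5 */
    { 2252,  1252 },  /* windows-1252 */
};

/* Looks up the code page for a MIBenum.
   Returns 1 and stores the code page on success, 0 if the MIBenum is
   not in the table (the output is left untouched in that case). */
int mibenum_to_codepage(unsigned mibenum, unsigned *codepage_out)
{
    size_t i;

    assert(codepage_out != NULL);
    for (i = 0; i < sizeof CHARSET_TABLE / sizeof CHARSET_TABLE[0]; i++) {
        if (CHARSET_TABLE[i].mibenum == mibenum) {
            *codepage_out = CHARSET_TABLE[i].codepage;
            return 1;
        }
    }
    return 0;
}
```

A found code page still has to be installed on the device before the system conversion routine can use it, so a viewer would presumably fall back to a default when either the lookup or the conversion fails.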
#10
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Thanks for your detailed explanation, kostix. However, I won't get round to adding support for character sets to Sunrise XP anytime soon. I have this feature planned for v2.1, but that's at least several months away. (In the worst case, I might not get to it at all.)