Bug: UTF-16 (Output Encoding) error?

trfzwsc · 02-14-2006, 04:58 AM

To whom it may concern,

If I use UTF-16, UTF-16BE, or UTF-16LE as the "Output Encoding",
the resulting PDB file is generated successfully, but it is unusable.
Is this a bug?

FZ

Laurens · 02-14-2006, 05:26 AM

No, AFAIK Plucker only handles 8-bit and some variable-length encodings. The list in the Sunrise interface shows all encodings supported by the Java installation, but not necessarily by Plucker.

trfzwsc · 02-14-2006, 09:54 PM

Thanks,

OK, if I use UTF-16 as the "Output Encoding", how can I unpluck the generated PDB file?

For example, when ASCII is used, unplucking looks like this (in C):

if( buffer[ *position ] != '\0' ) // to test the end of the text string
{
    item_type = IT_CHAR;
    the_next_char = buffer[ *position ];
    *position += 1;
}
else
{
    *position += 1;

    code = buffer[ *position ];
    size = ( buffer[ *position ] & 0x07 );

    *position += 1;

    ............
}

If UTF-16 is used, should I use

if( buffer[ *position ] != '\0' && buffer[ (*position)+1 ] != '\0' )

(I think not, but I don't know how to test for the end of the text string.)

Any suggestions?

Thanks again,
FZ

Laurens · 02-15-2006, 03:29 AM

I'm not familiar with unpluck, so I can't help you there. The Plucker spec itself says the encoding should be 8-bit, however, so you shouldn't use 16-bit encodings anyway.

Why not use UTF-8 encoding? A single NUL byte really does mark the end of a text stream in that case.
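
A minimal sketch of why the byte-wise test behaves differently for the two encodings. UTF-8 never produces a 0x00 byte except for U+0000 itself, whereas UTF-16 produces one for every ASCII character. The helper `count_zero_bytes` and the sample arrays are illustrative names, not part of unpluck or the Plucker sources.

```c
#include <assert.h>
#include <stddef.h>

/* Count the 0x00 bytes in an encoded buffer. A byte-oriented parser
 * that treats 0x00 as a marker will trip over each one of them. */
size_t count_zero_bytes(const unsigned char *buf, size_t len)
{
    size_t zeros = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x00)
            zeros++;
    }
    return zeros;
}

/* The two characters 'A' (U+0041) and U+4E2D, encoded by hand. */
const unsigned char sample_utf8[]    = { 0x41, 0xE4, 0xB8, 0xAD };
const unsigned char sample_utf16le[] = { 0x41, 0x00, 0x2D, 0x4E };
```

Here `count_zero_bytes(sample_utf8, sizeof sample_utf8)` finds no zero bytes, while the UTF-16LE sample contains one (the high byte of 'A'), which a parser like the one quoted above would misread as a marker byte.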

trfzwsc · 02-15-2006, 05:58 AM

Yes, UTF-8 is OK. But when the source text is GB2312/JIS/KOR, UTF-16 results in a smaller (approx. 30%) PDB file than UTF-8.
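
A rough sketch of where a figure of that order could come from, looking only at raw encoded size before any compression. The function names `utf8_len` and `utf16_len` are illustrative; the byte counts follow from the definitions of the two encoding forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;   /* rest of the BMP, incl. most CJK */
    return 4;
}

/* Bytes needed to encode one Unicode code point in UTF-16. */
size_t utf16_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;  /* one code unit, or a surrogate pair */
}
```

A typical CJK ideograph such as U+4E2D takes 3 bytes in UTF-8 and 2 in UTF-16, so pure CJK text is about one third smaller in UTF-16. ASCII characters go the other way (1 byte versus 2), so the actual saving depends on how much ASCII the document contains and on whatever compression is applied afterwards.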

Laurens · 02-15-2006, 06:43 AM

Why do you want to use Unicode anyway? AFAIK, there is no Plucker viewer that can handle UTF-8 or UTF-16. Nor does Sunrise add any encoding information to the Plucker document itself. (There should be an MIBenum in the document's meta record, but this field is never actually written.)

trfzwsc · 02-16-2006, 12:24 AM

The reason I like Unicode is that, AFAIK, the Windows Mobile API only supports wide characters. So for Chinese/Japanese/Korean, after uncompressing the text in a PDB file (whose charset may be GB2312/GBK/BIG5/JIS......), I must first convert the text strings to Unicode. If the HTML file is very large, the conversion (MultiByteToWideChar) may take many seconds, which is too slow. If the PDB file is encoded in UTF-16 instead, the resulting file is as small as with GB/BIG5..., while the speed is almost as fast as ASCII. Perfect!

Yes, no Plucker viewer supports this feature. But I think it would not be difficult to modify an open-source viewer such as Vade Mecum.

YOU SAID: (There should be an MIBenum in the document's meta record, but this field is never actually written.)

Plucker Desktop writes the MIBenum to the document's meta record if the original HTML file has meta charset information. I tested this.
Sunrise should write it too, because it is very important for non-English users, especially Asian users.

Laurens · 02-16-2006, 04:28 AM

I wish Plucker supported Unicode! That would solve the whole character encoding issue. At any rate, UTF-8 would be the way to go, since this works well with standard string functions that assume NUL-terminated sequences and, more importantly, does not require a change to the Plucker specification. IIRC, MultiByteToWideChar() does not support UTF-8 under Windows Mobile, so you'd have to write something yourself. Very easy to do, however.
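
If a hand-written converter is needed, it could look roughly like the sketch below. This is a minimal, hedged example and not code from any Plucker viewer: the name `utf8_to_utf16` and its error convention are assumptions, and it does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into UTF-16 code units.
 * Returns the number of units written (no terminator is added), or
 * (size_t)-1 if the input is malformed or dst is too small.
 * Note: overlong encodings are not rejected in this sketch. */
size_t utf8_to_utf16(const unsigned char *src, uint16_t *dst, size_t dst_cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    while (*src != 0x00) {
        unsigned char b = *src++;
        uint32_t cp;
        int extra;                       /* continuation bytes expected */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;          /* stray continuation or bad lead */

        for (int i = 0; i < extra; i++) {
            if ((*src & 0xC0) != 0x80)   /* also catches a premature NUL */
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*src++ & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;           /* out of range or a surrogate */

        if (cp < 0x10000) {
            if (out + 1 > dst_cap) return (size_t)-1;
            dst[out++] = (uint16_t)cp;
        } else {                         /* encode as a surrogate pair */
            if (out + 2 > dst_cap) return (size_t)-1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

The loop relies on the property discussed above: the only 0x00 byte in a UTF-8 string is the terminator, so `*src != 0x00` is a safe end-of-string test.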

kostix · 04-02-2006, 08:58 AM

Quote:

Originally Posted by Laurens

I wish Plucker supported Unicode! That would solve the whole character encoding issue. At any rate, UTF-8 would be the way to go, since this works well with standard string functions that assume NUL-terminated sequences and, more importantly, does not require a change to the Plucker specification. IIRC, MultiByteToWideChar() does not support UTF-8 under Windows Mobile, so you'd have to write something yourself. Very easy to do, however.

I think there's no point in solving encoding issues *that* way.
Plucker DB supports any 8-bit and 7-bit encoding. Unfortunately it, too, falls into the trap of confusing encodings with charsets, but that doesn't stop it from supporting any "8-bit" charset and UTF-8. Furthermore, a Plucker DB can have different charsets for different (sequences of) records. This is achieved by using the ExceptionalCharset metadata fields.

I've added support for charsets to the recent Vade-Mecum (0.6.6). It understands any charset for which WinCE on the target device has a corresponding code page. This includes UTF-8 and UTF-7, which *are* supported at least in WM2003. Different charsets for different records are supported too.

What's missing is support for charsets in sequences of "linked" records (those having "Click here for the next/previous part" links), since the reference (Python) distiller attaches an ExceptionalCharset metadata block only to the first record of such a sequence. It does the same for sequences of continued records, but that's easily handled. Handling this behaviour for linked records is hell, but I'm working on it.

If you wish, I can help you handle charset issues in Sunrise XP, since I have some understanding of how this is implemented in the reference distiller. At the very least, you can grab the VM source and read this:
vim-charsets.html

I think we should move to e-mail to speed up the efforts.
You can mail me at flatworm{}users.sourceforge.net
and khomoutov{}gmail.com

Laurens · 04-09-2006, 02:47 PM

Thanks for your detailed explanation, kostix. However, I won't get round to adding support for character sets to Sunrise XP anytime soon. I have this feature planned for v2.1, but that's at least several months away. (In the worst case, I might not get to it at all.)

