Old 02-14-2006, 03:58 AM   #1
trfzwsc
Junior Member
trfzwsc began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Feb 2006
Bug: UTF-16 (Output Encoding) error?

To whom it may concern,

If I select UTF-16, UTF-16BE, or UTF-16LE as the "Output Encoding",
the resulting PDB file is generated successfully, but it is unusable.
Is this a bug?

FZ
Old 02-14-2006, 04:26 AM   #2
Laurens
Jah Blessed
Laurens is no ebook tyro.
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
No, AFAIK Plucker only handles 8-bit and some variable-length encodings. The list in the Sunrise interface shows all encodings supported by the Java installation, but not necessarily by Plucker.
Old 02-14-2006, 08:54 PM   #3
trfzwsc
Junior Member
trfzwsc began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Feb 2006
How to unpluck UTF-16-encoded PDB files?

Thanks,

OK, if I use UTF-16 as "Output Encoding", how can I unpluck the generated PDB file?

For example, when ASCII is used, the unpluck code (in C) does this:

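if( buffer[ *position ] != '\0' )   // a zero byte marks the end of the text string
{
    // ordinary character: hand it back and advance by one byte
    item_type = IT_CHAR;
    the_next_char = buffer[ *position ];
    *position += 1;
}
else
{
    *position += 1;                         // skip the zero byte

    code = buffer[ *position ];             // the byte after the zero byte
    size = ( buffer[ *position ] & 0x07 );  // low 3 bits of that same byte

    *position += 1;

    ............
}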

If UTF-16 is used, should I use

if( buffer[ *position ] != '\0' && buffer[ (*position)+1 ] != '\0' )

(I think not, but I don't know how to test for the end of the text string.)

Any suggestions?

Thanks again,
FZ
Old 02-15-2006, 02:29 AM   #4
Laurens
Jah Blessed
Laurens is no ebook tyro.
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
I'm not familiar with unpluck, so I can't help you there. The Plucker spec itself says the encoding should be 8-bit, however, so you shouldn't use 16-bit encodings anyway.

Why not use UTF-8 encoding? In that case a single NUL byte really does mark the end of a text stream.
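
To make the NUL-byte point concrete, here is a minimal sketch in C. It is not taken from unpluck or from the Plucker sources; the function name and the sample arrays are made up for illustration. It scans a buffer the same way a byte-wise `buffer[pos] != '\0'` test would and reports how far it gets before hitting a zero byte. The sample text is "A" followed by U+4E2D, encoded once as UTF-8 and once as UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return how many bytes precede the first 0x00 byte,
 * looking at no more than len bytes. This mirrors a byte-wise
 * "buffer[pos] != '\0'" end-of-text test. */
size_t bytes_before_first_zero(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL);
    while (i < len && buf[i] != 0x00) {
        i++;
    }
    return i;
}

/* Sample text: "A" followed by U+4E2D, with no terminator appended. */
static const unsigned char sample_utf8[]    = { 0x41, 0xE4, 0xB8, 0xAD };
static const unsigned char sample_utf16le[] = { 0x41, 0x00, 0x2D, 0x4E };
```

With the UTF-8 bytes the scan runs through the whole buffer, because UTF-8 only produces a zero byte for U+0000 itself. With the UTF-16LE bytes the scan stops after the first byte, since the high byte of "A" is 0x00. That is why a byte-wise end-of-text test cannot be reused unchanged on UTF-16 data.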
Old 02-15-2006, 04:58 AM   #5
trfzwsc
Junior Member
trfzwsc began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Feb 2006
Multibyte charset

Yes, UTF-8 is OK. But for text that would otherwise be encoded as GB2312/JIS/KOR, UTF-16 results in a smaller PDB file (approx. 30% smaller) than UTF-8.
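
The size difference follows from the encodings themselves: most Chinese, Japanese and Korean characters need 3 bytes in UTF-8 but 2 bytes in UTF-16, while ASCII characters need 1 byte in UTF-8 and 2 in UTF-16. The sketch below (function names are made up for illustration) counts raw encoded bytes only. It ignores Plucker's record compression and any ASCII markup mixed into the text, both of which change the real-world figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_encoded_size(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Bytes needed in UTF-16: 2 inside the Basic Multilingual Plane,
 * 4 (a surrogate pair) above it. */
size_t utf16_encoded_size(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return (cp < 0x10000) ? 2 : 4;
}

/* Sum one of the size functions over an array of code points. */
size_t total_encoded_size(const uint32_t *cps, size_t count,
                          size_t (*size_of)(uint32_t))
{
    size_t total = 0;

    for (size_t i = 0; i < count; i++) {
        total += size_of(cps[i]);
    }
    return total;
}
```

For three CJK code points such as U+4E2D, U+6587 and U+65E5 this gives 9 bytes in UTF-8 against 6 in UTF-16, a saving of one third before compression. For pure ASCII the relation reverses: UTF-16 is twice the size.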
Old 02-15-2006, 05:43 AM   #6
Laurens
Jah Blessed
Laurens is no ebook tyro.
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Why Unicode?

Why do you want to use Unicode anyway? AFAIK, there is no Plucker viewer that can handle UTF-8 or UTF-16. Nor does Sunrise add any encoding information to the Plucker document itself. (There should be an MIBenum in the document's meta record, but this field is never actually written.)
Old 02-15-2006, 11:24 PM   #7
trfzwsc
Junior Member
trfzwsc began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Feb 2006
Unicode is good for Chinese/Japanese/Korean

The reason I like Unicode is that, AFAIK, the Windows Mobile API only supports wide characters. So for Chinese/Japanese/Korean, after uncompressing the text in the PDB file (whose charset may be GB2312/GBK/BIG5/JIS......), I must first convert the text strings to Unicode. If the HTML file is very large, the conversion (MultiByteToWideChar) may take many seconds, which is too slow. If the PDB file is encoded in UTF-16 instead, the resulting file is as small as with GB/BIG5..., while the speed is almost as fast as with ASCII. Perfect!

Yes, no Plucker viewer supports this feature. But I think it would not be difficult to modify an open-source viewer such as Vade-Mecum.




You said: (There should be an MIBenum in the document's meta record, but this field is never actually written.)

Plucker Desktop writes the MIBenum to the document's meta record if the original HTML file has meta charset information. I tested this.
Sunrise should write it too, because it is very important for non-English users, especially Asian users.


Last edited by trfzwsc; 02-15-2006 at 11:29 PM.
Old 02-16-2006, 03:28 AM   #8
Laurens
Jah Blessed
Laurens is no ebook tyro.
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
I wish Plucker supported Unicode! That would solve the whole character encoding issue. At any rate, UTF-8 would be the way to go, since it works well with standard string functions that assume NUL-terminated sequences and, more importantly, does not require a change to the Plucker specification. IIRC, MultiByteToWideChar() does not support UTF-8 under Windows Mobile, so you'd have to write something yourself. Very easy to do, however.
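
As a rough idea of what "write something yourself" could look like, here is a minimal UTF-8 to UTF-16 decoder in plain C. It is only a sketch under a few assumptions: the function name is made up, the input is NUL-terminated, and overlong encodings are not rejected, so it should not be treated as a hardened converter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 into UTF-16 code units.
 * Writes a terminating 0 unit. Returns the number of units written
 * (terminator excluded), or -1 on malformed input or if out is too small.
 * Overlong encodings are NOT rejected in this sketch. */
long utf8_to_utf16(const unsigned char *in, uint16_t *out, size_t out_cap)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != 0x00) {
        uint32_t cp;
        int extra;                        /* continuation bytes expected */

        if (*in < 0x80)                { cp = *in;        extra = 0; }
        else if ((*in & 0xE0) == 0xC0) { cp = *in & 0x1F; extra = 1; }
        else if ((*in & 0xF0) == 0xE0) { cp = *in & 0x0F; extra = 2; }
        else if ((*in & 0xF8) == 0xF0) { cp = *in & 0x07; extra = 3; }
        else return -1;                   /* invalid lead byte */
        in++;

        for (; extra > 0; extra--, in++) {
            if ((*in & 0xC0) != 0x80) return -1;  /* also catches an early NUL */
            cp = (cp << 6) | (uint32_t)(*in & 0x3F);
        }

        /* Reject values outside Unicode and encoded surrogates. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return -1;

        if (cp < 0x10000) {
            if (n + 1 >= out_cap) return -1;      /* keep room for terminator */
            out[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= out_cap) return -1;
            cp -= 0x10000;                        /* split into a surrogate pair */
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= out_cap) return -1;
    out[n] = 0;
    return (long)n;
}
```

The output buffer is a `uint16_t` array rather than `wchar_t` so the sketch does not depend on the platform's wide-character width. On Windows the units could be copied or cast into a `WCHAR` buffer.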
Old 04-02-2006, 07:58 AM   #9
kostix
Vade-Mecum developer
kostix began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Apr 2006
Location: Saint-Petersburg, Russia
Device: LOOX 420, Palm m500
WM does support UTF-8, actually

Quote:
Originally Posted by Laurens
I wish Plucker supported Unicode! That would solve the whole character encoding issue. At any rate, UTF-8 would be the way to go, since it works well with standard string functions that assume NUL-terminated sequences and, more importantly, does not require a change to the Plucker specification. IIRC, MultiByteToWideChar() does not support UTF-8 under Windows Mobile, so you'd have to write something yourself. Very easy to do, however.
I think there's no point in solving encoding issues *that* way.
Plucker DB supports any 8-bit and 7-bit encoding. Unfortunately it, too, falls into the trap of confusing encodings with charsets, but that doesn't stop it from supporting any "8-bit" charset and UTF-8. Furthermore, a Plucker DB can use different charsets for different (sequences of) records. This is achieved with the ExceptionalCharset metadata fields.

I've added support for charsets to the recent Vade-Mecum (0.6.6). It understands any charset for which WinCE on the target device has a corresponding code page. This includes UTF-8 and UTF-7, which *are* supported at least in WM2003. Different charsets for different records are supported too.
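
For readers wondering how a charset number from the metadata could be tied to a Windows code page, the following is a small illustrative lookup. It is not Vade-Mecum's actual code. The function name and the choice of entries are assumptions; the MIBenum values are the IANA-registered ones and the code page numbers are the usual Windows identifiers. Whether a given code page is actually installed depends on the device.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the illustrative table: IANA MIBenum -> Windows code page. */
struct charset_map {
    unsigned mibenum;
    unsigned codepage;
};

static const struct charset_map charset_table[] = {
    {    3, 20127 },  /* US-ASCII     */
    {    4, 28591 },  /* ISO-8859-1   */
    {   17,   932 },  /* Shift_JIS    */
    {  106, 65001 },  /* UTF-8        */
    { 1012, 65000 },  /* UTF-7        */
    { 2025,   936 },  /* GB2312       */
    { 2026,   950 },  /* Big5         */
    { 2251,  1251 },  /* windows-1251 */
    { 2252,  1252 },  /* windows-1252 */
};

/* Hypothetical helper: return the code page for a MIBenum,
 * or 0 if the charset is not in the table. */
unsigned mibenum_to_codepage(unsigned mibenum)
{
    size_t count = sizeof charset_table / sizeof charset_table[0];

    for (size_t i = 0; i < count; i++) {
        if (charset_table[i].mibenum == mibenum) {
            return charset_table[i].codepage;
        }
    }
    return 0;
}
```

A viewer could pass the returned value as the code page argument of MultiByteToWideChar() and fall back to a default charset when the lookup returns 0 or the conversion fails.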

What's missing is charset support for sequences of "linked" records (those having "Click here for the next/previous part" links), since the reference (Python) distiller attaches an ExceptionalCharset metadata block only to the first record of such a sequence. It does the same for sequences of continued records, but that's easily handled. Handling this behaviour for linked records is hell, but I'm working on it.

If you wish I can help you handle charset issues in Sunrise XP since I have some degree of understanding about how this is implemented in the reference distiller. And at least you can grab the source of VM and read this:
vim-charsets.html

I think we should move to e-mail to speed up the efforts.
You can mail me at flatworm{}users.sourceforge.net
and khomoutov{}gmail.com
Old 04-09-2006, 01:47 PM   #10
Laurens
Jah Blessed
Laurens is no ebook tyro.
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
Thanks for your detailed explanation, kostix. However, I won't get round to adding support for character sets to Sunrise XP anytime soon. I have this feature planned for v2.1, but that's at least several months away. (In the worst case, I might not get to it at all.)