Malformed byte sequence: Invalid byte 2 of 3-byte UTF-8 sequence. Check encoding

digireads · 06-09-2010, 01:14 PM

I am having trouble getting an ePUB file to validate. I get the following error message:

Malformed byte sequence: Invalid byte 2 of 3-byte UTF-8 sequence. Check encoding.

Can anyone tell me what I should do to my source file to correct this?

Thanks,

Kevin

charleski · 06-09-2010, 02:39 PM

Somewhere in your production chain you have an editor that's not handling UTF-8 properly and is inserting garbage that's being interpreted as a UTF-16 surrogate. You need to fix this or you'll run into encoding errors again in the future.

To fix the current problem, open the affected file in Notepad++ and use that to convert the encoding (in the Format menu). You may need to track down and change the character that's been mangled by the misbehaving editor.
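
For anyone wondering what this kind of message means at the byte level, here is a minimal Python sketch. It assumes the stray bytes came from a tool that saved text as Latin-1 (ISO-8859-1), which is only one possible cause. The function name `describe_decode_failure` is made up for illustration, and the reason text comes from Python's own decoder, not from epubcheck.

```python
def describe_decode_failure(data: bytes) -> str:
    """Describe the first UTF-8 decoding failure in data, or return 'ok'."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        bad = data[err.start]
        return f"byte 0x{bad:02x} at offset {err.start}: {err.reason}"
    return "ok"


# 'é' saved as Latin-1 is the single byte 0xE9. UTF-8 reads 0xE9 as the start
# of a 3-byte sequence, and the ASCII space that follows cannot continue it.
print(describe_decode_failure(b"caf\xe9 au lait"))

# 'ü' saved as Latin-1 is 0xFC, which is never a valid UTF-8 start byte.
print(describe_decode_failure(b"M\xfcnchen"))

# The same word correctly encoded as UTF-8 decodes cleanly.
print(describe_decode_failure(b"M\xc3\xbcnchen"))
```

The first case is the same kind of failure as "Invalid byte 2 of 3-byte UTF-8 sequence": a byte that looks like the start of a multi-byte character, followed by a byte that does not belong to it.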

jastern · 04-25-2011, 09:02 PM

one of the problems with this version of epubcheck (all versions through 1.2 as of april 25, 2011) is that the error message fails to give enough information to help you (or me) know what to do with the file.

information that would be helpful would be the exact line number and character number in that line (i.e., "row and column") where the problem exists.

without that information, and with a huge file, it's much more of a guessing game.

compare the usefulness of this error message i get from epubcheck:

Code:

$ epubcheck fp.epub
ERROR: fp.epub/Ops/037.html: Malformed byte sequence: Invalid byte 1 of 1-byte UTF-8 sequence. Check encoding
$

with the error message i get from the command-line utility, "isutf8" (available, for instance, in the "moreutils" package on Ubuntu Linux):

Code:

$ isutf8 037.html
037.html: line 19, char 1, byte offset 1921: invalid UTF-8 code
$

doesn't it seem much more helpful to know exactly which line, and which character on that line, is causing the problem? i'll bet if you had that, you wouldn't even have had to post the question.

however, in my tests, i find that even isutf8 is not as helpful as it could be, since the problem, while it is on line 19, is not at character 1 on that line in my sample file. it is much further out on line 19 (that's a long line in my file).
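
if you want both the line and the real column, a short script can compute them. this is a minimal sketch in Python, not what isutf8 or epubcheck actually do; the function name `find_invalid_utf8` is made up for illustration, and it assumes lines end in "\n". lines and columns are 1-based, the byte offset is 0-based, and the column is counted in characters rather than bytes.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, int, int, int]]:
    """Return (line, column, byte_offset, byte_value) for each invalid spot."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is clean
        except UnicodeDecodeError as err:
            offset = pos + err.start
            line = data.count(b"\n", 0, offset) + 1
            line_start = data.rfind(b"\n", 0, offset) + 1
            # Count characters, not bytes, up to the bad byte. Earlier invalid
            # sequences on the same line count as one replacement character.
            prefix = data[line_start:offset].decode("utf-8", "replace")
            problems.append((line, len(prefix) + 1, offset, data[offset]))
            pos += err.end  # resume scanning after the invalid bytes
    return problems


sample = b"<p>ok</p>\n<p>gr\xc3\xbcn M\xfcnchen</p>\n"
for line, column, offset, value in find_invalid_utf8(sample):
    print(f"line {line}, char {column}, byte offset {offset}: invalid byte 0x{value:02x}")
```

to check a real file, read it in binary mode (for example `open("037.html", "rb").read()`) and pass the bytes to the function.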

the particular software that worked for me was emacs, because when i opened the file and then tried to save it, it gave me this message:

Code:

These default coding systems were tried to encode text
in the buffer `037.html':
  (utf-8-dos (63433 . 4194300))
However, each of them encountered characters it couldn't encode:
  utf-8-dos cannot encode these: \374

Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
   to remove or modify the problematic characters,
or specify any other coding system (and risk losing
   the problematic characters).

  raw-text emacs-mule no-conversion

and when i clicked on the \374, it took me to precisely the place in the buffer where the exact problem was. i could see it needed to be replaced with a "ü".
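
if a file has many such bytes, fixing them by hand gets tedious. here is a minimal Python sketch of a bulk repair, under the loud assumption that every undecodable byte is a Latin-1 (ISO-8859-1) character, as \374 (0xFC, "ü") was in my file. the handler name "latin1_fallback" and the function `repair_to_utf8` are made up for illustration. if your stray bytes came from some other encoding, this will silently produce the wrong characters, so check the result.

```python
import codecs


def _latin1_fallback(err):
    """Decode error handler: reinterpret the undecodable bytes as Latin-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_to_utf8(data: bytes) -> bytes:
    """Keep valid UTF-8 as-is and convert stray bytes, assumed to be Latin-1."""
    text = data.decode("utf-8", errors="latin1_fallback")
    return text.encode("utf-8")


# Valid UTF-8 (0xC3 0xBC) is left alone; the lone 0xFC becomes a proper "ü".
print(repair_to_utf8(b"gr\xc3\xbcn M\xfcnchen"))
```

text that came from a Windows program may really be Windows-1252 rather than Latin-1; the two differ in the 0x80 to 0x9F range (curly quotes, dashes and so on), so that is worth checking before running this over a whole book.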

take-away for all of us programmers: when we create error messages, it is so much more helpful and time-saving for end-users if we take the time to:

- tell the end-user exactly where the problem is, and
- what to do about it, if at all possible. and
- we need to make sure that information is accurate.

e.g., "Please open file 037.html with a UTF-8 capable text editor, or hex editor, etc., and navigate to line 19, character 171, and see what is under the cursor at that point, and replace it with a character which is encoded correctly in UTF-8."

yes, this takes one person (us!) some time. but it saves humanity many times that.

Toxaris · 04-26-2011, 04:07 AM

Please, throw epubcheck out of the window. The messages are too cryptic to be usable.

FlightCrew usually gives better results, and its messages are easier to understand.




