Malformed byte sequence: Invalid byte 2 of 3-byte UTF-8 sequence. Check encoding


digireads
06-09-2010, 01:14 PM
I am having trouble getting an ePUB file to validate. I get the following error message:

Malformed byte sequence: Invalid byte 2 of 3-byte UTF-8 sequence. Check encoding.

Can anyone tell me what I should do to my source file to correct this?

Thanks,

Kevin

charleski
06-09-2010, 02:39 PM
Somewhere in your production chain you have an editor that's not handling UTF-8 properly and is inserting garbage that's being interpreted as a UTF-16 surrogate. You need to fix this or you'll run into encoding errors again in the future.

To fix the current problem, open the affected file in Notepad++ and use that to convert the encoding (in the Format menu). You may need to track down and change the character that's been mangled by the misbehaving editor.
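Before converting anything, it can help to find out which files inside the ePUB are affected. An ePUB is a ZIP container, so a short script can try to decode each text member as UTF-8. This is only a sketch: the function name `non_utf8_members` and the list of file suffixes are assumptions made for illustration, and it assumes every text file in the book is meant to be UTF-8 (a file that legitimately declares UTF-16 would be reported too).

```python
import zipfile

# Assumed list of text-like members worth checking; adjust for your book.
TEXT_SUFFIXES = (".html", ".xhtml", ".htm", ".opf", ".ncx", ".css", ".xml")


def non_utf8_members(epub):
    """Return the names of text members that do not decode as UTF-8.

    `epub` may be a path or a binary file-like object.
    """
    bad = []
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(TEXT_SUFFIXES):
                continue  # skip images, fonts and other binary members
            try:
                zf.read(name).decode("utf-8")
            except UnicodeDecodeError:
                bad.append(name)
    return bad
```

Calling `non_utf8_members("book.epub")` (with a hypothetical file name) returns a list such as `["Ops/037.html"]`, which narrows down what needs to be opened in the editor.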

jastern
04-25-2011, 09:02 PM
one of the problems with this version of epubcheck (http://code.google.com/p/epubcheck/) (all versions through 1.2 as of april 25, 2011) is that the error message fails to give enough information to help you (or me) know what to do with the file.

information that would be helpful would be the exact line number and character number in that line (i.e., "row and column") where the problem exists.

without that information, and with a huge file, it's much more of a guessing game.

compare the usefulness of this error message i get from epubcheck:

$ epubcheck fp.epub
ERROR: fp.epub/Ops/037.html: Malformed byte sequence: Invalid byte 1 of 1-byte UTF-8 sequence. Check encoding
$

with the error message i get from the command-line utility, "isutf8" (available, for instance, in the "moreutils" package on Ubuntu Linux):

$ isutf8 037.html
037.html: line 19, char 1, byte offset 1921: invalid UTF-8 code
$

doesn't it seem much more helpful to know exactly which line, and which character on that line, is causing the problem? i'll bet if you had that, you wouldn't have had to even post the question.

however, in my tests, i find that even isutf8 is not as helpful as it could be, since the problem, while it is on line 19, is not at character 1 on that line in my sample file. it is much further out on line 19 (that's a long line in my file).
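For anyone without isutf8 or emacs at hand, the same kind of report can be approximated with a few lines of Python. This is a minimal sketch, not a replacement for a validator: the function name `find_invalid_utf8` and the sample bytes are made up for illustration, and the column is counted in characters from the start of the line, which may differ from how other tools count.

```python
def find_invalid_utf8(data: bytes):
    """Yield (line, column, byte_offset, bad_byte) for each invalid position.

    Line and column are 1-based; the column counts characters, not bytes.
    """
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            offset = pos + err.start  # err.start is relative to the slice
            line = data.count(b"\n", 0, offset) + 1
            line_start = data.rfind(b"\n", 0, offset) + 1
            # Decode the part of the line before the bad byte to count characters.
            prefix = data[line_start:offset].decode("utf-8", errors="replace")
            yield line, len(prefix) + 1, offset, data[offset]
            pos += err.end  # resume after the invalid bytes


sample = b"ab\ncd\xfcef\n"  # made-up data with one stray Latin-1 byte
for line, col, offset, bad in find_invalid_utf8(sample):
    print(f"line {line}, char {col}, byte offset {offset}: invalid byte 0x{bad:02X}")
```

For a real file, read it in binary mode first, for example `find_invalid_utf8(open("037.html", "rb").read())`.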

the particular software that worked for me was emacs (http://www.gnu.org/software/emacs/), because when i opened the file and then tried to save it, it gave me this message:


These default coding systems were tried to encode text
in the buffer `037.html':
(utf-8-dos (63433 . 4194300))
However, each of them encountered characters it couldn't encode:
utf-8-dos cannot encode these: \374

Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).

raw-text emacs-mule no-conversion


and when i clicked on the \374, it took me to precisely the place in the buffer where the exact problem was. i could see it needed to be replaced with a "ü" (octal \374 is byte 0xFC, which is "ü" in Latin-1).
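The manual fix above can also be scripted when a file is mostly valid UTF-8 with a few stray single-byte characters. The sketch below decodes as UTF-8 and falls back to Latin-1 only for the bytes that fail. It rests on an assumption that has to be checked by eye: that the stray bytes really were Latin-1 (they could just as well be Windows-1252 or something else). The handler name `latin1-fallback` and the function `repair_to_utf8` are invented for this example.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: reinterpret undecodable bytes as Latin-1 (an assumption)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad.decode("latin-1"), err.end


codecs.register_error("latin1-fallback", _latin1_fallback)


def repair_to_utf8(data: bytes) -> bytes:
    """Return data re-encoded as clean UTF-8, guessing Latin-1 for stray bytes."""
    text = data.decode("utf-8", errors="latin1-fallback")
    return text.encode("utf-8")


# The byte emacs reported as \374 (octal) is 0xFC, which is "ü" in Latin-1.
assert bytes([0o374]).decode("latin-1") == "\u00fc"
```

Bytes that are already valid UTF-8 pass through unchanged, so only the problem spots are touched. It is still worth reviewing the result, since a wrong guess about the original encoding produces the wrong characters without raising any error.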

take-away for all of us programmers: when we create error messages, it is so much more helpful and time-saving for end-users if we take the time to:


tell the end-user exactly where the problem is,
tell them what to do about it, if at all possible, and
make sure that information is accurate.


e.g., "Please open file 037.html with a UTF-8 capable text editor, or hex editor, etc., and navigate to line 19, character 171, and see what is under the cursor at that point, and replace it with a character which is encoded correctly in UTF-8."

yes, this takes one person (us!) some time. but it saves humanity many times that.

Toxaris
04-26-2011, 04:07 AM
Please, throw epubcheck out of the window. Its messages are too cryptic to be usable.

Flightcrew usually gives better results, and its messages are easier to understand.