MobileRead Forums - View Single Post - Malformed byte sequence: Invalid byte 2 of 3-byte UTF-8 sequence. Check encoding

jastern · 04-25-2011, 08:02 PM

one of the problems with this version of epubcheck (all versions through 1.2 as of april 25, 2011) is that the error message fails to give enough information to help you (or me) know what to do with the file.

what would help is the exact line number and character position on that line (i.e., "row and column") where the problem occurs.

without that information, and with a huge file, it's much more of a guessing game.

compare the usefulness of this error message i get from epubcheck:

Code:

$ epubcheck fp.epub
ERROR: fp.epub/Ops/037.html: Malformed byte sequence: Invalid byte 1 of 1-byte UTF-8 sequence. Check encoding
$

with the error message i get from the command-line utility, "isutf8" (available, for instance, in the "moreutils" package on Ubuntu Linux):

Code:

$ isutf8 037.html
037.html: line 19, char 1, byte offset 1921: invalid UTF-8 code
$

doesn't it seem much more helpful to know exactly which line, and which character on that line, is causing the problem? i'll bet if you had that, you wouldn't even have had to post the question.

however, in my tests, i find that even isutf8 is not as helpful as it could be, since the problem, while it is on line 19, is not at character 1 on that line in my sample file. it is much further out on line 19 (that's a long line in my file).
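to make "row and column" concrete, here is a minimal sketch of a checker that reports the line, the character position on that line, and the byte offset of every invalid byte. it is not how epubcheck or isutf8 work internally; the names `BadByte` and `find_invalid_utf8` are made up for this example, and it simply leans on python's built-in UTF-8 decoder. columns are counted in characters, and an earlier invalid sequence on the same line counts as one character.

```python
from typing import Iterator, NamedTuple


class BadByte(NamedTuple):
    line: int    # 1-based line number
    column: int  # 1-based character position within that line
    offset: int  # 0-based byte offset from the start of the data
    value: int   # the offending byte itself


def find_invalid_utf8(data: bytes) -> Iterator[BadByte]:
    """Yield the location of each byte sequence that is not valid UTF-8."""
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return  # the rest of the data is clean
        except UnicodeDecodeError as err:
            start = pos + err.start
            line = data.count(b"\n", 0, start) + 1
            line_start = data.rfind(b"\n", 0, start) + 1
            # Decode only the part of the line before the bad byte so that
            # multi-byte characters count as one column each.
            prefix = data[line_start:start].decode("utf-8", errors="replace")
            yield BadByte(line, len(prefix) + 1, start, data[start])
            pos += err.end  # resume just after the invalid sequence
```

reading the file with `open("037.html", "rb").read()` and looping over `find_invalid_utf8(...)` would then print one location per bad byte. re-slicing the data on every error is wasteful for very large files, but it keeps the sketch short.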

the particular software that worked for me was emacs, because when i opened the file and then tried to save it, it gave me this message:

Code:

These default coding systems were tried to encode text
in the buffer `037.html':
  (utf-8-dos (63433 . 4194300))
However, each of them encountered characters it couldn't encode:
  utf-8-dos cannot encode these: \374

Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.

Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
   to remove or modify the problematic characters,
or specify any other coding system (and risk losing
   the problematic characters).

  raw-text emacs-mule no-conversion

and when i clicked on the \374, it took me to precisely the place in the buffer where the exact problem was. i could see it needed to be replaced with a "ü".
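for what it's worth, \374 is emacs showing a raw byte in octal: octal 374 is 0xFC, which is "ü" in ISO-8859-1 (Latin-1), while the UTF-8 encoding of "ü" is the two bytes 0xC3 0xBC. that suggests the file had a stray Latin-1 character in it. the sketch below shows one way such bytes could be repaired in bulk. it assumes every invalid byte really is Latin-1, which is only a guess, so review the result before overwriting anything. the handler name `latin1_fallback` and the function `repair_to_utf8` are made up for this example.

```python
import codecs


def latin1_fallback(err: UnicodeError):
    """Decode error handler: reinterpret the invalid bytes as Latin-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Latin-1 maps every byte to a code point, so this decode cannot fail.
    return bad.decode("latin-1"), err.end


# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("latin1_fallback", latin1_fallback)


def repair_to_utf8(data: bytes) -> bytes:
    """Return data re-encoded as UTF-8, treating stray bytes as Latin-1."""
    text = data.decode("utf-8", errors="latin1_fallback")
    return text.encode("utf-8")
```

bytes that are already valid UTF-8 pass through unchanged; only the sequences the decoder rejects are reinterpreted.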

take-away for all of us programmers: when we create error messages, it is so much more helpful and time-saving for end-users if we take the time to:

1. tell the end-user exactly where the problem is,
2. tell them what to do about it, if at all possible, and
3. make sure that information is accurate.

e.g., "Please open file 037.html with a UTF-8 capable text editor, or hex editor, etc., and navigate to line 19, character 171, and see what is under the cursor at that point, and replace it with a character which is encoded correctly in UTF-8."
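a message like that is cheap to produce once the location is known. below is a minimal sketch of a formatter that says where the problem is and what to do about it. the function name `format_encoding_error` and the wording are made up for illustration, and the Latin-1 hint is an assumption about where the stray byte came from, not something the program can know.

```python
def format_encoding_error(filename: str, line: int, column: int, bad_byte: int) -> str:
    """Build a where-it-is and what-to-do message for one invalid byte."""
    # Assumption: the stray byte may have come from a Latin-1 (ISO-8859-1)
    # file, so show what it would have meant there. This is a guess.
    guess = bytes([bad_byte]).decode("latin-1")
    hint = ""
    if guess.isprintable():
        hint = f' if the file was saved as Latin-1, that byte was probably meant to be "{guess}".'
    return (
        f"{filename}: line {line}, character {column}: "
        f"byte 0x{bad_byte:02X} is not valid UTF-8. "
        "open the file in a UTF-8 capable editor, go to that position, "
        "and replace the character with one encoded in UTF-8."
        + hint
    )
```

calling `format_encoding_error("037.html", 19, 171, 0xFC)` gives a message that names the file, the line, the character position and the byte, and suggests "ü" as the likely intended character.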

yes, this takes one person (us!) some time. but it saves humanity many times that.

Post #3: what is missing from the epubcheck error message. Last edited by jastern; 04-25-2011 at 08:18 PM.