MobileRead Forums - View Single Post - track down the unknown exception in flightcrew

KevinH · 03-01-2018, 10:36 AM

Okay, I found out that my change just hid the problem by causing a normal error so flightcrew never reached the problem statement.

I rebuilt flightcrew with some debugging statements added and received the following url as causing the incorrect utf8-encoding problem:

Code:

Utf8PathToBoostPath: https://www.normacomics.com/ficha.as...crees_que_eres
Error during run: std::exception

A grep for this string in the book html found the following:

Code:

The_Bellybuttons.xhtml:
<a class="external text el" href="https://www.normacomics.com/ficha.asp?562829001/0/ombligos_01:_%BFtu_quien_te_crees_que_eres">

And in this case flightcrew is correct: the encoded part is %BF, i.e. the byte 0xbf, which on its own is not a valid utf-8 character (in utf-8, 0xbf can only appear as a continuation byte inside a multi-byte sequence).

It is in fact an unconverted latin-1 character:
BF ¿ inverted question mark

So flightcrew (fc), in trying to test this href, decodes the path and finds a byte that is not valid utf-8.
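To see the same failure outside of flightcrew, here is a minimal Python sketch. It is not how flightcrew does its check (flightcrew is C++); the helper name `escapes_are_utf8` is made up for illustration. It percent-decodes the href to raw bytes and then tries to read those bytes as utf-8.

```python
from urllib.parse import unquote_to_bytes


def escapes_are_utf8(url: str) -> bool:
    """Return True if the percent-decoded bytes of url form valid utf-8.

    Hypothetical helper for illustration only, not part of flightcrew.
    """
    raw = unquote_to_bytes(url)  # "%BF" becomes the single byte 0xbf
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The href found by grep in The_Bellybuttons.xhtml
href = ("https://www.normacomics.com/ficha.asp?562829001/0/"
        "ombligos_01:_%BFtu_quien_te_crees_que_eres")

print(escapes_are_utf8(href))                                     # False
# Read as latin-1, the same byte is the inverted question mark (U+00BF)
print(unquote_to_bytes("%BF").decode("latin-1") == "\u00bf")      # True
# The utf-8 encoding of that character is the two bytes 0xc2 0xbf
print(escapes_are_utf8(href.replace("%BF", "%C2%BF")))            # True
```

So the lone %BF only makes sense as latin-1; the utf-8 form of the same character would have been %C2%BF.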

The problem is that servers do not need to use utf-8 for their internal files and paths, and as a result you can end up with strange urls that cannot be represented in a utf-8 encoded xhtml file.

I am not sure what to do about this.

An epub with lots of external http:// links will simply not work on many e-readers, and not at all on e-readers that have no network connection.

AFAIK, it is illegal/discouraged in epub2 to load external resources that way. Epub2 is not a website in a box and was never meant to be.

They are allowed in epub3 for audio and video resources (not text) but only if marked as external resources and listed in the epub3 manifest as such.

The real issue is that old servers can still use file paths and other names that are in latin-1 (although this is highly discouraged), but urls in xhtml inside an epub must be represented in the encoding of the xhtml page, and that is utf-8.

Either way, the correct thing to do in this case is to not tell flightcrew your xhtml file is encoded as utf-8 and then have a url in it that is latin-1 encoded but hidden behind url encoding.
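If someone wanted to clean up such links by hand, one possible approach (only a sketch, and an assumption on my part rather than anything Sigil or flightcrew does) is to rewrite each run of percent-escapes that is not valid utf-8 as the utf-8 escapes for the same latin-1 characters. The function name `reencode_latin1_escapes` is hypothetical. Note that the rewritten url may no longer resolve if the server only understands the latin-1 form, so this does not settle what to do about the link itself.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# One or more consecutive %XX escapes
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def reencode_latin1_escapes(url: str) -> str:
    """Rewrite non-utf-8 percent-escape runs as utf-8 escapes.

    Assumes (without proof) that any run that is not valid utf-8
    was meant as latin-1. Hypothetical helper for illustration.
    """
    def fix(match: re.Match) -> str:
        raw = unquote_to_bytes(match.group(0))
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            # Interpret the bytes as latin-1, then escape them again as utf-8
            return quote(raw.decode("latin-1"), safe="")
        return match.group(0)  # already valid utf-8, leave untouched

    return _ESCAPE_RUN.sub(fix, url)


print(reencode_latin1_escapes("ombligos_01:_%BFtu_quien_te_crees_que_eres"))
# ombligos_01:_%C2%BFtu_quien_te_crees_que_eres
```

Escapes that already decode as utf-8 (including plain ASCII ones such as %20) are passed through unchanged, so running it twice gives the same result.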

Last edited by KevinH; 03-01-2018 at 04:37 PM.