MobileRead Forums - View Single Post

KevinH · 01-02-2018, 12:03 PM

Yes, the offset that gumbo records is a byte offset from the start of a UTF-8 encoded file or string. The column number is "proper" in that it is measured in Unicode code points, not in bytes. The solution is to use the routine Doitsu posted earlier to convert line and column numbers, inside Python, to an offset in Unicode code points, if that is what you want. Byte offsets are hard to work with because they are encoding dependent, whereas a line and column given in code points should be easier to work with and to convert to any encoding you like.
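To illustrate the kind of conversion described above, here is a minimal sketch. It is not Doitsu's routine; the function name `line_col_to_offsets` is hypothetical, and it assumes that line and column are both 1-based, that the column counts code points, and that lines are separated by a single "\n".

```python
def line_col_to_offsets(text, line, col):
    """Return (codepoint_offset, utf8_byte_offset) for a line/column pair.

    Assumptions (not confirmed by the post): line and col are 1-based,
    col counts Unicode code points, and lines end with a single "\n".
    """
    lines = text.split("\n")
    if not 1 <= line <= len(lines):
        raise ValueError("line out of range")
    if not 1 <= col <= len(lines[line - 1]) + 1:
        raise ValueError("column out of range")

    # Code points before the target line: each earlier line plus its "\n".
    cp_offset = sum(len(prev) + 1 for prev in lines[:line - 1]) + (col - 1)

    # The byte offset depends on the encoding, so encode the prefix to get it.
    byte_offset = len(text[:cp_offset].encode("utf-8"))
    return cp_offset, byte_offset


sample = "héllo\nwörld"
cp, nbytes = line_col_to_offsets(sample, 2, 3)
print(cp, nbytes, sample[cp])
```

The two results differ as soon as any non-ASCII character precedes the position, which is why the code point offset is the one to use when indexing into a Python str.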

KevinH
