Yes, the offset gumbo records is a byte offset from the start of a UTF-8 encoded file or string. The column number is "proper" in that it is measured in Unicode code points, not in bytes. The solution is to use the routine previously posted by Doitsu to convert line and column numbers inside Python to an offset in Unicode code points, if that is what you want. Offsets are hard to work with because they are encoding dependent, whereas line and column given in code points should be easier to work with and to convert to an offset in any encoding you like.
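For illustration only, here is a minimal sketch of that kind of conversion. It is not Doitsu's routine; the function name `line_col_to_offsets` is made up for this example, and it assumes line and column are both 1-based, that lines end in `\n`, and that a tab counts as a single column (adjust if your gumbo build treats tabs differently).

```python
def line_col_to_offsets(text, line, col):
    """Convert a line/column position to offsets into `text`.

    Assumptions (not confirmed by gumbo itself): `line` and `col` are
    1-based, `col` counts Unicode code points, lines end in "\n", and a
    tab counts as one column.

    Returns (codepoint_offset, utf8_byte_offset), both 0-based.
    """
    if line < 1 or col < 1:
        raise ValueError("line and col are assumed to be 1-based")

    # Walk forward to the first character of the requested line.
    pos = 0
    for _ in range(line - 1):
        newline = text.find("\n", pos)
        if newline == -1:
            raise ValueError("line number is past the end of the text")
        pos = newline + 1

    # A Python 3 str is indexed by code point, so this is already
    # the offset in code points.
    cp_offset = pos + (col - 1)
    if cp_offset > len(text):
        raise ValueError("column is past the end of the text")

    # The byte offset depends on the encoding: encode everything before
    # the position and count the bytes.
    byte_offset = len(text[:cp_offset].encode("utf-8"))
    return cp_offset, byte_offset


# "é" is one code point but two bytes in UTF-8, so the two offsets differ.
print(line_col_to_offsets("é\nxyz", 2, 2))  # (3, 4)
```

Swapping `"utf-8"` for another codec in the `encode` call would give the byte offset in that encoding instead, which is the sense in which line and column in code points are easier to carry around.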
KevinH