Old 01-13-2017, 10:05 PM   #39
KevinH
Sigil Developer
slowsmile,

Sigil uses the following code to identify the encoding of an HTML file when you open one with File->Open:

https://github.com/Sigil-Ebook/Sigil...ngResolver.cpp

The algorithm looks roughly like this:
- read the file as bytes
- check the first 4 bytes for a byte order mark to identify utf-8, utf-16le, utf-16be, utf-32le, or utf-32be
- decode up to the first 1024 bytes as utf-8, ignoring errors, to create a text snippet
- use regular expressions on the snippet to look for encoding or charset attributes, with or without delimiters, extract the encoding name, and use that codec to convert the file
- if all else fails, quickly parse the entire file as utf-8 and, if there are no errors, use utf-8
- finally, just use the local encoding
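
If it helps to see the steps above in code, here is a rough Python sketch of the same idea. It is not Sigil's actual code (that is the C++ file linked above). The function name guess_encoding, the BOM check order, and the regular expression are my own assumptions for illustration.

```python
import codecs
import locale
import re

# Check the 4-byte BOMs first: the UTF-32LE BOM starts with the UTF-16LE BOM.
# (This ordering is an assumption of the sketch, not taken from Sigil.)
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical pattern: matches encoding="..." or charset=..., quoted or not.
DECL_RE = re.compile(
    r"""(?:encoding|charset)\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def guess_encoding(data: bytes) -> str:
    """Return a Python codec name guessed for the raw bytes of an HTML file."""
    # 1. Look for a byte order mark in the first 4 bytes.
    head = data[:4]
    for bom, name in BOMS:
        if head.startswith(bom):
            return name

    # 2. Decode up to the first 1024 bytes as utf-8, ignoring errors.
    snippet = data[:1024].decode("utf-8", errors="ignore")

    # 3. Look for a declared encoding or charset in that snippet.
    match = DECL_RE.search(snippet)
    if match:
        try:
            return codecs.lookup(match.group(1)).name
        except LookupError:
            pass  # unknown name, fall through to the next step

    # 4. Try the whole file as strict utf-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 5. Last resort: the local (locale) encoding.
    return locale.getpreferredencoding(False)
```

You would call it on the raw bytes, for example guess_encoding(open(path, "rb").read()), and then decode the bytes with the codec name it returns. Note that codecs.lookup normalizes names, so a declared "windows-1252" comes back as Python's own name for that codec.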

Hope this helps,
KevinH