12-14-2009, 04:57 PM   #11
Valloric
Created Sigil, FlightCrew
Quote:
Originally Posted by kovidgoyal
I chose chardet because it's optimized for documents from the web and in practice most encoding issues arise with HTML files from the web.
Yeah, but I only need the last part, the heuristic and statistical analysis. Sigil currently covers the following:
  • (X)HTML files that specify the encoding with the <meta> tag
  • (X)HTML files that specify the encoding with the "encoding" attribute of the XML declaration
  • UTF-16 and UTF-32 (BE and LE) when they're not specified, through BOM detection
  • UTF-8 when not specified, through byte stream fingerprinting
All that's left are the files with an unspecified regional encoding. And for that I only need the analysis algorithms. And those from ICU are second to none.
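
To make the four cases in the list above concrete, here is a rough Python sketch of that kind of layered check. It is not Sigil's actual code (Sigil is C++); the function name `sniff_encoding`, the regexes, the 1024-byte window and the order of the checks are all assumptions made for illustration.

```python
import codecs
import re

# Hypothetical helper, not Sigil's implementation.
# UTF-32 marks are tested before UTF-16 ones because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# <?xml version="1.0" encoding="..."?>
_XML_DECL = re.compile(rb"""<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")
# <meta ... charset=...> in either the short or the http-equiv form
_META = re.compile(rb"""<meta[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._-]+)""", re.IGNORECASE)


def sniff_encoding(data: bytes):
    """Return an encoding name, or None if statistical analysis is still needed."""
    # 1. BOM detection for UTF-16 / UTF-32.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. Encoding declared in the XML declaration or a <meta> tag.
    #    Only the head of the file is scanned (the window size is arbitrary).
    head = data[:1024]
    match = _XML_DECL.search(head) or _META.search(head)
    if match:
        return match.group(1).decode("ascii").lower()

    # 3. UTF-8 "fingerprinting": byte streams in regional encodings that
    #    contain non-ASCII bytes rarely form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None  # unspecified regional encoding
```

The `None` branch is the only place where a statistical detector (ICU's or chardet's) would have to take over.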

But I'm not ruling out chardet yet. ICU is very big, and getting it to work under CMake... well, it may turn out that pulling the chardet sources and their dependencies out of Mozilla trunk is less painful. We'll see.

Quote:
Originally Posted by kovidgoyal
And it has a great python library
Rub it in.