MobileRead Forums - View Single Post - [Plugin] OpenDocHTMLImport

slowsmile · 01-13-2017, 06:23 PM

@Doitsu... Thanks for your advice. I've already incorporated getpreferredencoding() into my encoding check function. It really depends on which method of obtaining the encoding you trust the most. I trust the html meta tag method the most and the chardet method the least, and I use getpreferredencoding() as a fallback. The chardet results are also terribly inaccurate, which is why I'll be testing UnicodeDammit/detwingle today as a possible substitute.

The encoding check function passes the discovered encoding to another function that converts the input file to utf-8. If that converter function hits an error because the discovered encoding is wrong, it throws an exception and the plugin will not continue.
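As a rough illustration of that conversion step, here is a minimal sketch. The name convert_to_utf8 and its parameters are hypothetical and are not taken from the plugin; the sketch only assumes that the converter decodes strictly, so that a wrong encoding surfaces as an exception.

```python
def convert_to_utf8(src_path, dst_path, encoding):
    """Decode src_path using `encoding` and write the result as UTF-8.

    Hypothetical helper, not the plugin's actual converter.
    Raises UnicodeDecodeError if the bytes are not valid in `encoding`,
    or LookupError if Python does not know the codec name.
    """
    with open(src_path, 'rb') as f:
        raw = f.read()

    # Strict decoding: invalid byte sequences raise instead of being replaced.
    text = raw.decode(encoding)

    # newline='' writes line endings exactly as they were in the source.
    with open(dst_path, 'w', encoding='utf-8', newline='') as f:
        f.write(text)
    return dst_path
```

Note that a wrong guess does not always raise: single-byte encodings such as iso-8859-1 accept every byte, so a bad guess there produces wrong characters rather than an exception.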

My function (which I'm still testing) now looks like this:

Spoiler:

import locale

import chardet

def checkFileEncoding(file):
    html_encoding = None
    chardet_encoding = None

    # get the encoding info from the html meta headers
    # (iso-8859-1 maps every byte to a character, so this read cannot fail)
    with open(file, 'rt', encoding='iso-8859-1', errors='surrogateescape') as f:
        text = f.read(2048).lower()

    if 'charset=windows-1252' in text:
        html_encoding = 'cp1252'
    elif 'charset=windows-1250' in text:
        html_encoding = 'cp1250'
    elif 'charset=windows-1253' in text:
        html_encoding = 'cp1253'
    elif 'charset=windows-1254' in text:
        html_encoding = 'cp1254'
    elif 'charset=windows-1251' in text:
        html_encoding = 'cp1251'
    elif 'charset=windows-1255' in text:
        html_encoding = 'cp1255'
    elif 'charset=windows-1256' in text:
        html_encoding = 'cp1256'
    elif 'charset=windows-1257' in text:
        html_encoding = 'cp1257'
    elif 'charset=us-ascii' in text:
        html_encoding = 'us-ascii'
    elif 'charset=ibm437' in text:
        html_encoding = 'cp437'
    elif 'charset=ibm850' in text:
        html_encoding = 'cp850'
    elif 'charset=ibm852' in text:
        html_encoding = 'cp852'
    elif 'charset=ibm855' in text:
        html_encoding = 'cp855'
    elif 'charset=iso-8859-1' in text:
        html_encoding = 'iso-8859-1'
    elif 'charset=iso-8859-2' in text:
        html_encoding = 'iso-8859-2'
    elif 'charset=iso-8859-4' in text:
        html_encoding = 'iso-8859-4'
    elif 'charset=utf-8' in text:
        html_encoding = 'utf-8'
    else:
        # get the locale encoding, if needed
        html_encoding = locale.getpreferredencoding()

    # now get the file encoding using chardet
    with open(file, 'rb') as f:
        rawdata = f.read(2048)
    result = chardet.detect(rawdata)
    chardet_encoding = result['encoding']  # None when chardet cannot decide

    print(' >>> html enc...' + str(html_encoding))
    print(' >>> chardet enc...' + str(chardet_encoding))

    # compare the html and chardet encodings;
    # the html (or locale fallback) value wins whenever they differ
    final_encoding = chardet_encoding
    if chardet_encoding is None or chardet_encoding.upper() != html_encoding.upper():
        final_encoding = html_encoding

    print(' -- Input file encoding is: ' + final_encoding.upper())
    return final_encoding
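
The meta tag check in the function above could probably be condensed with a regular expression plus codecs.lookup(), which maps a declared name such as windows-1252 to the name Python uses for that codec. This is only a sketch: CHARSET_RE and meta_charset are hypothetical names, and unlike the elif chain it accepts any charset Python has a codec for, not just the ones listed.

```python
import codecs
import re

# Matches charset=... in <meta charset="..."> as well as in
# <meta http-equiv="Content-Type" content="text/html; charset=...">
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?\s*([a-z0-9._:-]+)', re.IGNORECASE)


def meta_charset(head_text):
    """Return Python's codec name for the declared charset, or None."""
    match = CHARSET_RE.search(head_text)
    if match is None:
        return None          # no charset declared in this text
    try:
        return codecs.lookup(match.group(1)).name
    except LookupError:
        return None          # declared, but not a codec Python knows
```

A caller would pass in the first couple of kilobytes of the file, read as iso-8859-1 as in the function above, and fall back to locale.getpreferredencoding() when the result is None.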

Last edited by slowsmile; 01-13-2017 at 06:52 PM.