MobileRead Forums - View Single Post

jgoguen · 02-12-2013, 06:50 PM

Character detection issues

I'm trying to find a way to detect the proper encoding of a file. In most cases the detection is correct, but there are a few cases with short files where it isn't so good. I know that short files generally yield bad results, but I'm hoping there's something I'm doing wrong or something that could be improved.

In my plugin, I'm using the Container's get_raw function to get the raw data for files. The file I'm running into trouble with (I'm sure there are others I haven't noticed yet) has these contents:

Spoiler:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="../Styles/style001.css" rel="stylesheet" type="text/css" />
<title></title>
</head>
<body>
<h1 id="heading_id_1">Ringraziamenti</h1>
<p class="norm">Con amore e tantissimi grazie a mia madre, Claudia, e a mia sorella Jemima, per l'aiuto e il sostegno. Infinita riconoscenza a tutti coloro che mi hanno fornito riscontri e consigli, in particolare Scott Bicheno, Max Schaefer, Simon Kavanagh e Oliver Cheetham.</p>
<p class="norm">Profondo amore e gratitudine a Emma Bircham, ancora e per sempre.</p>
<p class="norm">Grazie a tutti quelli della Macmillan, soprattutto al mio editor Peter Lavery per l'incredibile appoggio. E immensa gratitudine a Mic Cheetham, che mi ha aiutato più di quanto sappia esprimere.</p>
<p class="norm">Non ho spazio sufficiente a ringraziare tutti gli scrittori che hanno esercitato una grande influenza su di me, ma voglio menzionarne due la cui opera è costante fonte di ispirazione e stupore. Dunque, a M. John Harrison, e alla memoria di Mervyn Peake, la mia umile e sentita riconoscenza.</p>
<p class="norm">Senza di loro non avrei mai potuto scrivere questo libro.</p>
<p class="norm"> </p>
</body>
</html>

The chardet library detects this as ISO-8859-2 with confidence 85.5%, but decoding as ISO-8859-2 gives this:

Spoiler:

?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="../Styles/style001.css" rel="stylesheet" type="text/css" />
<title></title>
</head>
<body>
<h1 id="heading_id_1">Ringraziamenti</h1>
<p class="norm">Con amore e tantissimi grazie a mia madre, Claudia, e a mia sorella Jemima, per l'aiuto e il sostegno. Infinita riconoscenza a tutti coloro che mi hanno fornito riscontri e consigli, in particolare Scott Bicheno, Max Schaefer, Simon Kavanagh e Oliver Cheetham.</p>
<p class="norm">Profondo amore e gratitudine a Emma Bircham, ancora e per sempre.</p>
<p class="norm">Grazie a tutti quelli della Macmillan, soprattutto al mio editor Peter Lavery per l'incredibile appoggio. E immensa gratitudine a Mic Cheetham, che mi ha aiutato piĂš di quanto sappia esprimere.</p>
<p class="norm">Non ho spazio sufficiente a ringraziare tutti gli scrittori che hanno esercitato una grande influenza su di me, ma voglio menzionarne due la cui opera Ă¨ costante fonte di ispirazione e stupore. Dunque, a M. John Harrison, e alla memoria di Mervyn Peake, la mia umile e sentita riconoscenza.</p>
<p class="norm">Senza di loro non avrei mai potuto scrivere questo libro.</p>
<p class="norm"> </p>
</body>
</html>

Obviously this is wrong: the accented characters are corrupted. But how can I tell (if I can at all) when a "successful" decode produces "bad" output?

For now, I'm going to proceed by simply assuming that files are encoded correctly, and only use chardet as a fallback if an exception is raised.

Is there a better way to handle this to improve character set detection, or is this about as good as I can expect because the sample is so small?
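
For illustration, the "trust the file first, guess only on failure" approach described above could look roughly like the sketch below. It is not calibre's or chardet's code: `declared_encoding`, `decode_with_fallback` and the `detect` hook are hypothetical names made up for this example. In a plugin, the hook would presumably wrap something like `chardet.detect(raw)['encoding']`, and `raw` would be the bytes read from the container.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the
# very start of the data (ASCII-compatible encodings only).
_XML_DECL = re.compile(rb'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(raw):
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL.match(raw[:200])
    return match.group(1).decode('ascii') if match else None


def decode_with_fallback(raw, detect=None):
    """Decode bytes, preferring strict decodes over a statistical guess.

    `detect` is a hypothetical hook: a callable that takes bytes and
    returns an encoding name or None (for example a thin wrapper
    around chardet). It is only consulted when strict decoding fails.
    Returns a (text, encoding_used) tuple.
    """
    for enc in (declared_encoding(raw), 'utf-8'):
        if not enc:
            continue
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding name, try the next one

    guess = detect(raw) if detect else None
    if guess:
        try:
            return raw.decode(guess), guess
        except (UnicodeDecodeError, LookupError):
            pass

    # Last resort: never raises, but may contain replacement characters.
    return raw.decode('utf-8', errors='replace'), 'utf-8'


if __name__ == '__main__':
    sample = '<?xml version="1.0" encoding="utf-8"?><p>più è</p>'.encode('utf-8')
    text, used = decode_with_fallback(sample, detect=lambda b: 'iso-8859-2')
    print(used, text)
    # The same bytes pushed through the wrong codec, as in the post:
    print(sample.decode('iso-8859-2'))
```

The ordering is the design choice here. Text containing non-ASCII bytes rarely passes a strict UTF-8 decode by accident, so a clean strict decode is much stronger evidence than a guess made from a short sample. ISO-8859-2, like most single-byte codecs, accepts nearly any byte sequence, which is why the bad decode in the post "succeeds" without raising.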