Character detection issues

jgoguen · 02-12-2013, 07:50 PM

I'm trying to find a way to detect the proper encoding of a file. In most cases the detection is correct, but there are a few cases with short files where it isn't so good. I know that short files generally yield bad results, but I'm hoping there's something I'm doing wrong or that could be improved.

In my plugin, I'm using the Container's get_raw function to get the raw data for files. The file I'm running into trouble with (I'm sure there are others I haven't noticed yet) has these contents:

Spoiler:

    <?xml version="1.0" encoding="utf-8" standalone="no"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <link href="../Styles/style001.css" rel="stylesheet" type="text/css" />
    <title></title>
    </head>
    <body>
    <h1 id="heading_id_1">Ringraziamenti</h1>
    <p class="norm">Con amore e tantissimi grazie a mia madre, Claudia, e a mia sorella Jemima, per l'aiuto e il sostegno. Infinita riconoscenza a tutti coloro che mi hanno fornito riscontri e consigli, in particolare Scott Bicheno, Max Schaefer, Simon Kavanagh e Oliver Cheetham.</p>
    <p class="norm">Profondo amore e gratitudine a Emma Bircham, ancora e per sempre.</p>
    <p class="norm">Grazie a tutti quelli della Macmillan, soprattutto al mio editor Peter Lavery per l'incredibile appoggio. E immensa gratitudine a Mic Cheetham, che mi ha aiutato più di quanto sappia esprimere.</p>
    <p class="norm">Non ho spazio sufficiente a ringraziare tutti gli scrittori che hanno esercitato una grande influenza su di me, ma voglio menzionarne due la cui opera è costante fonte di ispirazione e stupore. Dunque, a M. John Harrison, e alla memoria di Mervyn Peake, la mia umile e sentita riconoscenza.</p>
    <p class="norm">Senza di loro non avrei mai potuto scrivere questo libro.</p>
    <p class="norm"> </p>
    </body>
    </html>

The chardet library detects this as ISO-8859-2 with confidence 85.5%, but decoding as ISO-8859-2 gives this:

Spoiler:

    ?xml version="1.0" encoding="utf-8" standalone="no"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <link href="../Styles/style001.css" rel="stylesheet" type="text/css" />
    <title></title>
    </head>
    <body>
    <h1 id="heading_id_1">Ringraziamenti</h1>
    <p class="norm">Con amore e tantissimi grazie a mia madre, Claudia, e a mia sorella Jemima, per l'aiuto e il sostegno. Infinita riconoscenza a tutti coloro che mi hanno fornito riscontri e consigli, in particolare Scott Bicheno, Max Schaefer, Simon Kavanagh e Oliver Cheetham.</p>
    <p class="norm">Profondo amore e gratitudine a Emma Bircham, ancora e per sempre.</p>
    <p class="norm">Grazie a tutti quelli della Macmillan, soprattutto al mio editor Peter Lavery per l'incredibile appoggio. E immensa gratitudine a Mic Cheetham, che mi ha aiutato piĂš di quanto sappia esprimere.</p>
    <p class="norm">Non ho spazio sufficiente a ringraziare tutti gli scrittori che hanno esercitato una grande influenza su di me, ma voglio menzionarne due la cui opera Ă¨ costante fonte di ispirazione e stupore. Dunque, a M. John Harrison, e alla memoria di Mervyn Peake, la mia umile e sentita riconoscenza.</p>
    <p class="norm">Senza di loro non avrei mai potuto scrivere questo libro.</p>
    <p class="norm"> </p>
    </body>
    </html>

Obviously this is wrong: the accented characters are corrupted. But how can I tell (if I can at all) when a "successful" decode produces "bad" output?

For now, I'm going to proceed by simply assuming that files are encoded correctly, and only use chardet as a fallback in the event that an exception is raised. Is there a better way to handle this to improve character set detection, or is this about as good as I can expect because the sample is so small?
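
The fallback plan described in this post could look roughly like the sketch below. It is only an illustration under stated assumptions: "encoded correctly" is taken to mean "as declared in the XML declaration, otherwise UTF-8", and the names decode_raw and XML_DECL_ENCODING are made up for the example (they are not part of calibre or chardet). The raw bytes would come from wherever the plugin reads them, such as the Container's get_raw mentioned above.

```python
import codecs
import re

try:
    import chardet  # third-party detector, used only as a last-resort guess
except ImportError:
    chardet = None

# Hypothetical helper pattern: pulls the encoding out of an XML declaration.
XML_DECL_ENCODING = re.compile(
    rb'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']'
)


def decode_raw(raw, default="utf-8"):
    """Return (text, encoding_used) for a bytes object.

    Assumption: trust the declared encoding first, then `default`,
    and only guess with chardet when both raise an exception.
    """
    # Drop a UTF-8 byte order mark so the declaration can be matched.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]

    candidates = []
    match = XML_DECL_ENCODING.match(raw[:200])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append(default)

    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding name: try the next one

    # Every trusted candidate failed, so fall back to statistical detection.
    if chardet is not None:
        guess = chardet.detect(raw).get("encoding")
        if guess:
            try:
                return raw.decode(guess), guess
            except (UnicodeDecodeError, LookupError):
                pass

    # Last resort: never raise, but make undecodable bytes visible.
    return raw.decode(default, errors="replace"), default
```

For the file quoted above, the declaration says utf-8 and the bytes decode cleanly as UTF-8, so chardet is never consulted and the ISO-8859-2 misdetection does not come into play. A file whose declaration is wrong but still decodes without error would slip through unnoticed, which is the limitation the question is asking about.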

user_none · 02-12-2013, 08:58 PM

There is no reliable way to detect a file's character encoding. Your best bet is to decode and check manually or, if you know the encoding, to specify it explicitly.

In your case you could check for characters like Ă and ¨ in the text and use that as a trigger that the encoding was wrong. However, these are valid characters for that encoding, so this technique will only work in cases where you know those characters will not be present in the text. If this is a novel in a specific language, this would work the majority of the time. But it is not a foolproof system, and it is not a good general-purpose method.
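
The marker-character check suggested here might be sketched as follows. This is a rough illustration, not a general solution: the marker set and the names SUSPECT_CHARS, looks_misdecoded and decode_with_check are assumptions made for the example, and the set only makes sense when the text's language (Italian in this thread) is known not to use those characters.

```python
# Hypothetical marker set: characters that ISO-8859-2 produces from the
# bytes of UTF-8 encoded accented letters, and that Italian text is not
# expected to contain.
SUSPECT_CHARS = frozenset("Ă¨š")


def looks_misdecoded(text, suspects=SUSPECT_CHARS):
    """Return True if the decoded text contains any marker character."""
    return any(ch in suspects for ch in text)


def decode_with_check(raw, guessed, fallback="utf-8"):
    """Decode with the guessed encoding, retrying with `fallback` on markers.

    Returns (text, encoding_used). If the fallback cannot decode the bytes,
    the guessed decoding is kept, because the markers may be legitimate.
    """
    text = raw.decode(guessed)
    if looks_misdecoded(text):
        try:
            return raw.decode(fallback), fallback
        except UnicodeDecodeError:
            pass  # not valid in the fallback encoding: keep the guess
    return text, guessed
```

With the bytes of "più è" encoded as UTF-8 and a guess of ISO-8859-2, the first decode yields "piĂš Ă¨", the markers trigger the retry, and the UTF-8 result is returned. Text that really is ISO-8859-2 and really contains Ă is also flagged, which is the caveat raised above; the sketch keeps the guess in that case only when the bytes are not valid UTF-8.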
