|02-12-2013, 07:50 PM||#1|
Generally Awesome Person
Join Date: Jan 2013
Location: San Francisco Bay Area
Device: Kindle Paperwhite 2
Character detection issues
I'm trying to find a way to detect the proper encoding of a file. In most cases the detection is correct, but there are a few cases with short files where it isn't so good. I know that short files generally yield bad results, but I'm hoping there's something I'm doing wrong or something that could be improved.
In my plugin, I'm using the Container's get_raw function to get the raw data for files. The file I'm running into trouble with (I'm sure there are others I haven't noticed yet) has these contents:
The chardet library detects this as ISO-8859-2 with confidence 85.5%, but decoding as ISO-8859-2 gives this:
Obviously this is wrong and the characters are corrupted, but how can I tell (if I can at all) when a "successful" decode produces "bad" output?
For now, I'm going to proceed simply assuming that files are encoded correctly and only use chardet as a fallback in the event that an exception is raised. Is there a better way to handle this to improve character set detection, or is this about as good as I can expect because the sample is so small?
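To make that fallback plan concrete, here is a minimal sketch of "assume the expected encoding, and only ask a detector when decoding raises". The function name decode_with_fallback, the UTF-8 default, and the latin-1 last resort are assumptions for illustration, not anything from the plugin. chardet.detect is only imported if no detector is passed in.

```python
def decode_with_fallback(raw, assumed="utf-8", detect=None, last_resort="latin-1"):
    """Return (text, encoding_used). Hypothetical helper, not a calibre API."""
    # First, trust the assumed encoding; a clean decode is taken at face value.
    try:
        return raw.decode(assumed), assumed
    except UnicodeDecodeError:
        pass

    # Only now consult a detector. chardet is optional here (assumption).
    if detect is None:
        try:
            import chardet
            detect = chardet.detect
        except ImportError:
            detect = None

    guess = detect(raw).get("encoding") if detect else None
    if guess:
        try:
            return raw.decode(guess), guess
        except (UnicodeDecodeError, LookupError):
            pass  # bad guess or unknown codec name; fall through

    # latin-1 maps every byte to a character, so this cannot raise.
    return raw.decode(last_resort, errors="replace"), last_resort
```

Note that this only catches files that fail to decode at all. It does nothing for the case above, where the wrong encoding decodes without an exception.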
|02-12-2013, 08:58 PM||#2|
Sigil & calibre developer
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
There is no good way to reliably detect a file's character encoding. Your best bet is to decode and check the result manually or, if you know the encoding, specify it explicitly.
In your case you could check for characters like Ă and ¨ in the text and use that as a trigger that the encoding was wrong. However, these are valid characters for that encoding, so this technique will only work in cases where you know those characters will not be present in the text. If this is a novel in a specific language, this would work the majority of the time. But it is not a foolproof system and it is not a good general purpose method.
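A rough sketch of that character check, with the same caveat that it only helps when you know the text should not contain these characters. The name looks_misdecoded and the sample word are made up for illustration. The sample shows one way Ă and ¨ can appear together: the UTF-8 bytes for è are 0xC3 0xA8, which ISO-8859-2 reads as Ă followed by ¨.

```python
# Characters that should not occur in the expected text (assumption:
# adjust this tuple for the language of the book).
SUSPECT_CHARS = ("\u0102", "\u00a8")  # Ă and ¨


def looks_misdecoded(text, suspects=SUSPECT_CHARS):
    """Return True if any suspect character appears in the decoded text."""
    return any(ch in text for ch in suspects)


# Hypothetical sample: UTF-8 bytes wrongly decoded as ISO-8859-2.
raw = "tr\u00e8s".encode("utf-8")      # "très" as UTF-8 bytes
wrong = raw.decode("iso-8859-2")       # decodes without error, but garbled
right = raw.decode("utf-8")

print(looks_misdecoded(wrong))  # the garbled text contains Ă and ¨
print(looks_misdecoded(right))  # the correct text contains neither
```

If the check fires, you could retry with another candidate encoding such as UTF-8 and keep whichever result passes.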