#1
High Fantasy Bibliophage
Posts: 160
Karma: 52726
Join Date: Jan 2013
Location: New Brunswick, Canada
Device: Kobo Glo
Character detection issues
In my plugin, I'm using the Container's get_raw function to get the raw data for files. The file I'm having trouble with (I'm sure there are others I haven't noticed yet) has these contents: Spoiler:
The chardet library detects this as ISO-8859-2 with 85.5% confidence, but decoding as ISO-8859-2 gives this: Spoiler:
Obviously this is wrong: the characters are corrupted. But how can I tell (if I can at all) when a "successful" decode produces "bad" output? For now, I'm going to assume that files are encoded correctly and use chardet only as a fallback when decoding raises an exception. Is there a better way to improve character set detection, or is this about as good as I can expect because the sample is so small?
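The fallback approach described above can be sketched roughly as follows. This is only an illustration under assumptions: `decode_with_fallback` and its `declared` parameter are hypothetical names, the raw bytes are assumed to come from something like get_raw, and chardet is treated as optional so the sketch still runs without it.

```python
try:
    import chardet  # third-party detector; optional in this sketch
except ImportError:
    chardet = None


def decode_with_fallback(raw, declared="utf-8"):
    """Hypothetical helper: return (text, encoding_used).

    Strictly decode with the declared encoding first and only ask
    chardet for a guess when that raises.
    """
    try:
        return raw.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        pass  # declared encoding is wrong or unknown

    if chardet is not None:
        guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence'
        enc = guess.get("encoding")
        if enc:
            try:
                return raw.decode(enc), enc
            except (UnicodeDecodeError, LookupError):
                pass  # the guess did not decode cleanly either

    # Last resort: lossy decode so the caller always gets a str back.
    return raw.decode("utf-8", errors="replace"), "utf-8"


if __name__ == "__main__":
    text, used = decode_with_fallback("caf\u00e9".encode("utf-8"))
    print(used, text)
```

Note that this only catches decodes that fail outright. A wrong guess that happens to decode without an exception still passes through unnoticed, which is the problem the question is about.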
#2
Sigil & calibre developer
Posts: 2,384
Karma: 848775
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
There is no reliable way to detect a file's character encoding. Your best bet is to decode and check the result manually or, if you know the encoding, to specify it explicitly.
In your case you could check for characters like Ă and ¨ in the text and treat their presence as a sign that the encoding was wrong. However, these are valid characters in that encoding, so the technique only works when you know they will not appear in the text. If this is a novel in a specific language, it would work the majority of the time. But it is not foolproof, and it is not a good general-purpose method.
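A minimal sketch of that marker check, assuming the text is known not to contain Ă or ¨ legitimately. The names `SUSPICIOUS`, `suspicious_ratio`, `looks_misdecoded`, and the `max_ratio` threshold are all hypothetical, and the sample string is made up for illustration rather than taken from the file in question.

```python
# Characters that tend to appear when UTF-8 bytes are decoded as ISO-8859-2.
# Written as escapes: U+0102 is 'Ă', U+00A8 is '¨'.
SUSPICIOUS = frozenset("\u0102\u00a8")


def suspicious_ratio(text, markers=SUSPICIOUS):
    """Fraction of characters in text that are marker characters."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text if ch in markers)
    return hits / len(text)


def looks_misdecoded(text, markers=SUSPICIOUS, max_ratio=0.0):
    """Heuristic only: True if markers exceed the tolerated ratio.

    max_ratio=0.0 flags any occurrence; raise it if the markers may
    appear occasionally in legitimate text.
    """
    return suspicious_ratio(text, markers) > max_ratio


if __name__ == "__main__":
    raw = "tr\u00e8s bien".encode("utf-8")
    wrong = raw.decode("iso-8859-2")  # decodes without error, but garbled
    right = raw.decode("utf-8")
    print(looks_misdecoded(wrong), looks_misdecoded(right))
```

As noted above, this would misfire on genuine ISO-8859-2 text that uses those characters (Romanian, for example), so the marker set and threshold have to be chosen per language.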