02-12-2013, 06:50 PM   #1
jgoguen
Character detection issues

I'm trying to find a way to detect the proper encoding of a file. In most cases the detection is correct, but there are a few cases with short files where it isn't so good. I know that short files generally yield bad results, but I'm hoping there's something I'm doing wrong or something that could be improved.

In my plugin, I'm using the Container's get_raw function to get the raw data for files. The file I'm running into trouble with (I'm sure there are others I haven't noticed yet) has these contents:
Spoiler:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="../Styles/style001.css" rel="stylesheet" type="text/css" />
<title></title>
</head>
<body>
<h1 id="heading_id_1">Ringraziamenti</h1>
<p class="norm">Con amore e tantissimi grazie a mia madre, Claudia, e a mia sorella Jemima, per l'aiuto e il sostegno. Infinita riconoscenza a tutti coloro che mi hanno fornito riscontri e consigli, in particolare Scott Bicheno, Max Schaefer, Simon Kavanagh e Oliver Cheetham.</p>
<p class="norm">Profondo amore e gratitudine a Emma Bircham, ancora e per sempre.</p>
<p class="norm">Grazie a tutti quelli della Macmillan, soprattutto al mio editor Peter Lavery per l'incredibile appoggio. E immensa gratitudine a Mic Cheetham, che mi ha aiutato più di quanto sappia esprimere.</p>
<p class="norm">Non ho spazio sufficiente a ringraziare tutti gli scrittori che hanno esercitato una grande influenza su di me, ma voglio menzionarne due la cui opera è costante fonte di ispirazione e stupore. Dunque, a M. John Harrison, e alla memoria di Mervyn Peake, la mia umile e sentita riconoscenza.</p>
<p class="norm">Senza di loro non avrei mai potuto scrivere questo libro.</p>
<p class="norm">&nbsp;</p>
</body>
</html>


The chardet library detects this as ISO-8859-2 with confidence 85.5%, but decoding as ISO-8859-2 gives this:
Spoiler:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="../Styles/style001.css" rel="stylesheet" type="text/css" />
<title></title>
</head>
<body>
<h1 id="heading_id_1">Ringraziamenti</h1>
<p class="norm">Con amore e tantissimi grazie a mia madre, Claudia, e a mia sorella Jemima, per l'aiuto e il sostegno. Infinita riconoscenza a tutti coloro che mi hanno fornito riscontri e consigli, in particolare Scott Bicheno, Max Schaefer, Simon Kavanagh e Oliver Cheetham.</p>
<p class="norm">Profondo amore e gratitudine a Emma Bircham, ancora e per sempre.</p>
<p class="norm">Grazie a tutti quelli della Macmillan, soprattutto al mio editor Peter Lavery per l'incredibile appoggio. E immensa gratitudine a Mic Cheetham, che mi ha aiutato piĂš di quanto sappia esprimere.</p>
<p class="norm">Non ho spazio sufficiente a ringraziare tutti gli scrittori che hanno esercitato una grande influenza su di me, ma voglio menzionarne due la cui opera Ă¨ costante fonte di ispirazione e stupore. Dunque, a M. John Harrison, e alla memoria di Mervyn Peake, la mia umile e sentita riconoscenza.</p>
<p class="norm">Senza di loro non avrei mai potuto scrivere questo libro.</p>
<p class="norm">&nbsp;</p>
</body>
</html>


This is obviously wrong: the accented characters are corrupted. But how can I tell (if I can at all) when a "successful" decode produces "bad" output?
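One possible check, sketched here under assumptions and not taken from calibre or chardet: ISO-8859-2 assigns a character to every byte value, so decoding with it never raises, which is why the bad decode looks "successful". If the decoded text can be re-encoded with the guessed codec and those bytes then decode cleanly as UTF-8 to something different, the file was probably UTF-8 all along. The helper name `undo_misdecode` is hypothetical, and the code assumes Python 3.

```python
def undo_misdecode(text, guessed='iso-8859-2'):
    """Return the UTF-8 reading of `text` if the round trip works, else None.

    `text` is assumed to have been produced by bytes.decode(guessed).
    """
    try:
        repaired = text.encode(guessed).decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The bytes are not valid UTF-8, so the guess may have been right.
        return None
    # Pure ASCII survives unchanged and tells us nothing either way.
    return repaired if repaired != text else None


# UTF-8 bytes misread as ISO-8859-2, as in the example above.
garbled = 'più'.encode('utf-8').decode('iso-8859-2')
print(undo_misdecode(garbled))
```

This only catches UTF-8 that was misread as a single-byte encoding. A short file in a genuinely different single-byte encoding will still pass through unnoticed.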

For now, I'm going to proceed on the assumption that files are encoded correctly, and use chardet only as a fallback if decoding raises an exception. Is there a better way to handle this to improve character set detection, or is this about as good as I can expect because the sample is so small?
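A minimal sketch of that fallback order, assuming Python 3. The helper names `declared_encoding` and `decode_with_fallback` are hypothetical, not part of calibre. The optional `detect` argument is expected to behave like `chardet.detect`, which returns a dict with an `encoding` key.

```python
import re

# Encoding pseudo-attribute of an XML declaration at the start of the data.
_XML_DECL = re.compile(br'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(raw):
    """Return the encoding named in the XML declaration, or None."""
    m = _XML_DECL.match(raw[:200])
    return m.group(1).decode('ascii') if m else None


def decode_with_fallback(raw, detect=None):
    """Decode strictly with the declared encoding, then UTF-8, then a detector.

    Returns a (text, encoding_used) tuple.
    """
    candidates = []
    declared = declared_encoding(raw)
    if declared:
        candidates.append(declared)
    candidates.append('utf-8')
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding name, try the next one
    if detect is not None:
        guess = detect(raw).get('encoding')
        if guess:
            try:
                return raw.decode(guess), guess
            except (UnicodeDecodeError, LookupError):
                pass
    # Last resort: keep going, with undecodable bytes replaced by U+FFFD.
    return raw.decode('utf-8', errors='replace'), 'utf-8'
```

With chardet installed, the call would be `decode_with_fallback(raw, detect=chardet.detect)`. For the file above, the declared utf-8 decodes cleanly, so the detector is never consulted.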
02-12-2013, 07:58 PM   #2
user_none
There is no reliable way to detect a file's character encoding. Your best bet is to decode and check the result manually or, if you know the encoding, to specify it explicitly.

In your case, you could check for characters like Ă and ¨ in the decoded text and treat them as a sign that the encoding was wrong. However, these are valid characters in that encoding, so this technique only works when you know those characters will not appear in the real text. If this is a novel in a specific language, it would work the majority of the time. But it is not foolproof, and it is not a good general-purpose method.
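The marker-character check described above could look roughly like this. The character set is an assumption taken from this thread's example (UTF-8 Italian text misread as ISO-8859-2) and would need tuning for other languages. `looks_misdecoded` is a hypothetical name, and the code assumes Python 3.

```python
# Characters that appear when UTF-8 accented Italian letters are decoded as
# ISO-8859-2. Illustrative only: all three are legitimate in that encoding.
SUSPECT_CHARS = frozenset('Ă¨š')


def looks_misdecoded(text, suspects=SUSPECT_CHARS):
    """Return True if the text contains any character we never expect to see."""
    return any(ch in suspects for ch in text)


garbled = 'è più'.encode('utf-8').decode('iso-8859-2')
print(looks_misdecoded(garbled))   # the misdecoded text contains Ă
print(looks_misdecoded('è più'))   # correctly decoded Italian does not
```

As noted, this gives false positives on text that really uses those characters, such as Romanian (Ă) or Czech (š).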