Certain hyphens being removed on HTML to ePub
Hi,
I've been converting HTML files to ePub using Calibre, and then transferring them to my iPod Touch to read in Stanza. The problem is that certain dashes are being removed. Hyphenated words such as "one-hundred" seem to make it through OK, but the long dashes that break up a sentence are being removed, as in: "When you recover - and there is no "if"; you wouldn't be there if they didn't know they could fix you - you're still in the army".
If I take the original HTML file and use Stanza's desktop converter and convert to epub, all of the dashes survive the transferal.
I sent the epub file that Calibre created to Lexcycle, and this was their response:
"I don't see the dashes when I open the xhtml document in Safari. Since Stanza uses the same renderer as Safari, that's a good browser to preview how documents will look in Stanza.
The problem is that your dashes are being represented by decimal 151, but you have declared that your document's encoding is UTF-8. 151 is em dash only for the windows-1252 (i.e. "latin1") encoding. You could fix this by using the proper UTF-8 encoding for the em dash (decimal 8212). But the easiest solution would be to just represent it using the HTML entity encoding of "&mdash;", which will allow you to bypass any character encoding issues altogether."
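For anyone who wants to try the fix Lexcycle describes, here is a minimal Python sketch. It assumes the original HTML file really is windows-1252 encoded; the function name and the sample bytes are made up for illustration and are not part of Calibre or Stanza.

```python
def fix_cp1252_dashes(raw: bytes) -> str:
    """Decode windows-1252 bytes and write em dashes as the &mdash; entity.

    Assumption: the input is windows-1252, where byte 151 (0x97) is the
    em dash. Bytes that are undefined in that encoding are replaced
    rather than raising an error.
    """
    text = raw.decode("cp1252", errors="replace")
    # Byte 151 has now become U+2014 (decimal 8212); swap it for the entity
    # so the result no longer depends on the declared character encoding.
    return text.replace("\u2014", "&mdash;")


# Hypothetical sample: the sentence from the post with byte 151 as the dash.
sample = b"When you recover \x97 and there is no \"if\" \x97 you're still in the army"
fixed = fix_cp1252_dashes(sample)
print(fixed)
```

The returned string can then be saved as UTF-8 (for example with `open(path, "w", encoding="utf-8")`) so that the file matches its declared encoding before it goes through the converter.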
Hopefully that will help with fixing the conversion issue? I don't know...
I've also attached the original HTML file for analysis...