01-06-2011, 02:36 AM | #1 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
<TT>–</TT>
Same regex book as in another thread, but this question got buried at the end of that thread, so I am reposting it for help:
- I am losing the hyphen from this line, "using the – metacharacter", which is a copy + paste from the CHM source. No matter how I do the conversion to epub or to mobi, I end up with this when I view the output: "using the metacharacter". I've tried ticking transliterate and tried cp1252 encoding. Using view source on the CHM I see this:
Code:
using the <TT>–</TT> metacharacter
But if I convert to mobi with the same settings and send to Kindle, then I see a question mark inside a box character where the dash should be! How do I get the line to convert correctly into epub?
PS - what is even more puzzling is that elsewhere in the book, what seems to be the same HTML DOES convert OK, e.g. this line converted OK into epub: "- (hyphen) is a special metacharacter". Source code followed by epub code, all correct:
Code:
class="docText"><TT>-</TT> (hyphen) is a special metacharacter
Code:
<p class="docText1"><tt class="calibre13">-</tt> (hyphen)
Last edited by cybmole; 01-06-2011 at 03:01 AM. |
01-06-2011, 04:54 AM | #2 |
Guru
Posts: 695
Karma: 822675
Join Date: May 2010
Device: Kobo Aura, Nokia Lumia 920 (Freda)
|
Calibre doesn't like soft hyphens and tends to strip them out. If you can edit the chm source, try changing the character that's being used to a different type of hyphen.
|
01-06-2011, 04:59 AM | #3 | |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Quote:
|
|
01-06-2011, 05:31 AM | #4 |
Guru
Posts: 695
Karma: 822675
Join Date: May 2010
Device: Kobo Aura, Nokia Lumia 920 (Freda)
|
No idea, as I've never had to do that. Google shows some options. You could also try using Calibre's debug output (on the Conversion dialog, choose the Debug section on the left and give it a path). That will save the intermediate output steps that Calibre goes through during conversion. The resulting HTML might not yet have had the soft hyphens removed, in which case you could take a copy of the HTML output, edit it appropriately, and use that as input for an epub conversion.
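To check whether the character survives each intermediate step, the HTML files in the debug folder can be scanned for soft hyphens and non-ASCII dashes. The following is only a rough sketch: the function names and the folder path are made up for illustration, and it assumes the debug HTML is UTF-8 encoded.

```python
import os
import unicodedata


def find_special_dashes(text):
    """Return (offset, codepoint, name) for each soft hyphen or non-ASCII dash.

    "Pd" is Unicode's dash-punctuation category; the plain ASCII
    hyphen-minus is in it too, so it is excluded explicitly.
    """
    hits = []
    for offset, ch in enumerate(text):
        is_soft_hyphen = ch == "\u00ad"
        is_other_dash = unicodedata.category(ch) == "Pd" and ch != "-"
        if is_soft_hyphen or is_other_dash:
            name = unicodedata.name(ch, "UNNAMED")
            hits.append((offset, "U+%04X" % ord(ch), name))
    return hits


def scan_debug_folder(root, encoding="utf-8"):
    """Walk `root` and report special dashes found in each HTML file.

    The UTF-8 default is an assumption; pass another encoding if needed.
    """
    report = {}
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if fname.lower().endswith((".html", ".xhtml", ".htm")):
                path = os.path.join(dirpath, fname)
                with open(path, encoding=encoding, errors="replace") as fh:
                    hits = find_special_dashes(fh.read())
                if hits:
                    report[path] = hits
    return report
```

Calling `scan_debug_folder("/path/to/debug")` (a hypothetical path) on the folder given in the Debug section would show which stage still contains the character and which stage has lost it.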
|
01-06-2011, 06:15 AM | #5 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
Well, I can patch it up manually with Sigil. I think it is just 1 occurrence in 1 book.
I was hoping to learn how to auto-fix it, but that is looking unlikely. The epub conversion is much easier to read/scroll through on PC than the original .chm. Any idea why calibre strips out this soft hyphen (if that is what it is) only on convert to epub, and not on convert to mobi? |
01-06-2011, 11:34 AM | #6 |
creator of calibre
Posts: 43,896
Karma: 22666668
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Because many readers do not handle soft hyphens correctly.
|
01-06-2011, 01:33 PM | #7 |
Groupie
Posts: 155
Karma: 200000
Join Date: Dec 2009
Location: Britania
Device: Android
|
Huh? Why would this be a soft hyphen?
Soft hyphens are used to indicate possible hyphenation points within a word, e.g. count-ing. The idea is they'll only be rendered if the 'reader has to break the word at that point. If this was a soft hyphen, there'd be no reason to expect it to display at all, because it's not inside a word! Python says that the characters in this thread are not soft hyphens:
>>> import unicodedata
>>> unicodedata.name(u"-")
'HYPHEN-MINUS'
>>> unicodedata.name(u"–")
'EN DASH'
Maybe something mangled it on the way, but I can't imagine why it would use an en-dash instead of a normal hyphen. |
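For comparison, here is a small sketch that puts the three candidate characters side by side, including the soft hyphen itself. The helper name `describe` is invented for this example. The general category is the useful part: the two visible dashes are punctuation (`Pd`), while the soft hyphen is an invisible format character (`Cf`).

```python
import unicodedata


def describe(ch):
    """Return the code point, Unicode name and general category of one character."""
    return "U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch)


# Escapes are used so the characters cannot be mangled by copy and paste.
for ch in ("-", "\u2013", "\u00ad"):
    print(describe(ch))

# ('U+002D', 'HYPHEN-MINUS', 'Pd')
# ('U+2013', 'EN DASH', 'Pd')
# ('U+00AD', 'SOFT HYPHEN', 'Cf')
```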
01-06-2011, 04:45 PM | #8 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
You're right - the character you initially posted is an en-dash, not a soft-hyphen. That said, PHPBB may be doing something as well - you should double-check the hyphen/dash displayed in the original source doc. Calibre often strips soft-hyphens during conversion, but en-dashes should be preserved.
Wikipedia article to explain which type to use when: http://en.wikipedia.org/wiki/Dash |
01-07-2011, 07:41 AM | #9 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
What I posted was obtained by view source (a right-click option within the displayed CHM page), then copy and paste to the thread,
so it should be an accurate reproduction of what is in the .chm source. I buy into the idea that it is actually an en-dash. It looks like one, and it explains what I see on Kindle: Kindle does not do en-dash, so it does the question-mark-in-a-box substitution. That leaves us with the question of why calibre's chm to epub conversion is discarding an en-dash. I don't know how to build a simple test .chm. Maybe it's a bug which is specific to .chm sources. If it occurred with, say, HTML source, it would surely have been noticed and reported already. |
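Building a test .chm is awkward, but a minimal HTML test file is easy to generate and would at least show whether the problem is specific to .chm input. The sketch below writes the two lines quoted at the top of the thread into a small HTML file. The file names and the `write_test_file` helper are invented for this example, and the windows-1252 variant is included only because cp1252 was one of the encodings tried earlier; whether the CHM actually uses that encoding is not known.

```python
# Minimal page reusing the two lines quoted in the first post.
TEST_HTML = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset={charset}"><title>dash test</title></head>'
    '<body><p class="docText">using the <TT>\u2013</TT> metacharacter</p>'
    '<p class="docText"><TT>-</TT> (hyphen) is a special metacharacter</p>'
    '</body></html>'
)


def write_test_file(path, charset):
    """Write the test page to `path` in the given charset and return the bytes."""
    data = TEST_HTML.format(charset=charset).encode(charset)
    with open(path, "wb") as fh:
        fh.write(data)
    return data


if __name__ == "__main__":
    # The en dash is three bytes in UTF-8 but the single byte 0x96 in windows-1252.
    write_test_file("dash_test_utf8.html", "utf-8")
    write_test_file("dash_test_cp1252.html", "windows-1252")
```

Each file can then be converted with calibre's command-line tool, e.g. `ebook-convert dash_test_utf8.html dash_test_utf8.epub`, and the output checked for the dash.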