08-03-2020, 08:19 AM | #1 |
Junior Member
Posts: 6
Karma: 10
Join Date: Aug 2020
Device: Kindle
|
ASCII or HTML
Hi,
I'm compiling my first ePub to submit to Kindle. I used the website word2cleanhtml.com, which converted curly quotes to HTML. Later, in Sigil, I used the Mend & Prettify all HTML files tool, which stripped the HTML for punctuation and quotes.
I was under the impression that HTML should be used for most, if not all, such punctuation, and that, if it's not HTML, then such characters would be in ASCII format. I don't mean to question Sigil's methods, but will my ePub be OK to submit to Kindle like this? At a glance, the only HTML formatting is for paragraphs, headings... There is no HTML for any punctuation or 'special' characters (not that there are many special characters).
Apologies for the newbie question. I did search the forum, and read the 'NotJohn Guide...', but if there's a clear answer I couldn't find it. Sincerely appreciate any clarification here! |
08-03-2020, 09:07 AM | #2 |
Sigil Developer
Posts: 7,506
Karma: 5433350
Join Date: Nov 2009
Device: many
|
A few things:
- Epub xhtml files support full Unicode for text content. There is no restriction to ASCII for text content at all.
- Many people use what are called numeric or named entities to encode special characters such as smart quotes and non-breaking spaces, but this is not required.
Unless you tell Sigil via its Preserve Entities preference setting to keep them, Sigil will simply use the actual Unicode character. Nothing is lost. If you want the entities to be put back, just add the entities you want to use to Sigil's Preserve Entities list in Preferences and run Mend again.
Note that named entities (apart from the few predefined by xml) are not permitted in epub3, which requires numeric entities if you decide to keep them.
If "entities" are what you mean by special characters, there is no need to use them for Kindle or for epub, but they do make some special whitespace characters easier to see. |
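To illustrate the "nothing is lost" point, here is a minimal sketch, assuming Python 3 and its standard `html` module. It is not what Sigil runs internally; it only shows that a named entity, a numeric entity and the literal Unicode character all resolve to the same text. The sample strings are made up for the illustration.

```python
import html

# The same sentence written three ways: with named entities,
# with numeric entities, and with the literal Unicode characters.
named = "&ldquo;Hello&rdquo;&nbsp;world"
numeric = "&#8220;Hello&#8221;&#160;world"
literal = "\u201cHello\u201d\u00a0world"

# html.unescape resolves both kinds of entity to the characters they stand for.
decoded_named = html.unescape(named)
decoded_numeric = html.unescape(numeric)

print(decoded_named == literal)    # True
print(decoded_numeric == literal)  # True
```

Whichever form the file uses, a reading system ends up with the same characters, so the choice mainly affects how readable the source is in an editor.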
08-03-2020, 10:09 AM | #3 | |
Junior Member
Posts: 6
Karma: 10
Join Date: Aug 2020
Device: Kindle
|
Thanks KevinH, much appreciated.
To correct my terminology (for my own sake, and hopefully to benefit others):
Unicode is a superset of ASCII. I should have referred to Unicode instead of ASCII. ePub 2.0.1 requires Unicode UTF-8 or UTF-16. (Source)
Yes, entity is what I should have said, not 'special character': "An HTML entity is a piece of text ("string") that begins with an ampersand (&) and ends with a semicolon (;). Entities are frequently used to display reserved characters (which would otherwise be interpreted as HTML code), and invisible characters (like non-breaking spaces). You can also use them in place of other characters that are difficult to type with a standard keyboard." (Source)
One thing I don't quite understand: Quote:
Named entities are HTML, and Numbered entities are Unicode, as listed here? Again, sincere thanks, much appreciated. |
|
08-03-2020, 10:50 AM | #4 |
Sigil Developer
Posts: 7,506
Karma: 5433350
Join Date: Nov 2009
Device: many
|
Named entities use a mnemonic character string instead of a numeric code. So the following is a named entity (ignore the spaces):
& n b s p ;
The equivalent numeric entity can be written in decimal or hexadecimal notation as follows:
& # 1 6 0 ; or & # x A 0 ;
(In xhtml the "x" of the hexadecimal form must be lower case.)
The only named entities allowed in the xhtml5 used by epub3 are the original xml entities: & a m p ; & l t ; & g t ; & q u o t ; and & a p o s ; All others must be in numeric form. |
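The relationship between the three notations can be sketched as follows, assuming Python 3 and its standard `html.entities` table (which covers the HTML 4 names only). The helper name `entity_forms` is made up for this illustration.

```python
from html.entities import codepoint2name


def entity_forms(ch):
    """Return the named, decimal and hexadecimal entity for one character."""
    cp = ord(ch)                    # the Unicode code point, e.g. 160
    name = codepoint2name.get(cp)   # None if HTML 4 defines no name for it
    return {
        "named": f"&{name};" if name else None,
        "decimal": f"&#{cp};",
        "hex": f"&#x{cp:X};",       # lower-case "x", as xhtml requires
    }


nbsp = entity_forms("\u00a0")
print(nbsp)
# {'named': '&nbsp;', 'decimal': '&#160;', 'hex': '&#xA0;'}
```

The decimal and hexadecimal forms are the same code point written in two bases (160 is A0 in hex); the named form is just a lookup from a fixed table.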
08-03-2020, 11:19 AM | #5 |
Junior Member
Posts: 6
Karma: 10
Join Date: Aug 2020
Device: Kindle
|
Fantastic, many thanks.
I didn't realise HTML entities could be written in hex. And I'll need to spend some time getting my head around the following. According to this website (possibly a useful resource, ePub aside!), the choices for character codes are:
- Unicode
- HTML Code
- HTML Entity (so this is different to HTML code)
- HEX code
- CSS Code
I've just realised that the website word2cleanhtml.com has a box to tick where it says "Replace non-ascii with HTML entities", but it's actually replacing them with HTML code:
& # 8220 ;
Who knew!? I'm up and running anyway. Thanks so much for your responses. |
08-03-2020, 11:25 AM | #6 |
Sigil Developer
Posts: 7,506
Karma: 5433350
Join Date: Nov 2009
Device: many
|
Not quite. Their "html code" is actually a numeric entity and their "entity" is actually a named entity. For html5 and therefore epub3, no named entities are allowed beyond the predefined xml ones. Numeric entities are allowed but not needed.
Under epub2, both named and numeric entities are allowed. The file itself should be utf-8 encoded, but Sigil handles that conversion for you in either case. |
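For anyone who does want to keep entities in an epub3 file, the rule above (named entities out, numeric ones in) can be sketched by hand as below. This assumes Python 3 and its standard `html.entities` table of HTML 4 names; `named_to_numeric` and the sample markup are hypothetical, and this is not how Sigil's Preserve Entities setting works internally.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML; these are left as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches things like &nbsp; or &ldquo; (but not numeric forms such as &#160;).
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Replace HTML named entities with decimal numeric entities."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)   # predefined or unknown: leave untouched
        return f"&#{name2codepoint[name]};"
    return NAMED_ENTITY.sub(replace, markup)


sample = "<p>&ldquo;Fish &amp; chips&rdquo;&nbsp;&hellip;</p>"
print(named_to_numeric(sample))
# <p>&#8220;Fish &amp; chips&#8221;&#160;&#8230;</p>
```

Unknown names are deliberately left alone rather than guessed at, so a validator can still flag them.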
08-03-2020, 01:22 PM | #7 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
Your code example looks OK. Are you saying special characters or formatting have been stripped out?
I suspect you might want to check your use of punctuation though. There's a couple of things there which just MIGHT be ok had you shown us the whole paragraph for context, but I'm afraid they probably aren't. |
08-03-2020, 05:58 PM | #8 |
Running with scissors
Posts: 1,552
Karma: 14325282
Join Date: Nov 2019
Device: none
|
I'm guessing by "stripped out" he means that the left curly quotes and right curly quotes were originally named (more likely is my guess) or numeric entities and Sigil's mend and prettify converted them to the Unicode characters. When he says punctuation he may mean dashes, ellipses, and such; I have Sigil preserve those so I can tell what's what.
|
08-04-2020, 05:48 AM | #9 |
Junior Member
Posts: 6
Karma: 10
Join Date: Aug 2020
Device: Kindle
|
Apologies, my saying Sigil 'stripped' the entities isn't fair on Sigil.
To clarify: Sigil correctly converted the HTML numeric entities for curly quotes, and more, into standard (Unicode, I guess) characters. Word2cleanhtml.com's "Replace non-ascii with HTML entities" option had previously converted curly quotes into HTML numeric entities. |
08-13-2020, 06:31 AM | #10 |
mostly an observer
Posts: 1,515
Karma: 987654
Join Date: Dec 2012
Device: Kindle
|
I've uploaded twenty or so books using the route you describe (Word to Word2Clean to Sigil) and never had a problem. (Yes, it was surprising when Sigil began rendering quotes as quotes, but neither on Amazon nor any other bookseller have I ever encountered a problem. Perhaps I should add that I use epub2.)
|