Typos in ebooks - Page 14

raac · 05-20-2011, 09:02 AM

Yes, I think it does have something to do with Sony not supporting the Unicode used in the book. It translates some symbols correctly but not all of them. The book looks ok on my computer's screen. It is for these reasons that I listed that point as number four. The lack of images and the issues with their references are more serious. Still, if there is a Unicode compatibility issue I don't see why publishers can't use the ASCII character codes so we don't have this problem.

EDIT:
I should say that the images definitely aren't there because I've exploded the e-pub and looked for them. In the past I've discovered missing images this way. I too am interested in what Penguin will say...

Jellby · 05-20-2011, 01:26 PM

Quote:

Originally Posted by raac

Still, if there is a unicode compatibility issue I don't see why publishers can't use the ASCII character codes so we don't have this problem.

The problem is not the encoding of the book, it is the lack of an appropriate glyph in whatever font is used.

What do you mean by "ASCII character codes"? I guess it's one of these:

1) Use named or numerical entities instead of the Unicode character. For instance, instead of "Peña" write "Pe&ntilde;a". This does not solve anything: no matter whether you use "ñ" or "&ntilde;", the font does not have the character, and it shows a box or a question mark instead.

2) "Downgrade" to some similar character that is in the ASCII set, removing diacritics etc. For instance, instead of "Peña" write "Pena". This risks being completely wrong and misleading: "peña" means rock or boulder, while "pena" means pain or sorrow. There's a reason why diacritics exist in most languages.
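To make the two options above concrete, here is a minimal Python sketch. It is not from the thread, and the helper name `strip_diacritics` is invented for illustration. It shows that a named entity, a numeric entity, and the literal character all decode to the same code point (so the font still has to supply the same glyph), and that stripping diacritics turns one word into another.

```python
import html
import unicodedata

# Named entity, numeric entity, and literal character: three spellings of one code point.
named = html.unescape("Pe&ntilde;a")
numeric = html.unescape("Pe&#241;a")
literal = "Pe\u00f1a"
print(named == numeric == literal)

def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Option 2 from the post: the tilde is lost and the word changes.
print(strip_diacritics("pe\u00f1a"))
```

Either way the reader ends up needing a glyph for U+00F1, or reading a different word.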

raac · 05-20-2011, 02:35 PM

Quote:

Originally Posted by Jellby

1) Use named or numerical entities instead of the Unicode character. For instance, instead of "Peña" write "Pe&ntilde;a". This does not solve anything: no matter whether you use "ñ" or "&ntilde;", the font does not have the character, and it shows a box or a question mark instead.

That's what I meant, yes. I suppose you're right it wouldn't help. If the book is packaged with a suitable font that contains that character then shouldn't everything be good?

Jellby · 05-20-2011, 02:58 PM

Quote:

Originally Posted by raac

If the book is packaged with a suitable font that contains that character then shouldn't everything be good?

Yes, assuming that the reading software supports embedded fonts. But that would force a particular font on the user on readers that would otherwise support selecting a custom font (e.g., the Cybooks).

Really, the culprits here are Sony (for not allowing custom fonts on their readers) and Adobe (for not providing better Unicode coverage in the default font). There's no excuse for either of them.

bizzybody · 05-20-2011, 05:49 PM

Here's a program that can read in a UTF-8 encoded HTML file and replace the UTF-8 HTML codes with the exact extended ASCII equivalent. https://www.mobileread.com/forums/sho...d.php?t=109996

It's not just for that, it can be used to process any text file and swap any specific string(s) with other text string(s). It's written in C# and needs a bit more debugging because if the replacement list is too long it does things it should not do.

As is, it can handle enough to swap the most common accented characters used in English, as well as the punctuation characters. Debugged to handle any length swap list, it could be a very useful text file manipulation tool. It's already faster than any word processor or text editor for doing huge numbers of replacements.
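The kind of swap-list replacement described here can be sketched in a few lines of Python. This is only an illustration of the general technique, not the linked C# program; the function name `apply_swaps` and the sample swap list are assumptions made up for the example.

```python
import re

def apply_swaps(text: str, swaps: dict[str, str]) -> str:
    """Replace every key of `swaps` found in `text` with its value, in one pass."""
    if not swaps:
        return text
    # Try longer keys first so a short key never pre-empts a longer one.
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: swaps[m.group(0)], text)

# Hypothetical swap list: numeric HTML entities to the characters they name.
swaps = {"&#8220;": "\u201c", "&#8221;": "\u201d", "&#8212;": "\u2014"}
print(apply_swaps("&#8220;Wait&#8212;what?&#8221;", swaps))
```

Compiling the whole list into one pattern means the text is scanned once, however long the list is, and already-replaced text is never matched a second time.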

With a full character set swap file (which it currently can't handle) one could use it for one time pad cipher codes.

Could even run a file through several swaps to swap words for code words then totally scramble all the letters. The receiving person would need correctly formatted swap lists, used in the right order, to unscramble and decode.

WTH use UTF-8 for punctuation when ASCII and ordinary character encodings for Windows and other systems have characters like left and right quotes that produce exactly the same visible result? Unicode for standard characters when there's no need is text-bloat.

Replacing a couple thousand left and right unicode double quote marks with the left and right ASCII versions can reduce the file size quite a bit! A UTF-8 code is up to 7 characters, if leading zeroes are used. &#nnnn; One could write a whole text file that way but it'd be six times larger than using plain characters.

Another method that mostly works on HTML source files is to Save As Filtered HTML from Microsoft Word, but that can introduce its own issues with Microsoft's 'additions'.

Jellby · 05-21-2011, 03:31 AM

Quote:

Originally Posted by bizzybody

Here's a program that can read in a UTF-8 encoded HTML file and replace the UTF-8 HTML codes with the exact extended ASCII equivalent. https://www.mobileread.com/forums/sho...d.php?t=109996

I use recode, which is very easy:

Code:

recode utf8..html file.html
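For readers without recode installed, roughly the same idea can be sketched in Python. This is a stand-in under stated assumptions, not a description of what recode does internally: it emits numeric character references only, whereas recode may choose named entities. The function name `to_entities` is hypothetical.

```python
import html

def to_entities(text: str) -> str:
    """Turn every non-ASCII character into a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

sample = "Pe\u00f1a said \u201cwait\u201d"
converted = to_entities(sample)
print(converted)

# Decoding the references gives back the original text.
print(html.unescape(converted) == sample)
```

Characters that are already ASCII, including `&` and the angle brackets of existing markup, pass through untouched.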

Quote:

WTH use UTF-8 for punctuation when ASCII and ordinary character encodings for Windows and other systems have characters like left and right quotes that produce exactly the same visible result? Unicode for standard characters when there's no need is text-bloat.

Replacing a couple thousand left and right unicode double quote marks with the left and right ASCII versions can reduce the file size quite a bit! A UTF-8 code is up to 7 characters, if leading zeroes are used. &#nnnn; One could write a whole text file that way but it'd be six times larger than using plain characters.

I think you are inverting the terms. Real ASCII has only 128 characters, everything else must be represented through (named or numerical) entities. &rsquo; and &#8217; are "ASCII representations" in this discussion, as they use ASCII characters to represent another character that is not in the ASCII set. This is where text bloat is possible.

Using Unicode characters means using some Unicode encoding to represent the character directly, not through entities like above, so I can just write "é" or "ñ". These, in UTF-8, take at most 4 bytes, and typically 2 bytes (for Latin, Cyrillic or Greek scripts) or 3 bytes (for some punctuation).
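The byte counts are easy to check. The short Python sketch below (the helper names `utf8_len` and `numeric_entity` are invented for this example) compares the size of a character written directly in UTF-8 with the size of its numeric entity.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

def numeric_entity(ch: str) -> str:
    """Decimal numeric character reference for a single character."""
    return f"&#{ord(ch)};"

for ch in ["\u00e9", "\u00f1", "\u2019", "\u201c", "\u2014"]:
    entity = numeric_entity(ch)
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} bytes direct, {len(entity)} bytes as {entity}")
```

For the accented Latin letters the direct form is 2 bytes and for the curly quotes and the em dash it is 3 bytes, against 6 or 7 ASCII characters for the entity.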

But anyway, in ePUB all files are compressed, so the "bloat" introduced by the entities will be largely cancelled (since they are repetitive sequences, they can be more efficiently compressed).
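A rough way to see the compression effect is to deflate the same text written both ways. The sketch below uses Python's `zlib` on a synthetic, highly repetitive sample, so it exaggerates how well real book text compresses; the sample sentence and the repeat count are assumptions chosen only for illustration.

```python
import zlib

REPEATS = 2000
# Same sentence, once with literal curly quotes and once with numeric entities.
direct = ("\u201cHello,\u201d she said. " * REPEATS).encode("utf-8")
entities = ("&#8220;Hello,&#8221; she said. " * REPEATS).encode("ascii")

raw_gap = len(entities) - len(direct)
packed_gap = len(zlib.compress(entities)) - len(zlib.compress(direct))

print("raw sizes:       ", len(direct), len(entities))
print("compressed sizes:", len(zlib.compress(direct)), len(zlib.compress(entities)))
print("gap shrinks after compression:", packed_gap < raw_gap)
```

The entity version is larger before compression, but most of that difference disappears once the repeated sequences are deflated.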

raac · 05-23-2011, 11:07 AM

Penguin have so far only sent me a stock reply, saying that they have forwarded my message on to the appropriate department and may contact me again. We'll see what happens...

Michael J Hunt · 05-24-2011, 09:11 AM

I'm not a Kindle user, but I was surprised (shocked, dismayed) to see a full-page Kindle advert on the back cover of the Radio Times (a high profile weekly magazine in the UK) that displayed a page from 'Ordinary Thunderstorms', where the em-dash, or even the shorter en-dash, has been superseded by a hyphen. At first I thought 'river-all' was some obscure feature of a river, until, in the same sentence, I came to 'no doubt-but let's wait'. The next paragraph starts with, 'There he is-look-stepping hesitantly down from a taxi'.

I found this so distracting, I couldn't read on - even though it was only a single-page advert. There is no way that I would buy a Kindle if all their books are edited in this way.

Am I alone in finding this disturbing? Or is it common practice in e-readers, which regular customers accept without complaint?

DreamWriter · 05-24-2011, 10:54 AM

I just downloaded a sample of Ordinary Thunderstorms so I could see what you are talking about. Actually, those aren't hyphens. They are en dashes where there should be the longer em dashes. (If you still have the advertisement, compare the en dashes you referred to with the hyphens in "pale-faced" and "even-featured" if they show there.)

I find it very difficult to read that way too. I'm not sure why the publisher did that. It's very easy to code in em dashes. It was certainly a very poor example for Amazon to use in their Kindle advert.

I have to say that I have not seen that in a Kindle ebook before. I've usually seen the proper em dash used, two hyphens together, or space-hyphen-space.

Edited to add: When I created my husband's ebook, I did use the proper em dash. But there is a drawback, on the Kindle anyway. Kindle attempts to justify text, but it cannot hyphenate. Text is reflowable, so a publisher cannot control this either. If a line break occurs at an em dash (or an en dash), the Kindle cannot break it right after the dash, as you would see in print. Instead, it treats the word-em dash-word as a block and carries it all to the next line. This can leave a very unsightly space at the end, where the line broke. There's nothing that can be done about that. That's one reason why some people use space-hyphen-space instead of em dash in ebooks. (And others probably don't know how to create the em dash.)

This doesn't explain why the publisher used the en dash instead of the em dash in the book you cited, but I wanted to point out that there are some related difficulties with ebook formatting.

SeaBookGuy · 05-24-2011, 11:25 AM

Speaking of em dashes -- my last ebook had those instead of a final ess-apostrophe, so I was faced with things like " ... my parents -- car, the neighbors -- children" etc.

DreamWriter · 05-24-2011, 11:31 AM

Quote:

Originally Posted by SeaBookGuy

Speaking of em dashes -- my last ebook had those instead of a final ess-apostrophe, so I was faced with things like " ... my parents -- car, the neighbors -- children" etc.

Ew, that's awful! I can't think of any reason why they did that.

bizzybody · 05-25-2011, 12:15 AM

The attached file is a text file with the UTF-8 codes and their extended ASCII or Windows-1252 equivalents. (Or ISO 8859-1.) Note that the non-breaking space has the HTML "friendly" code because that's a non-printable character, also non-type-able without using the Alt+nnn code. The HTML code works with any book conversion software I've used.
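Which of these characters exist in which single-byte encoding can be checked directly. The Python sketch below is an illustration only (the helper `encodable` and the character selection are assumptions, not the contents of the attached file). It suggests that the curly quotes and dashes are available in Windows-1252 but not in plain ASCII or ISO 8859-1, while the non-breaking space exists in both 8-bit sets.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the character can be represented in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

chars = {
    "left single quote": "\u2018",
    "right single quote": "\u2019",
    "left double quote": "\u201c",
    "right double quote": "\u201d",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "non-breaking space": "\u00a0",
}

for name, ch in chars.items():
    supported = [c for c in ("ascii", "latin-1", "cp1252") if encodable(ch, c)]
    print(f"{name}: {supported}")
```

In Python, `latin-1` is the codec name for ISO 8859-1 and `cp1252` is Windows-1252.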

Any Unicode supporting system should *not* need any of these characters' Unicode versions or UTF-8 codes in order to properly display them.

In fonts like Terminal, or the ANSI set (which Terminal is a monospaced TrueType clone of), some of the characters are different, but you won't encounter that on PDAs or book readers.

If you want your book to reach the widest possible audience, without getting questions about why there's all those weird characters or boxes or why the punctuation is all missing or replaced with nothing and the words jammed together... use the normal characters on this list instead of their Unicode versions, or in HTML their UTF-8 codes.

If the language you're using in your book has characters not in this list, then it's extremely likely the people reading it will have a device that supports Unicode or some other method of displaying those characters.

The main blame for all these issues with character encoding lies with America. Since the vast majority of personal computers are still based on Ye Olde IBM PC, which was originally designed by Americans for English speakers, support for "foreign" characters was pretty much an afterthought for MS-DOS and PC-DOS. A similar problem was built into the early Internet (which is *not* the World Wide Web), which in its early years was all American. All the characters required for English could be encoded using 7-bit words, so that's how it was done, leaving the eighth bit always assumed to be zero unless commands were sent to specifically initiate a binary file transfer.

Remember that even mainframe computers 30+ years ago had memory measured in kilobytes. A system with a whole megabyte of RAM had a gigantic amount of memory to play with.

That's why the BinHex encoding format was created for sending Macintosh files across the internet. Many of the early routing systems were set to ignore the leftmost bit so that all outgoing traffic had that bit set to zero, no matter what it had been when it came in. BinHex uses only 7-bit text characters, thus it would survive transits through 7-bit routers. The MacBinary format used 8-bit text characters and was up to 1/8th more compact, which was a big savings when a 3600 baud modem was "screaming fast" and there was no such thing as unlimited data accounts.

So when you see weird junk in your books, first blame the English-centric American pioneers of the micro computer and the Internet, then blame the people at the company who made your reading device for not getting on the Unicode bandwagon from the start.

In other words, there's really no excuse for Palm OS (or any other PDA or book reader) to not have Unicode support, since the first standard for it was completed circa 1990~91 and the first Palm didn't go on sale until 1996!

Michael J Hunt · 05-25-2011, 11:05 AM

Hi DreamWriter. So it isn't just me - I'm relieved to hear it. What I find unbelievable is that a company like Amazon didn't spot that for themselves when they agreed the advert. Talk about compounding an error.

One thing you mentioned that I'd like to pick up on is where you state 'Some people probably don't know how to create the em-dash'. You can count me in on that - I assume you're referring to Microsoft Word. How I do it is to copy an em-dash from the text and then paste it where I want it. Alternatively, I pick one out of 'Symbols', then copy it for further use. Cumbersome, I know, but it's far better than having hundreds of en-dashes to convert during editing.

If you know how to activate a consistent em-dash in Word, I'd be delighted if you could let me in on the trick.

pdurrant · 05-25-2011, 01:07 PM

Quote:

Originally Posted by Michael J Hunt

If you know how to activate a consistent em-dash in Word, I'd be delighted if you could let me in on the trick.

On Macintosh, en-dash is alt/- (–) and em-dash is alt/shift/- (—)

On Windows you probably need to do something complicated with the numeric keypad. (Checked: Probably Alt+0150 for en-dash and Alt+0151 for em-dash)

http://en.wikipedia.org/wiki/Dash

DiapDealer · 05-25-2011, 01:39 PM

Those are the correct alt codes for the different dashes on Windows.

If you don't feel like memorizing alt codes (or writing them down) just bring up the Windows Character Map utility (Programs->Accessories->System Tools). It will allow you to select and copy any of the special (or unicode) characters so you can paste them into documents.
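As a quick sanity check on those two Alt codes: Alt codes typed with a leading zero are looked up in the Windows ANSI code page, which is Windows-1252 on Western European and US systems. Assuming that code page, a couple of lines of Python show that 150 and 151 land on the en dash and em dash.

```python
# Decode the Alt-code numbers as Windows-1252 byte values (assumed code page).
en_dash = bytes([150]).decode("cp1252")
em_dash = bytes([151]).decode("cp1252")

print(f"Alt+0150 -> U+{ord(en_dash):04X}")
print(f"Alt+0151 -> U+{ord(em_dash):04X}")
```

On a system with a different ANSI code page the same Alt codes may produce other characters.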
