Thread: Typos in ebooks
Old 05-27-2010, 06:15 AM   #115
bizzybody
Addict
bizzybody ought to be getting tired of karma fortunes by now.
 
Posts: 302
Karma: 8317682
Join Date: Apr 2007
Location: Idaho, USA
Device: Various PalmOS PDAs, Android Phones, Sharper Image Literati
Typos. BAEN has them a-plenty.

I've read all of the free BAEN e-books they've released on the CD-ROMs included with some of their hardcovers. I'm pretty certain they've all had typos.

BAEN does eARCs (electronic Advance Reader Copies) and charges money for them. ARC buyers are supposed to do things like proofread and provide feedback to the author. Sometimes the eARC and the final version have significant changes from reader feedback, but usually not.

But still the typos persist.

One place where punctuation issues can creep in is mixing Unicode and non-Unicode text encodings. For English-language books, an Extended ASCII character set (such as Windows-1252) works just fine; it has all the punctuation required. The problem is that some platforms (like Palm OS) don't natively support Unicode, and conversion programs often have no setting for mapping Unicode punctuation down to its Extended ASCII equivalents.
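As a small illustration (the sample text is mine, not from the post), here's a Python sketch showing that common Unicode punctuation really does fit in Windows-1252, one common "Extended ASCII" code page:

```python
# Common "smart" punctuation in a Unicode string...
text = "It\u2019s \u201cfine\u201d \u2013 mostly\u2026"

# ...round-trips through Windows-1252 ("Extended ASCII") intact,
# because that code page has its own slots for these marks:
data = text.encode("cp1252")
assert data.decode("cp1252") == text
assert len(data) == len(text)  # one byte per character
```

The trouble starts with characters that have no Windows-1252 slot at all; those are where a converter has to do something smarter than a straight encode.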

Load an ASCII file into a program on a Unicode-compliant platform and it'll display fine. Go the other way (if it's possible without conversion) and, depending on the reading software, you may find every Unicode character replaced with a box, an empty space, or *nothing* (with the text on either side jammed together), or the program will substitute the same-numbered character from the system's native character set.
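As a concrete sketch of that "jammed together" failure (my own example, not from the post), here's what happens when a lossy conversion simply drops characters it can't represent, versus substituting a placeholder:

```python
text = "one\u2014two"  # an em dash between two words

# Dropping unrepresentable characters jams the neighbors together:
assert text.encode("ascii", errors="ignore").decode("ascii") == "onetwo"

# Substituting a placeholder at least keeps the words apart:
assert text.encode("ascii", errors="replace").decode("ascii") == "one?two"
```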

Do a conversion from a Unicode source to a format that doesn't support Unicode at all, and the foul-ups will appear on every platform the file can be opened on.

One fix that usually takes care of this in HTML is saving as Filtered HTML in MS Word. That's the cleanest HTML one can get from MS Word.

Another is this little program: http://ratzmandious.110mb.com/files/UTFStripper.zip
It does exactly what it says on the tin. It takes a text file, looks for numeric character references (the codes that start with &# followed by three or four decimal digits; leading zeroes can be dropped) and replaces them with their ASCII equivalents.
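I haven't seen UTFStripper's source, but based on that description a rough Python equivalent might look like this (the mapping table is my own small sample, not the tool's actual list):

```python
import re

# Assumed sample mapping from decimal character references to ASCII:
NCR_TO_ASCII = {
    8211: "-",                # en dash
    8212: "--",               # em dash
    8216: "'", 8217: "'",     # single quotes
    8220: '"', 8221: '"',     # double quotes
    8230: "...",              # ellipsis
}

def strip_ncrs(html: str) -> str:
    """Replace known &#nnnn; references with ASCII stand-ins."""
    def repl(m):
        # Leave unknown references untouched rather than guess:
        return NCR_TO_ASCII.get(int(m.group(1)), m.group(0))
    return re.sub(r"&#(\d{3,4});", repl, html)

print(strip_ncrs("It&#8217;s &#8220;done&#8221; &#8212; finally"))
# It's "done" -- finally
```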

What you're left with is technically still a valid UTF-8 HTML file, but it contains nothing outside plain ASCII, so it will convert cleanly for platforms that don't have Unicode support. The file size will also shrink a bit, since the half-dozen characters required to encode a single character are replaced with one real character.

Were one so inclined, all the text of a UTF-8 encoded HTML file could be written as those &#nnnn; codes, though the file would grow to roughly five times the size. Hmmm, a conversion program for that would be fairly simple to write, but it would have to skip over HTML tags and whatever in the header must remain plain ASCII.
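The musing above is a one-liner in Python (the function name is mine), and it also shows exactly why the size balloons:

```python
def to_ncrs(text: str) -> str:
    # Write every character as a decimal numeric character reference.
    return "".join(f"&#{ord(c)};" for c in text)

encoded = to_ncrs("Hello")
assert encoded == "&#72;&#101;&#108;&#108;&#111;"
# Five characters became twenty-nine: each ASCII letter costs
# "&#nn;" or "&#nnn;", so plain text grows five- to six-fold.
assert len(encoded) == 29
```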

The problem is all the USA's fault for being the main originator of computer technology way back when, when early computer technical people never gave a thought to anyone using non-English languages on computers. That, and the original non-extended ASCII character set used only 7 bits. Early internet routers were programmed to assume all the data passing through was plain English text, so they'd just set the first bit to zero on every outgoing byte, m'kay... saves 1/8th on bandwidth at 110 or 300 baud. That's what made the BinHex encoding format necessary for sending Macintosh files.

On behalf of The USA, I humbly apologize for that.