Thread: Typos in ebooks
Old 05-20-2011, 05:49 PM   #200
bizzybody
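Here's a program that can read in a UTF-8 encoded HTML file and replace the HTML numeric character codes (the &#nnnn; kind) with the exact extended ASCII equivalent, meaning the single-byte characters in a code page such as Windows-1252. https://www.mobileread.com/forums/sho...d.php?t=109996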

It's not just for that; it can be used to process any text file and swap any specific string(s) with other text string(s). It's written in C# and needs a bit more debugging, because if the replacement list is too long it does things it should not do.
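For anyone curious what a swap-list replacement looks like, here's a minimal sketch in Python. It is not the C# program linked above; the `SWAPS` table and the `apply_swaps` name are made up for illustration. It does all the swaps in one pass, so the output of one swap is never re-swapped by a later entry.

```python
import re

# Hypothetical swap list: HTML numeric character codes -> plain characters.
SWAPS = {
    "&#8220;": "\u201c",  # left double quote
    "&#8221;": "\u201d",  # right double quote
    "&#8216;": "\u2018",  # left single quote
    "&#8217;": "\u2019",  # right single quote
    "&#233;": "\u00e9",   # e with acute accent
}


def apply_swaps(text, swaps):
    """Replace every key of `swaps` found in `text` with its value, in one pass."""
    if not swaps:
        return text
    # Longest keys first, so a short key cannot pre-empt a longer one
    # that starts with the same characters.
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: swaps[m.group(0)], text)


print(apply_swaps("&#8220;caf&#233;&#8221;", SWAPS))
```

Because the pattern is built from the list itself, the length of the swap list should not change the behaviour, only the speed.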

As is, it can handle enough to swap the most common accented characters used in English, as well as the punctuation characters. Once debugged to handle a swap list of any length, it could be a very useful text file manipulation tool. It's already faster than any word processor or text editor I've tried for doing huge numbers of replacements.

With a full character set swap file (which it currently can't handle), one could use it for substitution cipher codes. One could even run a file through several swaps, first swapping words for code words and then totally scrambling all the letters. The receiving person would need correctly formatted swap lists, used in the right order, to unscramble and decode.
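The several-swaps idea can be sketched roughly like this in Python. Everything here is an assumption for illustration: `make_swap_list`, `encode` and `decode` are made-up names, the swap lists only cover lowercase letters, and strictly speaking this is a letter substitution scheme, not a true one-time pad. The point is that the receiver has to undo the same lists in the reverse order.

```python
import random
import string


def make_swap_list(seed):
    """Hypothetical helper: a shuffled alphabet used as a one-to-one swap list."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def encode(text, swap_lists):
    # Apply each swap list in order; characters not in a list pass through.
    for swaps in swap_lists:
        text = text.translate(str.maketrans(swaps))
    return text


def decode(text, swap_lists):
    # Undo the swaps: last list first, each one inverted.
    for swaps in reversed(swap_lists):
        inverse = {v: k for k, v in swaps.items()}
        text = text.translate(str.maketrans(inverse))
    return text


lists = [make_swap_list(1), make_swap_list(2)]
message = "meet at noon"
scrambled = encode(message, lists)
print(scrambled)
print(decode(scrambled, lists))
```

This only works because every swap list is one-to-one. A list that maps two different strings to the same replacement can't be reversed.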

WTH use Unicode codes for punctuation when the ordinary single-byte character encodings for Windows and other systems (plain 7-bit ASCII doesn't have them, but Windows-1252 does) have characters like left and right quotes that produce exactly the same visible result? Unicode for standard characters when there's no need is text-bloat.

Replacing a couple thousand left and right Unicode double quote marks with the left and right single-byte versions can reduce the file size quite a bit! A numeric character code is up to 7 characters, if leading zeroes are used: &#nnnn; One could write a whole text file that way but it'd be six times larger than using plain characters.
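To put rough numbers on that, here's a small Python sketch that counts the bytes for one left double quote written three ways. The `byte_costs` name is made up, and Windows-1252 (`cp1252` in Python) is my assumption for the single-byte encoding.

```python
def byte_costs(ch):
    """Return (numeric code, UTF-8, Windows-1252) sizes in bytes for one character."""
    reference = "&#{};".format(ord(ch))  # e.g. "&#8220;" for the left double quote
    return (
        len(reference.encode("ascii")),
        len(ch.encode("utf-8")),
        len(ch.encode("cp1252")),
    )


ref_size, utf8_size, single_size = byte_costs("\u201c")
count = 2000  # "a couple thousand" quote marks
print("numeric codes:", ref_size * count, "bytes")
print("raw UTF-8:    ", utf8_size * count, "bytes")
print("Windows-1252: ", single_size * count, "bytes")
```

So the &#nnnn; form costs 7 bytes per quote mark, the same character stored directly as UTF-8 costs 3, and the single-byte version costs 1.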

Another method that mostly works on HTML source files is to Save As Filtered HTML from Microsoft Word, but that can introduce its own issues with Microsoft's 'additions'.