MobileRead Forums - View Single Post

bizzybody · 05-20-2011, 05:49 PM

Here's a program that can read in a UTF-8 encoded HTML file and replace the numeric HTML character codes (the &#nnnn; kind) with the exact extended ASCII equivalent. https://www.mobileread.com/forums/sho...d.php?t=109996

It's not just for that; it can process any text file and swap any specific string(s) with other text string(s). It's written in C# and needs a bit more debugging, because if the replacement list is too long it does things it should not do.
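To show the general idea of a swap list, here is a minimal sketch in Python (not the C# program linked above, whose internals aren't shown here). The function name `apply_swaps` and the sample swap list are made up for illustration. It does all the replacements in a single pass, so text that has already been swapped is never swapped again by a later entry.

```python
import re

def apply_swaps(text, swaps):
    """Replace every key of `swaps` found in `text` with its value, in one pass."""
    if not swaps:
        return text
    # Try longer keys first so a short key never pre-empts a longer one
    # that starts with the same characters.
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: swaps[m.group(0)], text)

# Hypothetical swap list: numeric HTML codes -> the plain characters they stand for.
swaps = {
    "&#8220;": "\u201c",  # left double quote
    "&#8221;": "\u201d",  # right double quote
    "&#8217;": "\u2019",  # right single quote / apostrophe
    "&#233;": "\u00e9",   # e with acute accent
}

html = "&#8220;Caf&#233;&#8221; isn&#8217;t open."
print(apply_swaps(html, swaps))
```

Because the whole list is compiled into one pattern, the length of the list shouldn't change the behavior, only the speed.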

As is, it can handle enough to swap the most common accented characters used in English, as well as the punctuation characters. Once debugged to handle a swap list of any length, it could be a very useful text file manipulation tool. It's already faster than any word processor or text editor for doing huge numbers of replacements.

With a full character set swap file (which it currently can't handle) one could use it for simple substitution cipher codes. (Not a true one time pad, since the same swap list gets reused for every character.)

Could even run a file through several swaps to swap words for code words then totally scramble all the letters. The receiving person would need correctly formatted swap lists, used in the right order, to unscramble and decode.
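A rough sketch of the multi-pass idea in Python, assuming every swap list is one-to-one so it can be turned around for decoding. All the names here (`swap_strings`, `invert`, `encode`, `decode`) and the sample lists are hypothetical. Decoding applies the inverted lists in the opposite order.

```python
import re

def swap_strings(text, swaps):
    """Single-pass replacement of each key in `swaps` with its value."""
    if not swaps:
        return text
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: swaps[m.group(0)], text)

def invert(swaps):
    """Turn a swap list around; refuse lists that cannot be undone."""
    inverse = {v: k for k, v in swaps.items()}
    if len(inverse) != len(swaps):
        raise ValueError("swap list is not reversible: two entries share a replacement")
    return inverse

def encode(text, swap_lists):
    for swaps in swap_lists:
        text = swap_strings(text, swaps)
    return text

def decode(text, swap_lists):
    # Undo the last swap first, using each list in reverse.
    for swaps in reversed(swap_lists):
        text = swap_strings(text, invert(swaps))
    return text

# Pass 1 swaps words for code words, pass 2 scrambles the letters.
words = {"meet": "zebra", "noon": "kiwi"}
letters = dict(zip("abcdefghijklmnopqrstuvwxyz", "qwertyuiopasdfghjklzxcvbnm"))

message = "meet at noon"
secret = encode(message, [words, letters])
print(secret)
print(decode(secret, [words, letters]))
```

One catch: if the original message already contains a code word like "zebra", decoding will turn it into "meet", so the code words have to be ones that never appear in the plain text.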

WTH use Unicode codes for punctuation when the ordinary extended ASCII character encodings for Windows (such as Windows-1252) and other systems have characters like left and right quotes that produce exactly the same visible result? (Plain 7-bit ASCII itself only has the straight quote marks.) Unicode for standard characters when there's no need is text-bloat.

Replacing a couple thousand left and right Unicode double quote marks with the left and right extended ASCII versions can reduce the file size quite a bit! A numeric character code in the form &#nnnn; is 7 characters, whether the four digits are all significant (&#8220;) or padded with leading zeroes (&#0233;). One could write a whole text file that way but it'd be about seven times the size of one using plain characters.
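The arithmetic can be checked with a few lines of Python. This assumes the single-byte target encoding is Windows-1252 (the post doesn't name one), and the count of 2000 quote marks is just the "couple thousand" figure from above.

```python
quote = "\u201c"        # left double quote mark
reference = "&#8220;"   # the same character written as a numeric HTML code

sizes = {
    "numeric code": len(reference.encode("ascii")),
    "UTF-8": len(quote.encode("utf-8")),
    "Windows-1252": len(quote.encode("cp1252")),  # assumed target encoding
}

count = 2000  # "a couple thousand" quote marks
for name, size in sizes.items():
    print(f"{name}: {size} byte(s) each, {size * count} bytes for {count} quotes")

saved = (sizes["numeric code"] - sizes["Windows-1252"]) * count
print(f"Swapping numeric codes for single bytes saves {saved} bytes")
```

So each quote mark costs 7 bytes as a numeric code, 3 bytes as a raw UTF-8 character, and 1 byte in Windows-1252; for 2000 quotes, swapping the numeric codes for single bytes saves 12000 bytes.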

Another method that mostly works on HTML source files is to Save As Filtered HTML from Microsoft Word, but that can introduce its own issues with Microsoft's 'additions'.
