03-26-2007, 12:10 AM   #12
mogui
Special characters

I do file conversions frequently, and by various means. I fixed my old broken TRGpro, so now I use it for reading again. I am in China so I can't just go to CompUSA or Fry's and pick up a new device. Though I have plenty of e-books in Palm format, most of my files are in ".pdf" or ".txt" formats. I have found Plucker to be a wonderful way to achieve readable formatting of those files with very little effort. I do not need to futz with line endings and such.

But when I was formatting text for my MP4 player, which has a 24-character line, I became familiar with the necessary gyrations, which I will outline here.

I converted the Palm TX manual from pdf to txt to use here as an example. It is freely downloadable. I saved it as text using Acrobat. I brought it up in PSPad, an excellent freeware programmer's editor. Then I was able to view the txt file in hex mode. The first little bit looks like this:

"User Guide 0D0A 0D0A 0D0A 0C 0D0A Copyright and"

The ASCII codes are:
0D0A = carriage return, line feed (CRLF).
0C = form feed (FF), or page break.

The CRLF is what you are calling a paragraph mark. It commonly shows in most editors as a paragraph symbol. This symbol is not a part of the ASCII character set. The form feed is a page break. Often the FF is close to a number or a repeated string. This is a good clue for the identification of text you might wish to remove.
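
If you do not have a hex-capable editor handy, a few lines of script will show the same thing. Here is a rough Python sketch, not something I used at the time. The function names hex_view and count_controls are my own invention, and the sample bytes just mimic the start of the manual shown above.

```python
def hex_view(data: bytes, width: int = 16) -> str:
    """Return a simple hex dump: offset, hex bytes, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Show control characters (CR, LF, FF, ...) as dots
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{width * 3}} {text_part}")
    return "\n".join(lines)


def count_controls(data: bytes) -> dict:
    """Count the control sequences discussed above."""
    return {
        "CRLF": data.count(b"\r\n"),
        "FF": data.count(b"\x0c"),
        "TAB": data.count(b"\t"),
    }


# Sample bytes mirroring the start of the converted manual (an assumption,
# reconstructed from the hex excerpt quoted in the post)
sample = b"User Guide\r\n\r\n\r\n\x0c\r\nCopyright and"
print(hex_view(sample))
print(count_controls(sample))
```

For a real file you would read it in binary mode, for example open(path, "rb").read(), so that Python does not translate the line endings before you get to see them.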

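My problem was to reformat the text to fit a 24-character line and reflow it at word breaks. I followed these steps (the <*> is just a marker string, chosen because it does not occur in the text, so the original paragraph ends can still be found after the reflow):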
1/ Replace all CRLF sequences with <*>CRLF.
2/ Select all the text in my editor (PFE in this case) and reflow it.
3/ Replace all <*>CRLF sequences with CRLF.
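
The steps above can also be sketched in a script. This is only an illustration in Python, not the editor procedure itself: the standard textwrap module stands in for the editor's reflow command, and splitting on CRLF plays the role of the <*> marker. The function name reflow is my own.

```python
import textwrap


def reflow(text: str, width: int = 24) -> str:
    """Wrap each CRLF-terminated paragraph to `width` columns at word breaks."""
    out_lines = []
    for para in text.split("\r\n"):
        if para.strip():
            # textwrap.fill breaks between words and joins the pieces with "\n"
            out_lines.extend(textwrap.fill(para, width=width).split("\n"))
        else:
            # Empty or whitespace-only lines (a lone form feed included)
            # are kept as empty lines
            out_lines.append("")
    return "\r\n".join(out_lines)


text = "The quick brown fox jumps over the lazy dog.\r\nSecond paragraph."
print(reflow(text, 24))
```

Note that textwrap will split a single word that is longer than the line width, which is probably what you want on a 24-character display.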

Some good free programmer's editors are PSPad, ConTEXT and PFE32.

Now the text has the proper paragraph formatting and text breaks occur between words. In other words, it is readable. Now if I wish, I can replace CRLFCRLF sequences with CRLF to eliminate extra line spacing. The form feed "0C" character can be replaced by a space or a CRLF sequence -- your choice. I usually replace tabs with 2 spaces and later crunch spaces down by repeatedly replacing SPSP with SP.
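
The cleanup replacements just described are easy to script as well. Here is a minimal Python sketch under my own assumptions: the function name tidy and the ff_replacement parameter are made up for illustration, and the CRLFCRLF replacement is done as a single pass, so long runs of blank lines are halved rather than removed entirely.

```python
def tidy(text: str, ff_replacement: str = "\r\n") -> str:
    """Apply the cleanup replacements in the order given above."""
    # CRLFCRLF -> CRLF, one pass, to reduce extra line spacing
    text = text.replace("\r\n\r\n", "\r\n")
    # Form feed (0x0C) -> a space or a CRLF, your choice
    text = text.replace("\x0c", ff_replacement)
    # Tabs -> 2 spaces
    text = text.replace("\t", "  ")
    # Crunch spaces: repeat SPSP -> SP until no double spaces remain
    while "  " in text:
        text = text.replace("  ", " ")
    return text


print(repr(tidy("a\tb   c\r\n\r\nd\x0ce", ff_replacement=" ")))
```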

There is a small free conversion utility called storymaster that will do the above operations quite satisfactorily. The site will link you to a free download of html2txt that includes the source code, in case you are starting with an html file. Of course the problem stated at the beginning of this thread is more complex, which is why I do not simply give you the link to storymaster and be done with it.

This thread discusses scripting. Windows has a nice scripting facility built in, called Windows Script Host, or WSH. With it you can write scripts in VBScript or JScript (Microsoft's JavaScript dialect) and run them just like regular programs. Microsoft is quite happy to teach you all about it here. Help with VBScript is only a click away. You can find lots of script examples on the web, so you don't have to start from scratch. I never do.

I am attaching a VBScript WSH script I used to reformat text to 24-character lines for my MP4 player. I have added the extension ".txt" to the filename so the uploader would accept it. Remember to virus-scan all executables before running them. Open it in Notepad or your favorite editor. Remove the ".txt" extension before running it. It is not perfect. I never bothered to clean it up. I adapted it from a script that converted html to text. I have chopped out irrelevant code. I have commented out the part of it that is involved with string replacement. You can use that section for your own experiments.

Back in the old days, running Unix systems, we used AWK for all our text reformatting needs. It is a dream come true for changing text strings, and it does not take long to learn. After all, programmers can use it. Whenever we wanted to tell programmers from other people we would just point to something. Ordinary people would look where we pointed. The programmers would always look at our finger. So, if they can do it you can do it. This forum is populated with highly intelligent people.

This site offers sample AWK scripts and will help one to learn to use AWK. It refers to AWK as a programming language. I am sorry. It is really a simple command-line utility. Here are some simple one-line examples of how to manipulate text. Once you get used to it you will love it.

In summary, it is essential to be able to see the raw data in your files to understand what you need to do. I like hex editors for this. PSPad has a hex viewing mode. Find an ASCII chart you like and bookmark it. Then you can understand what the character codes are. Many programming editors will allow you to do search and replace using character codes. Sometimes they are backslash escapes like "\f" (FF) or "\r\n" (CRLF; some editors accept just "\n" for a line ending). Sometimes they are hex codes like "0x0c" (FF) or "0x0d0x0a" (CRLF). See the help files for your editor under "regular expressions". By experimenting with different replacement sequences you can learn what needs to be done. Then you are ready to use a script or a tool like AWK.
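
To show that the escape form and the hex form are two spellings of the same character, here is a small Python sketch using its standard re module. Your editor's syntax may differ, so treat this as an illustration of the idea rather than of any particular editor. The sample text is made up.

```python
import re

text = "page one\x0cpage two\r\nend"

# "\f" and "\x0c" both name the form feed character (ASCII 12)
by_escape = re.sub(r"\f", " ", text)
by_hex = re.sub(r"\x0c", " ", text)

# CRLF written as hex codes, replaced by a bare line feed
unix_endings = re.sub(r"\x0d\x0a", "\n", text)

print(repr(by_escape))
print(repr(by_hex))
print(repr(unix_endings))
```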

If you create a useful script or AWK command line, please consider posting it here so we can all learn.
Attached Files
File Type: txt Text formatter.vbs.txt (3.5 KB, 270 views)

Last edited by mogui; 03-26-2007 at 04:11 AM.