MobileRead Forums - View Single Post

Greg Anos · 11-16-2008, 10:41 AM

Quote:

Originally Posted by ficbot

I am looking for an easier way to convert a large batch of plain text ebooks that my sister sent me. There are lots of messy files with too many paragraph breaks, or no paragraph breaks, and other such issues. I was importing them into Open Office and manually going through to remove the breaks (using the CF option when I import them, as some had no breaks at all), but I was still getting a lot of garbage in them once I converted them into pdb files to load on the iPod.

I finally figured out a way to get the books to appear in a satisfactory way in the finished file, but it is very labour-intensive, using Neo Office and Kompozer, which is an HTML program:

1) Open it in Neo Office and if it gives the option, say 'cf' only
2) Manually scan the document for large gaps and remove them
3) Save the file as a plain text document
4) Re-open the file
5) Select-all and copy
6) Paste it into a new window in Kompozer
7) In Kompozer, Select-all and copy
8) Paste this into a new Neo Office document
9) Save this as a Word file
10) Use a conversion program to convert Word to PDB

Isn't there an easier way? All I want is regular old text, one line break between paragraphs, nothing fancy. It seems though that depending on the program originally used to make the text file, there are tabs or special characters used to indicate the line breaks, and I don't see them in Neo Office, but I do once the file is converted. It seems the only way to get "clean" text is to paste it into a web page program, which generates proper paragraph breaks where the line breaks are, and then when I paste that back into Neo Office, everything is fine. But this whole process can take upwards of 15 minutes per book!

I am on a Mac here and I don't have MS software on it. I have Neo Office and Pages. I am willing to buy a new program if need be, but as I am on a Mac, I suspect my options may be limited.

Advice?

I can't help you (in detail) with a Mac, but what you need is a good hex file editor. I use HexEdit 3.10 on a Windows machine. With a good hex file editor, you can open the .txt file and see all the non-displaying control characters (carriage returns, line feeds, tabs and so on). Then you can find/replace and change them to whatever you need, as a batch process. It may take 2 or 3 passes to process the file exactly the way you want it, but it works. And you can make changes this way to any plain-text format with its own control codes (HTML, RTF, etc.).
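
If you would rather script it than do the passes by hand, here is a rough sketch of the same idea in Python (which comes with a Mac). It is only an illustration, not what HexEdit does internally: the function names and the folder layout are made up for the example, and it assumes that "clean" means Unix line feeds, no stray control characters, and exactly one blank line between paragraphs. Adjust the passes to whatever your files actually contain.

```python
import re
from collections import Counter
from pathlib import Path


def control_char_counts(data: bytes) -> Counter:
    """Count the bytes a normal editor hides: everything below 0x20, plus DEL (0x7F)."""
    return Counter(b for b in data if b < 0x20 or b == 0x7F)


def clean_text(data: bytes) -> bytes:
    """Run the find/replace passes on the raw bytes of one text file."""
    # Pass 1: turn Windows (CR LF) and old Mac (lone CR) line endings into LF.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    # Pass 2: drop every remaining control character except tab and LF.
    data = re.sub(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", b"", data)
    # Pass 3: collapse runs of empty or whitespace-only lines into one blank line.
    data = re.sub(rb"\n(?:[ \t]*\n)+", b"\n\n", data)
    return data


def clean_folder(src: Path, dst: Path) -> None:
    """Hypothetical batch helper: clean every .txt file in src and write it to dst."""
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.txt")):
        (dst / path.name).write_bytes(clean_text(path.read_bytes()))
```

Running control_char_counts on a file first tells you which codes are in there (13 is a carriage return, 10 a line feed, 9 a tab), which is the part the hex editor shows you visually. Something like clean_folder(Path("books"), Path("books_clean")) would then process a whole folder at once and leave the originals untouched; both folder names are placeholders.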