Text tool for formatting Gutenberg text files


bob_ninja
11-12-2007, 03:15 PM
a.k.a.

What "Cleaning Up" Do Project Gutenberg Texts Need Part 2

Here is the download link for the

Txt4EBook tool (http://www.dekksoft.com/text_tools/software/txt4ebook.jar)

The tool is written in Java, so you'll need the latest Java 6 release installed on your system. For downloads and more info, go to

Sun's Java download page (http://java.sun.com/javase/downloads/?intcmp=1281)

The program file is already configured to run, as long as Java is installed on the OS. In general, that means you can start it either:

1) simply by double-clicking on the program file txt4ebook.jar icon in a GUI file manager

2) or by running this command in a console:

java -jar txt4ebook.jar

Either method should work on most machines. If you have problems, consult Sun's help pages. Again, you need the latest VERSION 6 of Java!!!

I only created it the other week, so it doesn't even have a version number yet. I'll try to incorporate more functionality based on your comments, but don't expect too much. It is only a side project for me, and my time is limited.

Its primary goal is simply to process a text file, not to convert it to a more advanced format like HTML, so my goals are very modest. The aim is to do whatever processing is necessary to prepare a Gutenberg text file for a reader device (including text-to-speech software). That means simple manipulations.

Still, I will include the ability to add custom-defined manipulations so that you can process ANY text file for ANY purpose (keeping the processor more or less general purpose). However, the defaults are preset for Gutenberg text based on my preferences. At some point I'll try to add other presets and/or the ability to load/save user preferences.

Anyway, so much for now. This version simply reformats paragraphs by removing the extra line breaks inside them. There is also an optional paragraph indentation setting. Next I'll add tab processing and custom regular expression filters (for removing things like Page XXX).
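
For anyone curious what "removing extra line breaks" amounts to, here is a minimal Java sketch. It is not the Txt4EBook source: the class name ParagraphUnwrapper and the method unwrap are made up for illustration, and it assumes paragraphs are separated by one or more blank lines. It joins every block it finds and makes no attempt to skip indented blocks or short sections.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and behavior are assumptions, not the real tool.
class ParagraphUnwrapper {

    /**
     * Joins the hard-wrapped lines of each paragraph into a single line.
     * Paragraphs are assumed to be separated by one or more blank lines.
     * Each output paragraph is prefixed with 'indent' (pass "" for none).
     */
    static String unwrap(String text, String indent) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\r?\\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // Blank line: close the paragraph collected so far, if any.
                if (current.length() > 0) {
                    paragraphs.add(indent + current);
                    current.setLength(0);
                }
            } else {
                // Non-blank line: append it to the current paragraph.
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        // Flush the last paragraph if the text did not end with a blank line.
        if (current.length() > 0) {
            paragraphs.add(indent + current);
        }
        return String.join("\n\n", paragraphs);
    }

    public static void main(String[] args) {
        String sample = "First line of a paragraph\nwrapped onto a second line.\n\n"
                + "Another paragraph,\nalso wrapped.";
        System.out.println(unwrap(sample, "    "));
    }
}
```

Passing an empty string as the indent gives plain unwrapped paragraphs; the four-space indent in main is just an example value.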

I hope you find it useful.

P.S.: I am using the latest Cybook reader, so the default settings are geared toward it.

RWood
11-12-2007, 08:30 PM
Very interesting.

I used it with a rather large PG file and it did rather well for the most part. It still missed converting many sections of block text. While it would not combine text blocks that had indents as the first few characters, the ones I mention were flush left without indentation.

One of the first features you need to add is an option to change the name of the output file rather than assume that we all want it put back on top of the original file. (As you said, this is your setting.)

bob_ninja
11-12-2007, 10:27 PM
Here is an example of what I believe you are referring to:

As a matter of fact, Mr. Bright left roughly speaking about one-fifth of
the whole Diary still unprinted, although he transcribed the whole, and
bequeathed his transcript to Magdalene College.

Please see the "General" tab, "Minimum paragraph length" setting.
The default value is 300 characters, which at 80 characters per line is almost 4 lines. The tool therefore decided that the section above is not really a paragraph and didn't process it. At some point I could/should add a more sophisticated semantic analyzer that is smarter at distinguishing paragraphs from other sections. For now this simple check will have to do. So try reducing the value. Perhaps I should use a smaller default.
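
To make the rule concrete, here is a rough Java sketch of a minimum-length check. It is only an illustration, not the actual Txt4EBook code: the names ParagraphFilter, isParagraph and process are invented, and it assumes the length is counted over the whole block including its line breaks, with "at least" as the comparison. The Mr. Bright passage above is under 200 characters, so it is left alone at 300 and joined at a lower value such as 100.

```java
// Hypothetical sketch; how the real tool measures length is an assumption.
class ParagraphFilter {

    // Default taken from the post: 300 characters.
    static final int DEFAULT_MIN_LENGTH = 300;

    /** True if the block is long enough to be treated as a paragraph. */
    static boolean isParagraph(String block, int minLength) {
        return block.length() >= minLength;
    }

    /**
     * Joins the lines of a block only when it passes the length check;
     * shorter blocks (headers, short notes) are returned unchanged.
     */
    static String process(String block, int minLength) {
        if (!isParagraph(block, minLength)) {
            return block;
        }
        // Replace each line break (plus surrounding whitespace) with one space.
        return block.replaceAll("\\s*\\n\\s*", " ").trim();
    }

    public static void main(String[] args) {
        String block = "As a matter of fact, Mr. Bright left roughly speaking about one-fifth of\n"
                + "the whole Diary still unprinted, although he transcribed the whole, and\n"
                + "bequeathed his transcript to Magdalene College.";
        System.out.println(process(block, DEFAULT_MIN_LENGTH)); // left as three lines
        System.out.println(process(block, 100));                // joined into one line
    }
}
```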

Here are some examples of sections that are NOT paragraphs and should not be processed:

Release Date: November, 2004 [EBook #6933]
[Yes, we are more than one year ahead of schedule]
[This file was first posted on February 13, 2003]


These lines are separate and shorter than 300 characters, so they remain unchanged.

CHAPTER III.

1632, 1633.

PAUL LE JEUNE.

Le Jeune's Voyage.--His First Pupils.--His Studies.--
His Indian Teacher.--Winter at the Mission-house.--
Le Jeune's School.--Reinforcements.

A chapter TOC, again not a paragraph, although one could argue that it is a paragraph of sorts that should be processed.

CHAPTER XII.

1639, 1640.

THE TOBACCO NATION.--THE NEUTRALS.

A Change of Plan.--Sainte Marie.--Mission of the Tobacco Nation.--
Winter Journeying.--Reception of the Missionaries.--
Superstitious Terrors.--Peril of Garnier and Jogues.--
Mission of the Neutrals.--Huron Intrigues.--Miracles.--
Fury of the Indians.--Intervention of Saint Michael.--
Return to Sainte Marie.--Intrepidity of the Priests.--
Their Mental Exaltation.

Another TOC, but this one is processed because it is longer than 300 characters. So my simple rule is not very smart/effective. Like I said, it will do for now.

So, in summary: reduce the minimum paragraph length if you find that some paragraphs you want processed are being skipped.

RWood
11-12-2007, 11:42 PM
Thanks for the pointers. My editors don't show backup copies so I missed the backup that your program made. (I also had another copy just in case.)

It does go a long way. I have historically used Stingo's Word Macro which only reacts to double <CR>s.

I will work with it more. I see a lot of potential there.

bob_ninja
11-13-2007, 07:47 AM
The program creates a backup. Also note the File - Undo Processing command, which reverts to the original if you don't like the result.
Still, I'll add an option for a different output file.

kovidgoyal
11-13-2007, 12:28 PM
From a technical perspective, I haven't looked at your code, but I suggest you create an internal object model of the txt file so that it becomes easy to support different output formats in the future. It will be a little slower, but I think it's worth it.
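
A rough idea of what I mean, as a Java sketch. All names here (Block, Renderer, PlainTextRenderer, HtmlRenderer) are hypothetical and nothing is taken from the actual tool; it only assumes that a parser has already split the file into headings and paragraphs. The point is that each output format becomes one small class working on the same model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical object model: one Block per heading or paragraph.
class Block {
    enum Kind { HEADING, PARAGRAPH }

    final Kind kind;
    final String text;

    Block(Kind kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

// Each output format implements this one method.
interface Renderer {
    String render(List<Block> document);
}

// Plain text: blocks separated by a blank line.
class PlainTextRenderer implements Renderer {
    public String render(List<Block> document) {
        List<String> parts = new ArrayList<>();
        for (Block b : document) {
            parts.add(b.text);
        }
        return String.join("\n\n", parts);
    }
}

// HTML: headings become <h2>, paragraphs become <p>.
class HtmlRenderer implements Renderer {
    public String render(List<Block> document) {
        StringBuilder out = new StringBuilder();
        for (Block b : document) {
            String tag = (b.kind == Block.Kind.HEADING) ? "h2" : "p";
            out.append('<').append(tag).append('>')
               .append(escape(b.text))
               .append("</").append(tag).append(">\n");
        }
        return out.toString();
    }

    // Escapes the characters that are special in HTML text content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

class ObjectModelDemo {
    public static void main(String[] args) {
        List<Block> doc = new ArrayList<>();
        doc.add(new Block(Block.Kind.HEADING, "CHAPTER III."));
        doc.add(new Block(Block.Kind.PARAGRAPH, "Le Jeune's Voyage and his first pupils."));
        System.out.println(new PlainTextRenderer().render(doc));
        System.out.println(new HtmlRenderer().render(doc));
    }
}
```

The extra cost is building the list of blocks before writing anything, which is the "little slower" part.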