Old 11-12-2007, 03:15 PM   #1
bob_ninja
Text tool for formatting Gutenberg text files

a.k.a.

What "Cleaning Up" Do Project Gutenberg Texts Need Part 2

Here is the download link for the Txt4EBook tool.

The tool is written in Java, so you'll need the latest Java 6 runtime installed on your system. For downloads and more info, go to Sun's Java download page.

The program file is already configured to run as long as Java is installed on the OS. In general, that means you can start it in either of two ways:

1) simply by double-clicking on the program file txt4ebook.jar icon in a GUI file manager

2) using the command in a console:

java -jar txt4ebook.jar

Either method should work on most machines. If you have problems, consult Sun's help pages. Again, you need the latest VERSION 6 of Java!!!

I only created it the other week, so it doesn't even have a version number yet. I'll try to incorporate more functionality based on your comments, but don't expect too much. It is only a side project for me, and my time is limited.

Its primary goal is simply to process a text file, not to convert it to a more advanced format like HTML. So my goals are very modest: do whatever processing is necessary to prepare a Gutenberg text file for a reader device (including text-to-voice reading software). That means simple manipulations.

Still, I will include the ability to add custom-defined manipulations so that you can process ANY text file for ANY purpose (keeping the processor more or less general purpose). However, the defaults are preset for Gutenberg text based on my preferences. At some point I'll try to add other presets and/or the ability to load/save user preferences.

Anyway, so much for now. This version simply formats a paragraph's lines by removing extra line breaks. There is also an optional paragraph indentation setting. Next I'll add tab processing and custom regular expression filters (for removing things like Page XXX).
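For readers who want to see what "removing extra line breaks" amounts to, here is a minimal sketch in Python. It is not the tool's actual Java code; the function name `unwrap_paragraphs`, the `indent` parameter, and the rule that paragraphs are separated by blank lines are all assumptions made for illustration.

```python
import re


def unwrap_paragraphs(text: str, indent: str = "") -> str:
    """Join the hard-wrapped lines of each paragraph into one line.

    Assumption: paragraphs are separated by one or more blank lines.
    `indent` is an optional prefix placed before every paragraph.
    """
    # Split on runs of blank lines (whitespace-only lines count as blank).
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse the line breaks inside the block into single spaces.
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        paragraphs.append(indent + joined)
    # Keep exactly one blank line between paragraphs.
    return "\n\n".join(paragraphs) + "\n"
```

Calling `unwrap_paragraphs(text, indent="    ")` would give each paragraph a four-space first-line indent; leaving `indent` empty keeps paragraphs flush left.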

I hope you find it useful.

P.S.: I am using the latest Cybook reader, so default settings are geared for it.

Last edited by bob_ninja; 11-12-2007 at 03:18 PM.
Old 11-12-2007, 08:30 PM   #2
RWood
Very interesting.

I used it with a rather large PG file and it did rather well for the most part. It still missed converting many sections of block text. While it would not combine text blocks that had indents as the first few characters, the ones I mention were flush left without indentation.

One of the first features you need to add is an option to change the name of the output file rather than assume that we all want it put back on top of the original file. (As you said, this is your setting.)
Old 11-12-2007, 10:27 PM   #3
bob_ninja
Here is an example of what I believe you are referring to:

Quote:
As a matter of fact, Mr. Bright left roughly speaking about one-fifth of
the whole Diary still unprinted, although he transcribed the whole, and
bequeathed his transcript to Magdalene College.
Please see the "General" tab, "Minimum paragraph length" setting.
The default value is 300 characters, which is almost 4 lines at 80 characters per line. So the tool decided that the section above is not really a paragraph and didn't process it. At some point I could/should add a more sophisticated semantic analyzer that would be smarter at distinguishing paragraphs from other sections. For now this simple check will have to do. So try reducing the value to a lower number. Perhaps I should use a smaller default.

Here are some examples of sections that are NOT paragraphs and should not be processed:

Quote:
Release Date: November, 2004 [EBook #6933]
[Yes, we are more than one year ahead of schedule]
[This file was first posted on February 13, 2003]
These lines are separate and shorter than 300 characters, so they remain unchanged.

Quote:
CHAPTER III.

1632, 1633.

PAUL LE JEUNE.

Le Jeune's Voyage.--His First Pupils.--His Studies.--
His Indian Teacher.--Winter at the Mission-house.--
Le Jeune's School.--Reinforcements.
A chapter TOC; again, not a paragraph, although one could argue that it is a paragraph of sorts that should be processed.

Quote:
CHAPTER XII.

1639, 1640.

THE TOBACCO NATION.--THE NEUTRALS.

A Change of Plan.--Sainte Marie.--Mission of the Tobacco Nation.--
Winter Journeying.--Reception of the Missionaries.--
Superstitious Terrors.--Peril of Garnier and Jogues.--
Mission of the Neutrals.--Huron Intrigues.--Miracles.--
Fury of the Indians.--Intervention of Saint Michael.--
Return to Sainte Marie.--Intrepidity of the Priests.--
Their Mental Exaltation.
Another TOC, but this one is processed because it is longer than 300 characters. So my simple rule is not very smart/effective. Like I said, it will do for now.

So, in summary: reduce the minimum paragraph length if some paragraphs that you want processed are being skipped.
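The length check described in this post can be sketched roughly as follows. This is a Python illustration, not the tool's Java code. The names `should_unwrap` and `process_block` are hypothetical, and whether the real tool counts line breaks or uses "at least" rather than "more than" is an assumption.

```python
def should_unwrap(block: str, min_length: int = 300) -> bool:
    """Treat a block as a paragraph only if it has at least `min_length` characters.

    Assumption: the length includes the line breaks inside the block.
    """
    return len(block) >= min_length


def process_block(block: str, min_length: int = 300) -> str:
    if should_unwrap(block, min_length):
        # Long enough to count as a paragraph: join its lines.
        return " ".join(line.strip() for line in block.splitlines())
    return block  # too short: leave the line breaks alone


# The three-line passage quoted earlier in this post.
diary = (
    "As a matter of fact, Mr. Bright left roughly speaking about one-fifth of\n"
    "the whole Diary still unprinted, although he transcribed the whole, and\n"
    "bequeathed his transcript to Magdalene College."
)
```

With the default of 300, `process_block(diary)` returns the passage untouched because it is shorter than the threshold. With `min_length=100` the same passage is joined into a single line, which is the effect of lowering the setting.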
Old 11-12-2007, 11:42 PM   #4
RWood
Thanks for the pointers. My editors don't show backup copies so I missed the backup that your program made. (I also had another copy just in case.)

It does go a long way. I have historically used Stingo's Word Macro which only reacts to double <CR>s.

I will work with it more. I see a lot of potential there.
Old 11-13-2007, 07:47 AM   #5
bob_ninja
Quote:
Originally Posted by RWood
Thanks for the pointers. My editors don't show backup copies so I missed the backup that your program made. (I also had another copy just in case.)
It creates a backup. Also note the File - Undo Processing command, which reverts to the original if you don't like the result.
Still, I'll add an option for a different output file.
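The backup, undo, and separate-output-file behaviours discussed here could look something like the sketch below. It is a Python illustration under stated assumptions, not the tool's implementation: the `.bak` suffix, the UTF-8 encoding, and the names `process_file` and `undo_processing` are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def process_file(
    source: Path,
    transform: Callable[[str], str],
    output: Optional[Path] = None,
) -> Path:
    """Run `transform` over `source` and return the path that was written.

    With `output` set, the original file is left untouched. Without it, the
    original is copied to "<name>.bak" first and then overwritten in place.
    """
    text = source.read_text(encoding="utf-8")  # assumed encoding
    if output is None:
        backup = source.with_name(source.name + ".bak")  # hypothetical naming
        shutil.copy2(source, backup)  # keep a copy that an undo can restore
        output = source
    output.write_text(transform(text), encoding="utf-8")
    return output


def undo_processing(source: Path) -> None:
    """Restore `source` from the "<name>.bak" copy made by process_file."""
    backup = source.with_name(source.name + ".bak")
    shutil.copy2(backup, source)
```

Passing an explicit `output` path covers the request for a different output file; omitting it gives the overwrite-with-backup behaviour, and `undo_processing` plays the role of an undo command.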
Old 11-13-2007, 12:28 PM   #6
kovidgoyal
From a technical perspective, I haven't looked at your code, but I suggest you create an internal object model of the txt file so that it becomes easy to support different output formats in the future. It will be a little slower, but I think it's worth it.
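As a rough picture of what such an internal object model might look like, here is a small Python sketch. Every name in it (`Block`, `Document`, `parse`, `to_text`, `to_html`) is hypothetical, and the paragraph test simply reuses the minimum-length rule from earlier in the thread. It is one possible design, not the tool's code.

```python
import html
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    kind: str  # "paragraph", or "literal" for text whose line breaks are kept
    lines: List[str]


@dataclass
class Document:
    blocks: List[Block] = field(default_factory=list)


def parse(text: str, min_length: int = 300) -> Document:
    """Build the model once; assumes blocks are separated by one blank line."""
    doc = Document()
    for raw in text.strip().split("\n\n"):
        kind = "paragraph" if len(raw) >= min_length else "literal"
        doc.blocks.append(Block(kind, raw.splitlines()))
    return doc


def to_text(doc: Document) -> str:
    """Plain-text writer: paragraphs are joined, literal blocks keep their lines."""
    parts = []
    for b in doc.blocks:
        parts.append(" ".join(b.lines) if b.kind == "paragraph" else "\n".join(b.lines))
    return "\n\n".join(parts) + "\n"


def to_html(doc: Document) -> str:
    """HTML writer built on the same model, with no re-parsing needed."""
    parts = []
    for b in doc.blocks:
        if b.kind == "paragraph":
            parts.append("<p>" + html.escape(" ".join(b.lines)) + "</p>")
        else:
            parts.append("<pre>" + html.escape("\n".join(b.lines)) + "</pre>")
    return "\n".join(parts) + "\n"
```

The point of the design is that parsing happens once and each output format is just another function over `Document`, so adding a format later does not touch the paragraph-detection logic.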