MobileRead Forums > E-Book Formats > Workshop

Old 04-27-2009, 03:20 PM   #1
Robotech_Master
Fanatic
Posts: 514
Karma: 2954711
Join Date: May 2006
How to deal with irregular hard-wrapping on a large scale?

Lately I've been writing some columns on TeleRead looking back at writing groups that were cranking out Internet fiction years before the term "e-book" even entered common usage. Here are the first two:

http://www.teleread.org/2009/04/26/supergu/
http://www.teleread.org/2009/04/27/t...-of-netheroes/

These groups (and the others I'll be covering in future entries) have copious archives. At last count, Superguy had over 12.5 million words in its archives. It even has a handy CGI script to retrieve just the archives you want to read.

That quantity of material just cries out for reading on an e-book device rather than sitting at a screen. But it also raises the question: how? It comes from the era of green- or amber-screened monospace ASCII terminals, text-only e-mail and USENET, and hard wrapping at the end of every line. You can't put that through an e-book converter without unwrapping it first somehow.

And to make matters worse, the method of paragraph separation isn't consistent. A lot of writers indent paragraphs with no blank line between them, some separate paragraphs with blank lines, and some do both. Section separators aren't consistent either. (Some of them even use ASCII art or logos, but we needn't worry about those. Also, the writers tend to put two spaces after a period instead of one, following the typographical conventions of monospace fonts.)

A friend whipped up some Perl scripts for me that can unwrap indented or blank-line-separated text, or kill the indents in text that uses both styles. But those rely on the reader being able to run Perl scripts and knowing how (which I do, but not every potential reader would), and they rely on everything you're unwrapping being in the same style.

Anyone know of a simpler solution accessible to more people? (Or, failing that, that can at least be used on mixed-style archives and still unwrap properly?)
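
To make the mixed-style problem concrete, here is a minimal Python sketch (not the Perl scripts mentioned above). It assumes one simple rule: a new paragraph starts after a blank line or at any line that begins with whitespace. The function name `unwrap` is made up for illustration, and the rule will mangle centered headings, ASCII art, and logos.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines into a list of paragraph strings.

    Assumption: a paragraph starts after a blank line OR at a line that
    begins with a space or tab, so indented, blank-line-separated, and
    mixed files all go through the same rule.
    """
    paragraphs = []
    current = []  # stripped lines of the paragraph being collected

    for line in text.splitlines():
        if not line.strip():
            # A blank line closes whatever paragraph is open.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if line[0] in " \t" and current:
            # An indented line closes the previous paragraph, too.
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.strip())

    if current:
        paragraphs.append(" ".join(current))

    # Collapse runs of spaces, such as the two spaces after a period.
    return [re.sub(r" {2,}", " ", p) for p in paragraphs]
```

Calling `unwrap(open(path).read())` on one archive file would give back its paragraphs as a list; whether the result is good enough depends on how far a given writer strays from these two conventions.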
Old 04-27-2009, 04:38 PM   #2
Nate the great
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
I can help you do it. I had about 500 words explaining how, but I lost them due to a bad connection, dammit. Here is enough to give you an idea:

Steps:
1. Eyeball enough files to identify all the formatting styles.

(1a. Write the programs.)

2. Run all the files through an identifier program (I can write it for you). You get back a copy of each file with new text added to the first few lines, which identifies the style of that file.

3. Run all the copies through a cleaner program, which performs the specific actions needed to fix each style.

Preferred tools: jflex (or flex).

If I did it, I would take the information learned in step 1 and write regular expressions to define each detail. I would then use jflex to generate the actual source code for the programs in steps 2 and 3.
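
As a rough illustration of the identifier in step 2, here is a sketch in Python using its `re` module rather than jflex. Everything named here is an assumption made for the example: the three style labels, the `#STYLE:` marker line, and the 90% `dominance` threshold are not from any existing tool.

```python
from collections import Counter
import re

# A line that starts with spaces or tabs and then has some text.
INDENT_RE = re.compile(r"^[ \t]+\S")


def count_paragraph_starts(text):
    """Count how paragraphs begin: 'indent', 'blank', or 'both'."""
    counts = Counter()
    prev_blank = False
    seen_text = False  # the first text line tells us nothing, so skip it

    for line in text.splitlines():
        if not line.strip():
            prev_blank = True
            continue
        indented = bool(INDENT_RE.match(line))
        if seen_text:
            if prev_blank and indented:
                counts["both"] += 1
            elif prev_blank:
                counts["blank"] += 1
            elif indented:
                counts["indent"] += 1
        seen_text = True
        prev_blank = False
    return counts


def identify_style(text, dominance=0.9):
    """Return the dominant style, or 'mixed' / 'unknown'."""
    counts = count_paragraph_starts(text)
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    style, n = counts.most_common(1)[0]
    return style if n / total >= dominance else "mixed"


def tag(text):
    """Prepend a (hypothetical) marker line naming the detected style."""
    return f"#STYLE: {identify_style(text)}\n{text}"
```

A cleaner for step 3 could then read the first line of each tagged copy and pick its fix accordingly. Files that come back as "mixed" are the ones that would need a closer look by eye.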


P.S. If you want to be really adventurous, you could combine the latter two steps by writing and running a yacc/jflex parser. It would be a lot more work, though.

P.P.S. I could do this for you. Recently I've been doing something very similar to this. The World Fact eBook I converted required several runs through various cleanup programs in order to remove the excess web formatting.
Old 04-27-2009, 04:45 PM   #3
Robotech_Master
If you wanted to, that would be great, but I predict you would have a problem with step 1.

22 years and 13 million words of archives? Literally dozens of different people posting, with no guarantee each one consistently used the same styles throughout? Even if I knew how to program, I doubt I'd have the nerve to try it. :P
Old 04-27-2009, 05:00 PM   #4
Nate the great
Where are the files, and who is maintaining them? I will need that person's email address.

I'll take a look. If I accomplish anything I'll let you know.
Old 04-27-2009, 05:05 PM   #5
Robotech_Master
The archive files are located at

http://archives.eyrie.org/superguy/ or ftp://ftp.eyrie.org/pub/superguy/

The script that allows automated viewing/retrieval of the archives is at

http://www.eyrie.org/cgi-bin/autocollect.cgi

The fellow who maintains the archive would be Russ Allbery, sysadmin of Eyrie.org, rra@eyrie.org.
Old 04-27-2009, 05:22 PM   #6
kovidgoyal
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
As long as you're not looking for perfection in the results, this should be rather easy to do. You've identified only a handful of different wrapping styles, so a program to convert them to HTML should be doable in about 1000 lines of Python. Note that I'm not volunteering, as all my development time is soaked up by calibre.
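
For a sense of what the HTML end of such a program might look like, here is a small sketch using only the standard library. It assumes the text has already been split into one string per paragraph; the function name `paragraphs_to_html` and the bare-bones page layout are invented for the example, and a real converter would also have to deal with section breaks and headers.

```python
from html import escape


def paragraphs_to_html(paragraphs, title="Untitled"):
    """Wrap already-unwrapped paragraphs in a minimal HTML page.

    escape() protects characters such as &, < and >, which turn up in
    old plain-text posts (for example emoticons or <grin> tags).
    """
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<html>\n<head>\n"
        f"<title>{escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )
```

The output is plain enough that any e-book converter accepting HTML input should be able to take it from there.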
Old 04-27-2009, 05:41 PM   #7
Nate the great
Quote:
Originally Posted by Robotech_Master

The fellow who maintains the archive would be Russ Allbery, sysadmin of Eyrie.org, rra@eyrie.org.

The email isn't valid.
Old 04-27-2009, 08:06 PM   #8
Nate the great
I contacted Russ and got permission to spider the archive. There are 1,320 uncompressed text files, taking up 77 MB.

All times are GMT -4.