04-27-2009, 03:20 PM | #1 |
Fanatic
Posts: 514
Karma: 2954711
Join Date: May 2006
|
How to deal with irregular hard-wrapping on a large scale?
Lately I've been writing some columns on TeleRead looking back at writing groups that were cranking out Internet fiction years before the term "e-book" even entered common usage. Here are the first two:
http://www.teleread.org/2009/04/26/supergu/
http://www.teleread.org/2009/04/27/t...-of-netheroes/

These groups (and the others I'll be covering in future entries) have copious archives. At last count, Superguy had over 12½ million words in its archives. It even has a handy CGI script to retrieve just those archives which need reading. That quantity of material just cries out for reading on an e-book device, rather than sitting at the screen.

But it also raises the question: how? It comes from the era of green- or amber-screened monospace ASCII terminals, text-only e-mail and USENET, and hard wrapping at the end of every line. You can't put that through an e-book converter without unwrapping it first somehow. And to make matters worse, the method of paragraph separation isn't consistent. Some writers indent each paragraph with no blank line between them, some separate paragraphs with blank lines, and some use both. Section separations aren't consistent either. (Some writers even use ASCII art or logos, but we needn't worry about those. They also tend to put two spaces after a period instead of one, per the typographical conventions of monospace fonts.)

A friend whipped up some Perl scripts for me that can unwrap indent-style or blank-line-style text, or kill the indents in something that uses both styles. But those rely on knowing how to run Perl scripts (which I do, but not every potential reader would), and they rely on everything you're unwrapping being in the same style. Anyone know of a simpler solution accessible to more people? (Or, failing that, one that can at least be used on mixed-style archives and still unwrap properly?)
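For the curious, the kind of unwrapping those Perl scripts do can be sketched in a few lines of Python. This is only an illustration, not the scripts' actual logic; the heuristics (blank line or leading indent means a paragraph break) are assumptions about the archive's styles:

```python
import re

def unwrap(text):
    """Re-flow hard-wrapped ASCII text into one line per paragraph.

    Treats a blank line OR a leading indent as a paragraph break, so it
    copes with archives that mix both separation styles.  Also collapses
    the monospace-era two-spaces-after-a-period down to one space.
    """
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:                          # blank line: paragraph break
            if current:
                paragraphs.append(" ".join(current))
                current = []
        elif line[:1] in (" ", "\t") and current:  # indent: new paragraph starts
            paragraphs.append(" ".join(current))
            current = [stripped]
        else:                                      # continuation of current para
            current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(re.sub(r"(?<=[.!?])  +", " ", p) for p in paragraphs)
```

A one-size-fits-all heuristic like this will still misfire on ASCII art, signatures, and deliberate indentation (verse, quoted text), which is exactly why per-style handling keeps coming up in this thread.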
04-27-2009, 04:38 PM | #2 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
I can help you do it. I had about 500 words explaining how but I lost it due to a bad connection, dammit. Here is enough to give you an idea:
Steps:
1. Eyeball enough files to identify all the formatting styles.
1a. Write the programs.
2. Run all the files through an identifier program (I can write it for you); you get back a copy of each file with new text added to the first few lines, which IDs the style of that file.
3. Run all the copies through a cleaner program which performs the specific fixes for a given style.

Preferred tools: JFlex (or flex). If I did it, I would take the information learned in step 1 and write regular expressions to define each detail, then use JFlex to generate the actual source code for the programs in steps 2 and 3.

P.S. If you want to be really adventurous, you could combine the latter two steps by writing and running a yacc/JFlex parser. It would be a lot more work, though.

P.P.S. I could do this for you. I've recently been doing something very similar: the World Fact eBook I converted required several runs through various cleanup programs to remove the excess web formatting.
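The step-2 identifier could be sketched roughly like this in Python (the actual plan above uses JFlex; the function name, the three style labels, and the counting heuristic are all illustrative guesses, not the real program):

```python
def identify_style(text):
    """Guess the paragraph-separation style of one archive file.

    Returns 'indent', 'blank', or 'mixed' -- the tag that step 2 would
    stamp into the first few lines of the copy, so the step-3 cleaner
    knows which set of fixes to apply.
    """
    lines = text.splitlines()
    blank_breaks = sum(1 for ln in lines if not ln.strip())
    indent_breaks = sum(1 for ln in lines
                        if ln[:1] in (" ", "\t") and ln.strip())
    if blank_breaks and indent_breaks:
        return "mixed"
    if indent_breaks:
        return "indent"
    return "blank"
```

A real identifier would look at more signals than raw line counts (signature blocks, headers, ASCII-art runs), which is what the eyeballing in step 1 is for.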
04-27-2009, 04:45 PM | #3 |
Fanatic
Posts: 514
Karma: 2954711
Join Date: May 2006
|
If you wanted to, that would be great, but I predict you would have a problem with step 1.
22 years and 13 million words of archives? Literally dozens of different people posting, with no guarantee each one consistently used the same styles throughout? Even if I knew how to program, I doubt I'd have the nerve to try it. :P
04-27-2009, 05:00 PM | #4 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
Where are the files, and who is maintaining them? I will need that person's email address.
I'll take a look. If I accomplish anything I'll let you know.
04-27-2009, 05:05 PM | #5 |
Fanatic
Posts: 514
Karma: 2954711
Join Date: May 2006
The archive files are located at
http://archives.eyrie.org/superguy/
or
ftp://ftp.eyrie.org/pub/superguy/

The script that allows automated viewing/retrieval of the archives is at
http://www.eyrie.org/cgi-bin/autocollect.cgi

The fellow who maintains the archive is Russ Allbery, sysadmin of Eyrie.org, rra@eyrie.org.
04-27-2009, 05:22 PM | #6 |
creator of calibre
Posts: 44,348
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
As long as you're not looking for perfection in the results, this should be rather easy to do. You've identified only a handful of different wrapping styles; a program to convert them to HTML should be doable in about 1000 lines of Python. Note that I'm not volunteering, as all my development time is soaked up by calibre.
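The HTML-emitting end of such a converter is the trivial part; a minimal sketch (illustrative only; a real converter would also detect episode titles, section breaks, and the various wrap styles) might look like:

```python
import html

def paragraphs_to_html(paragraphs, title="Untitled"):
    """Wrap a list of already-unwrapped paragraphs in a minimal HTML page.

    Assumes each element of `paragraphs` is one complete, re-flowed
    paragraph; everything is escaped so stray ASCII like <g> survives.
    """
    body = "\n".join("<p>%s</p>" % html.escape(p) for p in paragraphs)
    return ("<html><head><title>%s</title></head>\n<body>\n%s\n</body></html>"
            % (html.escape(title), body))
```

The remaining ~990 lines of the hypothetical 1000 would go into the style detection and unwrapping discussed above, not the HTML.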
04-27-2009, 05:41 PM | #7 | |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
04-27-2009, 08:06 PM | #8 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
I contacted Russ and got permission to spider the archive. There are 1,320 uncompressed text files, taking up 77MB.