04-27-2009, 03:20 PM | #1 |
Fanatic
Posts: 514
Karma: 2954711
Join Date: May 2006
|
How to deal with irregular hard-wrapping on a large scale?
Lately I've been writing some columns on TeleRead looking back at writing groups that were cranking out Internet fiction years before the term "e-book" even entered common usage. Here are the first two:
http://www.teleread.org/2009/04/26/supergu/
http://www.teleread.org/2009/04/27/t...-of-netheroes/

These groups (and the others I'll be covering in future entries) have copious archives. At last count, Superguy had over 12½ million words in its archives. It even has a handy CGI script to retrieve just those archives which need reading. That quantity of material just cries out for reading on an e-book device, rather than sitting at the screen.

But it also raises the question: how? It comes from the era of green- or amber-screened monospace ASCII terminals, text-only e-mail and USENET, and hard wrapping at the end of every line. You can't put that through an e-book converter without unwrapping it first somehow. And to make matters worse, the method of paragraph separation isn't consistent. Some writers indent each paragraph with no blank line between them, some separate paragraphs with blank lines, and some use both. Section separations aren't consistent either. (Some writers even use ASCII art or logos, but we needn't worry about those. They also tend to put two spaces after a period instead of one, per the typographical conventions of monospace fonts.)

A friend whipped up some Perl scripts for me that can unwrap indent-style or blank-line-style text, or kill the indents in something that uses both styles. But those rely on knowing how to run Perl scripts (which I do, but not every potential reader would), and they rely on everything you're unwrapping being in the same style. Anyone know of a simpler solution accessible to more people? (Or, failing that, one that can at least be used on mixed-style archives and still unwrap properly?)
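For the curious, the kind of unwrapping those Perl scripts do can be sketched in a few lines of Python. This is only an illustration, not the scripts' actual logic; the heuristics (blank line or leading indent means a paragraph break) are assumptions about the archive's styles:

```python
import re

def unwrap(text):
    """Re-flow hard-wrapped ASCII text into one line per paragraph.

    Treats a blank line OR a leading indent as a paragraph break, so it
    copes with archives that mix both separation styles.  Also collapses
    the monospace-era two-spaces-after-a-period down to one space.
    """
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:                          # blank line: paragraph break
            if current:
                paragraphs.append(" ".join(current))
                current = []
        elif line[:1] in (" ", "\t") and current:  # indent: new paragraph starts
            paragraphs.append(" ".join(current))
            current = [stripped]
        else:                                      # continuation of current para
            current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(re.sub(r"(?<=[.!?])  +", " ", p) for p in paragraphs)
```

A one-size-fits-all heuristic like this will still misfire on ASCII art, signatures, and deliberate indentation (verse, quoted text), which is exactly why per-style handling keeps coming up in this thread.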
04-27-2009, 04:38 PM | #2 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
I can help you do it. I had about 500 words explaining how but I lost it due to a bad connection, dammit. Here is enough to give you an idea:
Steps:
1. Eyeball enough files to identify all the formatting styles.
1a. Write the programs.
2. Run all the files through an identifier program (I can write it for you); you get back a copy of each file with new text added to the first few lines, which IDs the style of that file.
3. Run all the copies through a cleaner program which performs the specific fixes for a given style.

Preferred tools: JFlex (or flex). If I did it, I would take the information learned in step 1 and write regular expressions to define each detail, then use JFlex to generate the actual source code for the programs in steps 2 and 3.

P.S. If you want to be really adventurous, you could combine the latter two steps by writing and running a yacc/JFlex parser. It would be a lot more work, though.

P.P.S. I could do this for you. I've recently been doing something very similar: the World Fact eBook I converted required several runs through various cleanup programs to remove the excess web formatting.
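The step-2 identifier could be sketched roughly like this in Python (the actual plan above uses JFlex; the function name, the three style labels, and the counting heuristic are all illustrative guesses, not the real program):

```python
def identify_style(text):
    """Guess the paragraph-separation style of one archive file.

    Returns 'indent', 'blank', or 'mixed' -- the tag that step 2 would
    stamp into the first few lines of the copy, so the step-3 cleaner
    knows which set of fixes to apply.
    """
    lines = text.splitlines()
    blank_breaks = sum(1 for ln in lines if not ln.strip())
    indent_breaks = sum(1 for ln in lines
                        if ln[:1] in (" ", "\t") and ln.strip())
    if blank_breaks and indent_breaks:
        return "mixed"
    if indent_breaks:
        return "indent"
    return "blank"
```

A real identifier would look at more signals than raw line counts (signature blocks, headers, ASCII-art runs), which is what the eyeballing in step 1 is for.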
04-27-2009, 04:45 PM | #3 |
Fanatic
Posts: 514
Karma: 2954711
Join Date: May 2006
|
If you wanted to, that would be great, but I predict you would have a problem with step 1.
22 years and 13 million words of archives? Literally dozens of different people posting, with no guarantee each one consistently used the same styles throughout? Even if I knew how to program, I doubt I'd have the nerve to try it. :P
04-27-2009, 05:00 PM | #4 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
Where are the files, and who is maintaining them? I will need that person's email address.
I'll take a look. If I accomplish anything I'll let you know.
04-27-2009, 05:05 PM | #5 |
Fanatic
Posts: 514
Karma: 2954711
Join Date: May 2006
The archive files are located at
http://archives.eyrie.org/superguy/
or
ftp://ftp.eyrie.org/pub/superguy/

The script that allows automated viewing/retrieval of the archives is at
http://www.eyrie.org/cgi-bin/autocollect.cgi

The fellow who maintains the archive is Russ Allbery, sysadmin of Eyrie.org, rra@eyrie.org.
04-27-2009, 05:22 PM | #6 |
creator of calibre
Posts: 44,348
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
As long as you're not looking for perfection in the results, this should be rather easy to do. You've identified only a handful of different wrapping styles; a program to convert them to HTML should be doable in about 1000 lines of Python. Note that I'm not volunteering, as all my development time is soaked up by calibre.
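The HTML-emitting end of such a converter is the trivial part; a minimal sketch (illustrative only; a real converter would also detect episode titles, section breaks, and the various wrap styles) might look like:

```python
import html

def paragraphs_to_html(paragraphs, title="Untitled"):
    """Wrap a list of already-unwrapped paragraphs in a minimal HTML page.

    Assumes each element of `paragraphs` is one complete, re-flowed
    paragraph; everything is escaped so stray ASCII like <g> survives.
    """
    body = "\n".join("<p>%s</p>" % html.escape(p) for p in paragraphs)
    return ("<html><head><title>%s</title></head>\n<body>\n%s\n</body></html>"
            % (html.escape(title), body))
```

The remaining ~990 lines of the hypothetical 1000 would go into the style detection and unwrapping discussed above, not the HTML.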
04-27-2009, 05:41 PM | #7 | |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
04-27-2009, 08:06 PM | #8 |
Sir Penguin of Edinburgh
Posts: 12,375
Karma: 23555235
Join Date: Apr 2007
Location: DC Metro area
Device: Shake a stick plus 1
|
I contacted Russ and got permission to spider the archive. There are 1,320 uncompressed text files, taking up 77MB.