Old 05-16-2009, 02:12 AM   #6
rogue_ronin
Banned
rogue_ronin has learned how to read e-books
 
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
The problem is the markup, or lack thereof. The lack of information is hard to undo.

Many, if not most, of the source files that I find are plain text that someone hurriedly converted to basic HTML, with little to no effort to provide accurate markup, or with their personal quirks embedded. (I just looked at two files that had a blank paragraph following every paragraph -- why?! And I only realized that after de-crufting the files.) Often it's just text wrapped in <PRE> tags, left with unjoined paragraphs, or encrufted by the dreaded Word. There has been no effort to mark chapters, italics, bold, titles, subtitles, images, maps, tables, etc. There's rarely, if ever, any meta-information.
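
To make the "blank paragraph after every paragraph" case concrete, here is a minimal Python sketch of one de-crufting step. The function name and the definition of "blank" (only whitespace, &nbsp; entities, or <BR> tags) are my assumptions, not something taken from the files above, and a regex like this is only reasonable on simple, flat HTML.

```python
import re

# A paragraph counts as "blank" here if its body holds only whitespace,
# &nbsp; entities, or <br> tags. That definition is an assumption.
EMPTY_PARAGRAPH = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|<br\s*/?>)*</p>\s*",
    re.IGNORECASE,
)


def drop_empty_paragraphs(html_text):
    """Remove blank <p>...</p> elements and the whitespace that follows them."""
    return EMPTY_PARAGRAPH.sub("", html_text)


if __name__ == "__main__":
    sample = "<p>One.</p>\n<p>&nbsp;</p>\n<P> </P>\n<p>Two.</p>"
    print(drop_empty_paragraphs(sample))
```

The \b after "p" keeps the pattern from touching <PRE> blocks. Anything messier than flat paragraphs probably wants a real HTML parser instead.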

If it comes from Gutenberg, it will likely (but not always) have hard returns marking the chapters and titles. It may (or may not) have simple, textual markup tags like _ for underlining, * for bold, and ~ for italic -- although it varies from book to book. Gutenberg text files have hard line-wrapping as well. (Their HTML versions are far superior, especially since the Distributed Proofreaders days. I don't include them in this critique.) At least the Gutenberg stuff can be parsed -- check out GutenMark. It's GPL, so you could probably start with its source code.
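
As a rough illustration of why this kind of text can be parsed, here is a small Python sketch that rejoins hard-wrapped lines into paragraphs and turns paired markers into tags. This is not how GutenMark works; it is a toy under stated assumptions: paragraphs are separated by blank lines, and the marker-to-tag mapping is the one mentioned above, which varies from book to book. All names in it are made up for the example.

```python
import html
import re

# Marker-to-tag mapping as described in the post. It varies from book to
# book, so treat it as a per-book setting, not a fixed rule.
MARKERS = {"_": "u", "*": "b", "~": "i"}


def unwrap_paragraphs(text):
    """Split on blank lines and rejoin the hard-wrapped lines of each paragraph."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(line.strip() for line in block.splitlines()) for block in blocks]


def convert_markers(paragraph, markers=MARKERS):
    """Escape HTML-special characters, then turn pairs like _word_ into tags."""
    out = html.escape(paragraph, quote=False)
    for marker, tag in markers.items():
        m = re.escape(marker)
        # Match the shortest run between two identical markers.
        out = re.sub(rf"{m}([^{m}]+?){m}", rf"<{tag}>\1</{tag}>", out)
    return out


def text_to_html(text):
    """Wrap each unwrapped, converted paragraph in <p> tags."""
    return "\n".join(f"<p>{convert_markers(p)}</p>" for p in unwrap_paragraphs(text))


if __name__ == "__main__":
    print(text_to_html("It was a _dark_ and\nstormy *night*.\n\nThe ~end~.\n"))
```

It will misfire on stray asterisks, footnote marks, or underscores inside words, which is exactly the sort of per-file quirk that makes a general tool hard.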

There are some remarkably clever, simple textual markup systems. Check out Markdown, for one. The problem is that no one uses them, so you're left trying to redo something that was improperly done.

Certainly, virtually anything you find that didn't originally come from a professional source has lost its markup in the conversion. (And I'm of the mind that professional is an attitude and a commitment, not just getting paid.) You sound like you want to do pro work. Most people just want to read, seem to have no sense of style, and don't think about all the information that they're losing when they make a crappy text file. Somewhere (a Yahoo books group?) I remember feeling compelled to post to people who were stripping the markup from HTML files and redistributing them -- these folks actually think that they're doing a good thing! I believe they started reading books on their old TI calculators, and never moved on. (It's easy to strip markup, hard to add it. Destruction/Creation, Entropy/Life -- which side are you on?)

I don't know if it is possible to program a utility that accurately handles all the stupid ways people cruft up their files, but I'm hopeful. Maybe an AI could do it. (Come on, Skynet!) I'm looking for something myself that will just simplify the (X)HTML to basic <P align>, <H# align>, <BR>, <B>, <I>, <U>, <IMG>, <A>, <A HREF>, <CODE> and <BLOCKQUOTE> (is anything else necessary for most books?) -- adding styles later is much easier. (BTW, FB2 is an excellent schema for books, check it out. It adds cool stuff, allowing for poetry, etc.)
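
For what it's worth, the "simplify to a basic tag set" idea can be sketched with nothing but Python's standard library. This is a minimal sketch, not a finished tool: the whitelist follows the tag list above, but the attributes kept for <IMG> and <A> (src, alt, href, name) are my own guess, and the class and function names are invented for the example. Tags not on the list are dropped while their text is kept; <SCRIPT> and <STYLE> contents are thrown away.

```python
from html import escape
from html.parser import HTMLParser

# Allowed tags mapped to the attributes kept on each one. The tag list comes
# from the post; the attribute choices for img and a are assumptions.
ALLOWED = {
    "p": {"align"},
    **{f"h{n}": {"align"} for n in range(1, 7)},
    "br": set(), "b": set(), "i": set(), "u": set(),
    "img": {"src", "alt"},
    "a": {"href", "name"},
    "code": set(), "blockquote": set(),
}
VOID = {"br", "img"}          # tags that take no closing tag
SKIP = {"script", "style"}    # tags whose contents are discarded


class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
        elif tag in ALLOWED:
            kept = "".join(
                ' {}="{}"'.format(name, escape(value or ""))
                for name, value in attrs
                if name in ALLOWED[tag]
            )
            self.out.append(f"<{tag}{kept}>")
        # Any other tag is dropped, but its text content still passes through.

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skipping = max(0, self.skipping - 1)
        elif tag in ALLOWED and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def simplify(html_text):
    """Return html_text reduced to the whitelisted tags and attributes."""
    parser = Simplifier()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Something like simplify('<div><p align="center" style="color:red">Hi <span><i>there</i></span></p></div>') should come back as just the <p align="center"> paragraph with the <i> inside it. It does not repair badly nested tags, so running a tidy-up pass first is probably wise.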

Oh, there's a nice utility called txt2html that looks promising for reasonably formatted text files. Millions of options.

I hope you can make such a utility. I'd gladly use it, especially if it were scriptable. Or heck, just use some combination of smart utilities like HTML Tidy, txt2html and hstrip (mentioned in the link). Add your own genius. Most of this stuff starts out in Perl, which you seem to understand. (I don't.)

Hope I've helped, and not just added to the noise,

m a r