MobileRead Forums - View Single Post - Yet Another Gutenberg Book/HTML converter

grebki · 12-23-2006, 02:57 PM

Here is a script adapted from FangornUK's nice scripts (OK, shamelessly copied). It is for HTML files that are already split up and need no additional processing.

To use it, you must first create a file list (dir > filelist.txt, then edit the text file until only a clean list of file names is left).

The script reads the file list and calls HTML2LRF.exe with the list of files. It's easier than manually typing in a super-long list of files on the command line (30, 40, 50 chapters...)
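For anyone who wants the idea without opening the attachment, here is a minimal Python sketch of that step. It is not the attached script: the function names are made up, it assumes filelist.txt holds one file name per line, and it passes no options to HTML2LRF.exe (check the converter's own help for whatever it needs).

```python
import os
import subprocess


def read_file_list(list_path):
    """Return the non-blank lines of the file list, stripped of whitespace."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_command(converter, html_files):
    """Build the argument list: the converter followed by every HTML file.

    Converter options are deliberately left out of this sketch.
    """
    return [converter] + list(html_files)


def convert(list_path="filelist.txt", converter=os.path.join(".", "HTML2LRF.exe")):
    """Read the list and hand all the files to the converter in one call."""
    command = build_command(converter, read_file_list(list_path))
    subprocess.run(command, check=True)  # raises if the converter reports failure
```

Passing the arguments as a list (rather than one long string) means file names with spaces need no extra quoting.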

Each HTML file is listed in a working TOC as well (HTML2LRF does this automagically). You can also specify multiple tags for chapter titles. For example, I had a book that used an H3 tag for something along the lines of "Chapter 1" and then an H4 tag with the actual name of the chapter, something like "Fastidious Incompetence." Anyway, hope it helps -- feel free to modify.

A few notes:
* Needs to be run in the directory of HTML2LRF.EXE
* You can either specify a base directory with -d option -or- specify complete path-names for the files in the filelist.txt file. HTML2LRF needs to get complete path names or it doesn't work.
* If no chapter tags are specified, then the contents of the <title> </title> tag will be used by HTML2LRF for the TOC entry (which is fine sometimes). Otherwise, the script will replace the contents of <title> </title> with the contents of the specified tags ... unless it can't find them, in which case it will leave the title unchanged.
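
The last two notes can be sketched roughly like this in Python. Again, this is an illustration and not the attached script: the function names are hypothetical, joining the tag contents with ": " is my own choice, and the regular expressions only cope with simple, well-formed HTML.

```python
import os
import re


def full_path(name, base_dir=None):
    """Join a list entry onto the base directory (if any) and make it absolute."""
    if base_dir:
        name = os.path.join(base_dir, name)
    return os.path.abspath(name)


def extract_tag_text(html, tag):
    """Return the text of the first <tag>...</tag> element, or None if absent."""
    pattern = r"<%s\b[^>]*>(.*?)</%s\s*>" % (re.escape(tag), re.escape(tag))
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # Drop nested markup such as <i>...</i> and collapse whitespace.
    text = re.sub(r"<[^>]+>", "", match.group(1))
    return " ".join(text.split())


def retitle(html, chapter_tags):
    """Replace the <title> contents with the text of the given chapter tags.

    If none of the tags is found, the page is returned unchanged.
    """
    parts = [extract_tag_text(html, tag) for tag in chapter_tags]
    parts = [part for part in parts if part]
    if not parts:
        return html
    new_title = ": ".join(parts)  # separator is an assumption of this sketch
    return re.sub(
        r"(<title\b[^>]*>).*?(</title\s*>)",
        lambda m: m.group(1) + new_title + m.group(2),
        html,
        count=1,
        flags=re.IGNORECASE | re.DOTALL,
    )
```

With the H3/H4 book mentioned above, retitle(page, ["h3", "h4"]) would give a title like "Chapter 1: Fastidious Incompetence".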

To Do:
* Allow chapter headings by regular expression instead of HTML tags (e.g. "<p>Chapter something</p>" or ALL CAPS)
* Pull in text files, split on regexp, remove arbitrary line-breaks, and convert to rudimentary HTML before combining into a BBeB (just to get that nice text-flow and TOC)
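
A rough Python sketch of how the second to-do item might work, in case anyone wants to pick it up. None of this exists in the script yet: the heading pattern, the function names, and the H3 heading in the output are all assumptions, and any text before the first heading is simply dropped.

```python
import html
import re

# Hypothetical heading pattern: a line that starts with "Chapter <something>".
CHAPTER_RE = re.compile(r"^Chapter[ \t]+\S.*$", re.MULTILINE)


def split_chapters(text, heading_re=CHAPTER_RE):
    """Split plain text into (heading, body) pairs at each heading match."""
    matches = list(heading_re.finditer(text))
    chapters = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((match.group(0).strip(), text[match.end():end]))
    return chapters


def unwrap_paragraphs(body):
    """Treat blank lines as paragraph breaks and rejoin hard-wrapped lines."""
    paragraphs = re.split(r"\n\s*\n", body.strip())
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def chapter_to_html(title, body):
    """Wrap one chapter in rudimentary HTML, escaping &, < and >."""
    safe_title = html.escape(title)
    paragraphs = "\n".join(
        "<p>%s</p>" % html.escape(p) for p in unwrap_paragraphs(body)
    )
    return (
        "<html><head><title>%s</title></head><body>\n"
        "<h3>%s</h3>\n%s\n</body></html>\n" % (safe_title, safe_title, paragraphs)
    )
```

Each chapter would then be written to its own HTML file and added to the file list as usual.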

-G