Scripts or command-line utilities for simplifying HTML (Linux preferred)


rogue_ronin
05-15-2009, 12:52 AM
Hey all,

I've been trying to prep a massive library of books -- stuff I've been collecting for years.

I recently condensed it all by author, backed it up, and wrote a couple of small scripts to explode all the LIT, PRC, PDB, ZIP and RAR files into individual directories, then convert all the DOC and RTF files into HTML (via the AbiWord command line). There was already a massive number of folders and HTML files as well. I burned all the PDFs, moved all the RBs (I still have an REB1100) to a different library, and deleted all empty directories. I don't have any ePub, LRF or IMP files.

So that leaves me with Author folders containing a mix of subdirectories with HTML/TXT and images, or straight up HTML/TXT/images sitting in the Author folder.

I expect to spend many hours cleaning each Author tree -- removing duplicates, choosing the best single source for each book (preferably HTML) and finding any rogue elements. (Suggestions for this would be appreciated, too!) I intend to eventually move to ePub.
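
For the duplicate-removal step, here is a minimal sketch that only catches exact byte-for-byte copies (the same book saved from two different converters will not match). The function names `file_digest` and `find_duplicates` are made up for illustration, and the script only reports groups; it deletes nothing.

```python
import hashlib
import os
import sys
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Walk root and return {digest: [paths]} for content seen more than once."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


if __name__ == "__main__":
    # Hypothetical usage: pass one Author folder, or default to the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for digest, paths in find_duplicates(root).items():
        print(digest[:12])
        for path in paths:
            print("   ", path)
```

Run per Author folder, this gives a list of identical files to review by hand before choosing which copy to keep.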

I've found what looks like an excellent tool for converting TXT to HTML (http://txt2html.sourceforge.net/), and it has the awesome name of txt2html. It's a Perl utility with a million options. I might set it loose on the library -- probably a little too imprecise, but...

Okay, this long setup is to ask if anyone knows a good HTML simplifying utility or script. One that is smart enough to interpret CSS and <SPAN>s and other obfuscations -- and just leave a file with simple <P>, <H#>, <BR>, <BLOCKQUOTE>, <EM>, <STRONG>, etc. It should remove <FONT>, color, <DIV>, scripts, stylesheets and other cruft. If it can combine books split across multiple HTML files into a single HTML file, that would be a bonus.
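
To make the request concrete, here is a rough sketch of the tag-whitelist part using only Python's standard `html.parser`. It is an illustration, not an existing tool: the `Simplifier` class, the `simplify` function and the tag sets are assumptions. It does not interpret CSS, so italics expressed as `<span style="...">` are lost, and it does not repair badly nested markup.

```python
from html import escape
from html.parser import HTMLParser

# Tags that survive; everything else is dropped but its text is kept.
KEEP = {"p", "br", "blockquote", "em", "strong",
        "h1", "h2", "h3", "h4", "h5", "h6"}
# Presentational tags mapped onto their logical equivalents.
RENAME = {"i": "em", "b": "strong"}
# Elements whose whole content is discarded.
SKIP = {"script", "style", "head"}
# Kept tags that never get a closing tag.
VOID = {"br"}


class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
            return
        tag = RENAME.get(tag, tag)
        if self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        tag = RENAME.get(tag, tag)
        if self.skip_depth == 0 and tag in KEEP and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape &, < and > because charrefs were decoded on input.
            self.out.append(escape(data, quote=False))


def simplify(html_text):
    """Return html_text reduced to the whitelisted tags plus plain text."""
    parser = Simplifier()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

Combining a multi-file book would then amount to running `simplify` on each file in reading order and concatenating the results inside one wrapper document.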

Most of the DOCs and RTFs were converted by AbiWord, but some HTML was there already and may have been created by Word or other converters. I know about some of the utilities out there, like demoronizer, but that won't work on the AbiWord-converted files.

I'd really prefer to be able to script it, so that the computer does the work.

I also keep a VM with Win2k so that I can run NoteTab Pro. I'm pretty familiar with the Clip macro language (in fact I have built a complete clip-library for processing ebooks that keeps a database, adds metadata, has simple themes, and suchlike.) So if you know a good Clip, that would work too.

Thanks for reading this far.

m a r

ruskie
05-15-2009, 07:07 AM
See HTML Tidy (or just Tidy). It has a ton of options.
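
A hedged sketch of scripting it from Python follows. The option names (`bare`, `word-2000`, `logical-emphasis`, `drop-proprietary-attributes`) come from Tidy's configuration documentation, but which ones a given Tidy build accepts is an assumption, so check `tidy -help-config` first. The helper names `tidy_command` and `run_tidy` are made up.

```python
import shutil
import subprocess


def tidy_command(src, dst):
    """Build a Tidy command line that writes a cleaned copy of src to dst."""
    return [
        "tidy", "-quiet", "-asxhtml",
        "--bare", "yes",                         # strip Word "smart" characters and nbsp
        "--word-2000", "yes",                    # strip Word 2000 surplus markup
        "--logical-emphasis", "yes",             # <i> -> <em>, <b> -> <strong>
        "--drop-proprietary-attributes", "yes",  # remove vendor-specific attributes
        "--show-warnings", "no",
        "-output", dst,
        src,
    ]


def run_tidy(src, dst):
    """Run Tidy if it is installed; return its exit code, or None if missing."""
    if shutil.which("tidy") is None:
        return None
    # Tidy exits with 0 for success, 1 for warnings, 2 for errors.
    return subprocess.run(tidy_command(src, dst)).returncode
```

This cleans up markup but does not by itself remove every <DIV> or <SPAN>, so it is probably a first pass before a tag-stripping step rather than a replacement for one.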

rogue_ronin
05-15-2009, 11:51 AM
Checking it out now, just might work.

I knew of it, of course -- but I thought it just did syntax checking.

Thanks!

m a r

ps: still taking suggestions!