Old 05-14-2009, 11:52 PM   #1
rogue_ronin
Banned
 
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Scripts or command-line utilities for simplifying HTML (linux preferred)

Hey all,

I've been trying to prep a massive library of books -- stuff I've been collecting for years.

I recently condensed it all by author, backed it up, and wrote a couple of small scripts to explode all the LIT, PRC, PDB, ZIP and RAR files into individual directories, then convert all the DOC and RTF files into HTML (via Abiword command-line.) There was already a massive number of folders and HTML as well. I burned all the PDFs, moved all the RBs (I still have an REB1100) to a different library, and deleted all empty directories. I don't have any ePub or LRF or IMP files.

So that leaves me with Author folders containing a mix of subdirectories with HTML/TXT and images, or straight up HTML/TXT/images sitting in the Author folder.

I expect to spend many hours cleaning each Author tree -- removing duplicates, choosing the best single source for each book (preferably HTML) and finding any rogue elements. (Suggestions for this would be appreciated, too!) I intend to eventually move to ePub.
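For the duplicate-removal step, one possible starting point is to group files by content hash. This is only a sketch: the function name `find_duplicates` is made up for illustration, and it only catches byte-identical files, so the same book saved as HTML and as TXT will not be matched.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root):
    """Return lists of files under root whose contents are byte-identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Hash the raw bytes; files with the same digest are treated as copies.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # Keep only the groups that contain more than one file.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Running it on one Author folder at a time and reviewing each group by hand before deleting anything is probably the safer way to use it.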

I've found what looks like an excellent tool for converting TXT to HTML, and it has the awesome name of txt2html. It's a perl utility with a million options. I might set it loose on the library -- probably a little too imprecise, but...

Okay, this long setup is to ask if anyone knows a good HTML simplifying utility or script. One that is smart enough to interpret CSS and <SPAN>s and other obfuscations -- and just leave a file with simple <P>, <H#>, <BR>, <BLOCKQUOTE>, <EM>, <STRONG>, etc., while removing <FONT>, color, <DIV>, scripts, stylesheets and other cruft. If it can combine multi-file HTML books into a single HTML file, that would be a bonus.
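A rough sketch of that kind of filter, using only Python's standard `html.parser`. The whitelist, the `Simplifier` class and the `simplify`/`combine` helpers are all hypothetical names chosen for illustration, not an existing tool. It does not interpret CSS at all, so italics or bold applied through a stylesheet or a styled <SPAN> are lost rather than converted to <EM>/<STRONG>.

```python
from html import escape
from html.parser import HTMLParser

# Tags to keep (an assumed whitelist; adjust to taste).
KEEP = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "br",
        "blockquote", "em", "strong"}
# Presentational tags mapped onto their simpler equivalents.
RENAME = {"i": "em", "b": "strong"}
# Elements whose whole content is thrown away.
DROP_CONTENT = {"script", "style", "title"}
VOID = {"br"}  # tags that never get a closing tag


class Simplifier(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        tag = RENAME.get(tag, tag)
        if self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"<{tag}>")  # all attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        tag = RENAME.get(tag, tag)
        if self.skip_depth == 0 and tag in KEEP and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text arrives decoded, so re-escape &, < and >.
            self.out.append(escape(data, quote=False))


def simplify(html_text):
    """Strip everything except whitelisted tags and text."""
    parser = Simplifier()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def combine(html_texts):
    """Simplify several HTML documents and join them into one page."""
    body = "\n".join(simplify(t) for t in html_texts)
    return f"<html><body>\n{body}\n</body></html>\n"
```

Tags outside the whitelist (<FONT>, <DIV>, <SPAN> and so on) are dropped but their text is kept, which matches the "remove the cruft, keep the words" goal. Badly nested input is passed through as-is, so running a validator over the result afterwards is still worthwhile.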

Most of the DOCs and RTFs were converted by Abiword, but some were there already and may have been created by Word or other converters. I know about some of the utilities out there, like demoronizer, but that won't work on the Abiword-converted files.

I'd really prefer to be able to script it, so that the computer does the work.

I also keep a VM with Win2k so that I can run NoteTab Pro. I'm pretty familiar with the Clip macro language (in fact I have built a complete clip-library for processing ebooks that keeps a database, adds metadata, has simple themes, and suchlike.) So if you know a good Clip, that would work too.

Thanks for reading this far.

m a r
Old 05-15-2009, 06:07 AM   #2
ruskie
Shade
Posts: 100
Karma: 546
Join Date: Mar 2009
Location: U.LC.MW.Sol.Earth.EE.SI.LJ
Device: Hanlin Jinke V3ext running OpenInkPot
See HTML Tidy (or just Tidy); it has a ton of options.
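As a sketch of how Tidy could be scripted over a library, the snippet below builds a command line and runs it from Python. The particular option set is an assumed starting point rather than a recommendation from this thread, the helper names `tidy_command` and `run_tidy` are made up, and option availability varies between Tidy versions, so check `tidy -help-config` on your install.

```python
import subprocess


def tidy_command(src, dst):
    """Build a Tidy command line (the options are an assumed starting point)."""
    return [
        "tidy",
        "-quiet",                # suppress non-essential output
        "-asxhtml",              # write well-formed XHTML
        "-bare",                 # strip smart quotes, em dashes and similar
        "--word-2000", "yes",    # remove surplus markup from Word-saved HTML
        "--drop-proprietary-attributes", "yes",
        "-o", dst,               # write the result to dst
        src,
    ]


def run_tidy(src, dst):
    """Run Tidy on one file and return its exit status."""
    # Tidy exits with 1 for warnings and 2 for errors, so check=True is avoided.
    result = subprocess.run(tidy_command(src, dst),
                            capture_output=True, text=True)
    return result.returncode
```

Tidy mostly repairs and normalizes markup; on its own it will not reduce a page to a handful of tags, so it probably works best as a first pass before a stricter filter.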
Old 05-15-2009, 10:51 AM   #3
rogue_ronin
Banned
 
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Checking it out now, just might work.

I knew of it, of course -- but I thought it just did syntax checking.

Thanks!

m a r

ps: still taking suggestions!
Tags
html, linux, notetab, script, utility

Thread Tools Search this Thread
Search this Thread:

Advanced Search

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Where are the command line tools? PaulChernoch Calibre 17 10-23-2009 12:08 PM
Why use the command line? slantybard Calibre 6 07-22-2009 12:17 PM
Imp scripts and wine linux related derrell Fictionwise eBookwise 12 10-31-2008 04:53 PM
calibre command line utilities and calibre defaults astrodad Calibre 2 08-07-2008 03:27 PM


All times are GMT -4. The time now is 10:49 AM.


MobileRead.com is a privately owned, operated and funded community.