GuteBook (version 0.5) Copyright (C) 2009 Nick Rapallo (nrapallo)


Retrieves the specified Project Gutenberg file, unzips it and filters it.
Provide the PG Etext number and it will try to download the relevant
HTML (or text) version.  Alternatively, you can specify a previously
downloaded ZIP file or an already extracted PG HTML file.
  e.g. gutebook.pl 17297
  e.g. gutebook.pl http://www.gutenberg.org/files/17297/17297-h.zip
  e.g. gutebook.pl c:\dl\17297-h.zip (not yet fully functional)
  e.g. gutebook.pl c:\dl\17297-h\17297-h.htm (not yet fully functional)
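
The four kinds of input shown above can be told apart with a few simple checks. The following is a minimal Python sketch of that idea, not the script's actual Perl code: the names classify_input and html_zip_url are hypothetical, and the URL pattern is inferred only from the single example URL above, so it may not hold for every Etext number.

```python
import re

# Base URL taken from the example above; assumed to apply to other numbers too.
PG_FILES_BASE = "http://www.gutenberg.org/files"


def classify_input(arg):
    """Guess what kind of input was given on the command line.

    Returns 'etext', 'url', 'zip', 'html' or 'unknown'.
    """
    if re.fullmatch(r"\d+", arg):
        return "etext"  # bare PG Etext number, e.g. 17297
    if re.match(r"https?://", arg, flags=re.IGNORECASE):
        return "url"  # direct link to a ZIP or HTML file
    lower = arg.lower()
    if lower.endswith(".zip"):
        return "zip"  # previously downloaded ZIP file
    if lower.endswith((".htm", ".html")):
        return "html"  # already extracted PG HTML file
    return "unknown"


def html_zip_url(etext_no):
    """Build the HTML ZIP URL for an Etext number, following the example URL."""
    return f"{PG_FILES_BASE}/{etext_no}/{etext_no}-h.zip"


if __name__ == "__main__":
    for arg in ["17297", html_zip_url(17297), r"c:\dl\17297-h.zip"]:
        print(arg, "->", classify_input(arg))
```

Checking for a bare number first keeps the common case (an Etext number) cheap, and the extension checks are case-insensitive because Windows paths often are.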


Usage: gutebook.pl [options] [Project Gutenberg EText-No. | link to ZIP|HTML]
where [options] include:

  -a, --author   "AUTHOR"   override the Author name detected
		Usually detected within PG preamble.

  -t, --title    "TITLE"    override the Book title detected
		Usually detected within PG preamble.

  -c, --category "CATEGORY" override default "Project Gutenberg"
		Refer to catalog.rdf.html for possible Subjects/Categories to use.

  -h, --help    command line help screen (also seen with no parameters)

 Input/Source:
  --PGnum #     override EText-No. detection if no # in input file name
		Numbers above 10000 work well, but some numbers below 5000 may not have any
		retrievable files.  In that case, specify the number with this switch
		and provide the URL link to the text version as the input.

  --keepzip     keep PG .zip file downloaded (local/cache copy for re-edits)
		Avoids retrieving the same source .zip file from the PG website when 
		different GUI HTML fixes/options are tried.
		
  --keephtm     keep PG .htm file extracted from downloaded .zip (or .txt)
		The original .htm to refer to when things go wrong with any substitutions.

  --usegm       use GutenMark for .txt to .htm conversion; otherwise abort
		GutenMark must be installed for .txt to be transformed into .htm.
		Future versions may replace this with an internal routine.

 Output formats:
 (any or all)

  --1150 --1200 eBookwise .imp created by eBook Publisher
		Can also use a new DOS executable called Html2IMP.exe or html2imp.pl.
		It must be manually placed in the path or in the directory where the batch file resides.

  --1100        Rocket eBook .rb created by eBook Publisher
  		Can also use rbmake (available from http://rbmake.sourceforge.net/)
  
  --epub --lrf  Sony PRS .epub/.lrf created by calibre

  --mobi        Mobipocket .mobi created by calibre
  		Can also use mobigen or Mobipocket Creator (must be installed manually)
  		
  --lit         Microsoft .lit created by calibre
  
  --srcepub     single .xhtml (non-Sony) .epub created by calibre
  		Can be used for a master/retention copy
  		
  --pdb         eReader .pdb created by calibre

  --zip         Not yet implemented - reserved for calibre .zip

 Output options:
  --outdir DIR  specify DIR where converted ebooks are placed; default is the install directory

  --nobatch     do not create DOS batch file for later re-edits
		Specify this if you will not need to re-edit the modified .htm produced.

  -v, --verbose print out messages about this conversion
		Also prints the output from the conversion programs used to create
		ebooks from the modified .htm

  --debug       print out more detailed messages about the conversion

 HTML options:
  --LRmargins $ specify overall <body> left/right margins; default $="2%"
  		$ represents a string
  		
  --indent  $   specify overall <body> para. text indents; default $="2em"
  		$ represents a string
  		
  --fixpre1 $   suffix for  <pre> for .mobi; default $="<small><tt>"
  		$ represents a string
  		
  --fixpre2 $   prefix for </pre> for .mobi; default $="</tt></small>"
  		$ represents a string
  		
  -p, --pb  $   pagebreaks on max. 2 HTML Tags, like $="h1 h2"; default $="h2"
  		$ represents a string separating two HTML tags like "h3 h4". 
		For no pagebreaks, just prefix it with a period like ".h2"
		
  --nojustify   specify no <body> justification; default is justified text

  --nopara      specify no <body> para. separation; default is blank line sep.

  --pbwithin    pagebreak tags within anchor links to Chapter headings (mobi)

  --pbnofirst   ignore pagebreak on first pagebreak HTML Tag

  --pbfirsth1   force pagebreak on first <h1>

  --pbtoc       force pagebreak at TOC location

  --tocname     omit the "toc" anchor name inserted before TOC (.mobi/.imp)

  --noPGtrailer do not insert PG trailer (Booktitle/Author/Released/EText-No.)

  --PGheader    retain PG header (preamble); default is to strip it out

  --PGfooter    retain PG footer (legalese); default is to strip it out

  --PGpagenum   retain/display PG page numbers; default is to strip them out

  --imgsrc      strip all except "src=" within <img> tags
  		Also removes "width" elements from within the preceding <div class=figcenter>,
  		which cause images not to be centered in .epub files.

  --centerh     force all <h1> to <h6> tags to be centered.
  
  --smallerfont make overall <body> text one font size smaller
  
  --largerfont  make overall <body> text one font size larger
  
  --search      Custom Perl RegEx search string expression; use \" for any "
  
  --replace     Custom Perl RegEx replace string expression; use \" for any "
  
  --modi        Custom Perl RegEx "i" modifier for case-insensitive matching
  
  --modg        Custom Perl RegEx "g" modifier for global replacements
  
  --noimgfix    Not yet implemented - do not re-save images for compatibility
  
  --cover       Not yet implemented - extract "cover image" into new cover.htm
		Temporarily used to print a text PG title page with specific metadata
		(would be better to take a snapshot of this as a "cover" image)

  --addtoc      Not yet implemented - create TOC from pagebreak Tags
  
  --addtocend   Not yet implemented - place created TOC above at end
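
The -p, --pb value described under HTML options follows a small set of rules: at most two tags, separated by a space, with a leading period meaning "no pagebreaks". The Python sketch below shows one way those rules could be read. It is an illustration only: parse_pagebreak_tags is a hypothetical name, and how the script itself treats more than two tags, or a period on only the second tag, is not stated here, so rejecting extra tags is an assumption.

```python
def parse_pagebreak_tags(value="h2"):
    """Turn a --pb string such as "h1 h2" into a list of tag names.

    A value starting with a period (e.g. ".h2") disables pagebreaks and
    yields an empty list. The default "h2" matches the documented default.
    """
    value = value.strip()
    if value.startswith("."):
        return []  # period prefix: no pagebreaks at all
    tags = value.lower().split()
    if len(tags) > 2:
        # Assumption: more than two tags is treated as an error.
        raise ValueError("at most two pagebreak tags are supported")
    return tags


if __name__ == "__main__":
    for example in ["h2", "h1 h2", ".h2"]:
        print(repr(example), "->", parse_pagebreak_tags(example))
```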

  
PREREQUISITES:
  - Requires calibre installed (for .epub/.lrf/.mobi/.lit) (see http://calibre.kovidgoyal.net/download )
    If an empty file named 'calibreold' (no extension) exists in the install directory,
    the v0.5 (stable) calibre is used instead of the new v0.6 (beta/release) calibre.
  - Requires eBook Publisher installed (for .imp/.rb) (see http://www.ebooktechnologies.com/support_publisher_download.htm )
  - Optionally converts text files if GutenMark is installed and in path (see http://www.sandroid.org/GutenMark/download.html )
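
The 'calibreold' marker file acts as a simple on/off switch: only its presence matters, not its contents. A minimal Python sketch of such a check follows. The function name use_old_calibre and the install_dir parameter are hypothetical, and the real script is written in Perl, so this only illustrates the idea.

```python
import os


def use_old_calibre(install_dir):
    """Return True if an extension-less 'calibreold' file exists in install_dir."""
    marker = os.path.join(install_dir, "calibreold")
    return os.path.isfile(marker)


if __name__ == "__main__":
    # Check the current directory as a stand-in for the install directory.
    print("use old calibre:", use_old_calibre("."))
```

Creating or deleting the empty file is then enough to switch between the two calibre versions without changing any command-line options.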

REVISIONS:
  v0.5 - June 22, 2009
  - If an empty file named 'calibreold' (no extension) exists in the install directory,
  the v0.5 (stable) calibre is used instead of the new v0.6 (beta/release) calibre
  - better support for installation to a location other than the default "C:\Program Files".
  - improved direct download of PG Australia ebooks.  Allowed local cached copy to
  be retained using --keepzip or --keephtm; avoids subsequent PGA website downloads.
  - implemented creation of eReader .pdb when using calibre v0.6 (beta)
  - fixed handling of single dash ("-") options
  - improved print statements feedback
  - better handling of PGA metadata within .htm
  - better handling of important/necessary text after "THE END" but before PGA blurb.
  - allowed existing .txt CHARSET to be used for generated .htm meta content-type
  - better handling of --pbnofirst when <h1> already used as a pb tag
  - misc. PGA .htm fixes for color and removed fixed fontsize for <p> and <table>

  v0.4 - June 10, 2009
  - added ability to directly download PG Australia ebooks using their
  EText-No. AND URL link to the .html placed as the Input file.
  For example,
  use:  --PGnum 1547A & http://gutenberg.net.au/ebooks07/0700941h.html
  Note that downloading .zip is fine, but .txt is not yet fully functional
  - improved Custom Perl Search and Replace functionality.  Still need to
  use \" for any "; however, due to a DOS limitation, ^ cannot be used yet!
  - minor code/html fixes.
  
  v0.3 - June 4, 2009
  - also add "start" anchor when the PG preamble is retained
  - also remove any stray <br>'s from metadata.
  - fixes GUI options loading; now properly remembers the search and replace strings.
  Ensure any " or / are escaped by \.
  - simple PG title page added when --cover specified (would be better to take a
  snapshot of this as a "cover" image)
  - option --imgsrc (GUI: 'Extract cover') now also removes "width" elements from within
  the preceding <div class=figcenter>, which caused images not to be centered in .epub's
   
  v0.2 - June 3, 2009
  - removed unwanted blank page at start in .lrf caused by use of tags '<pre></pre>'
  - minor GUI / files cleanup
  
  v0.1 - June 2, 2009
  - initial public release