03-03-2009, 04:25 AM   #1
cklammer
Mass Batch conversion of HTML-Single-File ebooks to .mobi ebooks

Hi all,

First a warning: Looooooonnnnnggggg post ahead !!!!!!!!!!!!!

As part of my migration from Plucker to Mobipocket, I had to mass-convert approx. 500 ebooks from single-file HTML format to Mobipocket .mobi/.prc format. Actually, a lot of the ebooks were originally in text, LIT and PDF format and had been converted to text format some time back for reading on a Nokia smartphone, using tools such as ABC Amber LIT Converter where appropriate.

I did not simply want to drag and drop all the text files into the Windows Mobipocket Reader, as I want to have at least the title and author tags properly set. Dragging and dropping a bunch of files will not do that. Quite the opposite: the file name becomes the title of the resulting .mobi ebook and the author is either left empty (if you are lucky) or set to some random value (if you are unlucky, depending on your circumstances).

So I tried to mass-convert the text files with mobiperl or mobigen instead, but the text files proved unsuitable for direct conversion with either of those two tools.

So I downloaded "Easy Text to HTML Converter" and batched my ~500 text files for conversion into HTML using that tool's default template. That was slow but steady: the job was finished about 24 hours later (with some other unrelated work, such as DVD burning, going on on the conversion machine). (See also below.)

That netted me my ~500 html ebooks now - so far, so good. At this point let me remark not to ever delete your original lit/pdf/any format source ebook files like I did in the past - you never know when you might need them again !!! And don't be cocksure about what can be deleted: I was ....

For the HTML mass conversion I decided to write a script. I started out with mobiperl and ended up using the win32 executable of html2mobi in a single-line Windows command processor (cmd) command, which converted my HTML files to .mobi files without complaint. The only problem was that almost every one of the generated ebooks showed up just fine in the list of the Mobipocket Windows reader but could not be opened, failing with a file corruption error message. The ebooks concerned were the files generated from the text-to-HTML conversion using the default template of "Easy Text to HTML Converter". No twiddling would change this result, so I abandoned mobiperl because it obviously has problems with the shitty/complicated/whatever-it-is HTML generated by that default template.

Based on my experience, I recommend that everybody stay away from "Easy Text to HTML Converter".

My next approach for mass conversion was to use mobigen. But an .opf project file is needed for every ebook to be generated if one wants the author and title properly set. I fired up Mobipocket Creator, converted a single HTML file to Mobipocket and looked at the resulting .opf file: to my surprise it was simply XML serialized into a single-line text file ... tadaaa. Now I knew that I was almost home free if mobigen could handle the "Easy Text to HTML Converter" output.
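Because the .opf file is plain XML, the metadata it carries can be read back with any XML parser. Here is a minimal sketch in Python (not the VBScript used for the actual conversion in this post), assuming the element names and the Dublin Core namespace URI that appear in the template file further down. `read_metadata` is a hypothetical helper name, not part of any Mobipocket tool.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared on <dc-metadata> in the .opf template of this post.
DC = "http://purl.org/metadata/dublin_core"


def read_metadata(opf_bytes: bytes):
    """Return (title, author) from the raw bytes of an .opf project file."""
    root = ET.fromstring(opf_bytes)
    # dc:Title and dc:Creator live in the Dublin Core namespace.
    title = root.findtext(f".//{{{DC}}}Title")
    author = root.findtext(f".//{{{DC}}}Creator")
    return title, author
```

Something like `read_metadata(open("some book.opf", "rb").read())` should then give the title and author that mobigen will pick up.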

I ran mobigen on the opf file generated by Mobipocket Creator and the result was to my delight a "rather usable" Mobipocket ebook which worked in the Mobipocket Windows Reader.

I then wrote a Visual Basic Script for generating appropriate opf files and running mobigen for the conversion.

So this is what I did in the directory where my ebook html files are stored:

(0) Change all file extensions from .htm to .html. You can use LUPAS Rename 2000 for this task.
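If you would rather script the extension change than use a renaming tool, it could be sketched roughly like this in Python (my own illustration, not what I actually ran; the function names are made up). It skips a file if the .html name is already taken, so nothing gets overwritten.

```python
from pathlib import Path


def html_name(filename: str) -> str:
    """Return filename with a trailing .htm (any case) changed to .html."""
    p = Path(filename)
    if p.suffix.lower() == ".htm":
        return str(p.with_suffix(".html"))
    return filename


def rename_htm_files(directory: str = "."):
    """Rename every *.htm file in directory to *.html; return the new names."""
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() == ".htm":
            target = path.with_name(html_name(path.name))
            if target.exists():
                continue  # never overwrite an existing .html file
            path.rename(target)
            renamed.append(target.name)
    return renamed
```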


(1) Preparation of the HTML files' names (this is an optional step): I used "LUPAS Rename 2000" to clean up the file names of my HTML files. For me this step included replacing "_" with a space, replacing sequences of two or more spaces with a single space and removing angle brackets from the file names. The result is a bunch of files with names of the form
Code:
<Author's last name>, <Author's first names>[, <Author's titles>] - <Title>.html
Caveat: If your file names contain the character sequences %1, %2 or %3, you have to remove them before you can proceed with the next step, because the conversion script uses them as placeholders in the OPF template.
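The clean-up rules from step (1) can be sketched in a few lines of Python. This is only an illustration of the rules as I described them, not the LUPAS Rename settings themselves. `clean_name` and `has_placeholder` are hypothetical names, and reading "angle brackets" as the characters < and > is an assumption.

```python
import re

# The three placeholder strings used by the OPF template.
PLACEHOLDERS = ("%1", "%2", "%3")


def clean_name(name: str) -> str:
    """Apply the clean-up rules to a file name without its extension."""
    name = name.replace("_", " ")        # underscores become spaces
    name = re.sub(r"[<>]", "", name)     # drop angle brackets
    name = re.sub(r"\s{2,}", " ", name)  # collapse runs of whitespace
    return name.strip()


def has_placeholder(name: str) -> bool:
    """True if the name contains %1, %2 or %3 and needs manual fixing."""
    return any(p in name for p in PLACEHOLDERS)
```

For example, `clean_name("Doe,_John_-__My  <Title>")` gives `"Doe, John - My Title"`.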



(2) Manually create a list of the ebooks to be converted, named 00-booklist.txt:

Code:
dir /B /O:GNE *.html > 00-booklist.txt
notepad 00-booklist.txt
In Notepad, replace all occurrences of the string ".html" with nothing, save and quit.

This will result in a file 00-booklist.txt in which each line contains one ebook entry of the form
Code:
<Author's last name>, <Author's first names>[, <Author's titles>] - <Title>
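For those who want to skip the Notepad round trip, the book list could also be produced by a small Python sketch like the one below (an alternative illustration, not what I used). It strips only the final extension rather than every ".html" in the name. The function names are hypothetical, and the Windows-1252 encoding is an assumption based on the script reading the list as plain ANSI text.

```python
from pathlib import Path


def booklist_entries(filenames):
    """Return the names of all .html files, extension removed, sorted by name."""
    stems = [Path(n).stem for n in filenames if Path(n).suffix.lower() == ".html"]
    return sorted(stems, key=str.lower)


def write_booklist(directory: str = ".", listname: str = "00-booklist.txt"):
    """Write one entry per line to the book list and return the entries."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    entries = booklist_entries(names)
    text = "".join(entry + "\n" for entry in entries)
    # Assumption: ANSI (Windows-1252) text, as the VBScript reads it by default.
    Path(directory, listname).write_text(text, encoding="cp1252")
    return entries
```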

(3) Make sure mobigen.exe is either in your %PATH% or in the ebook directory. Make sure that Microsoft Windows Script Host is installed and current. This is definitely an issue for Win9x/Me users, possibly an issue for Win2k users, most likely not an issue for WinXP (even unpatched) users and no issue at all for Vista or Win7 users. Windows Script Host can be obtained from Microsoft downloads (get at least version 5.6 or 5.7).
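A quick way to check the prerequisites of step (3) before starting a long batch could look like this in Python (just a sketch; `find_tool` is a made-up helper). It looks in the ebook directory first and then falls back to the PATH search that `shutil.which` performs.

```python
import shutil
from pathlib import Path


def find_tool(name: str, ebook_dir: str = "."):
    """Return the path of an executable in ebook_dir or on PATH, else None."""
    local = Path(ebook_dir) / name
    if local.is_file():
        return str(local)
    # shutil.which searches the directories listed in PATH.
    return shutil.which(name)
```

Calling `find_tool("mobigen.exe")` and `find_tool("cscript.exe")` and checking that neither returns `None` should be enough to catch a missing tool early.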

(4) Make sure the files 00-template.opf and 00-2mobi.vbs are in the ebook directory. Put your own cover for the Mobipocket ebooks to be generated into the ebook directory under the name 00-cover.jpg.

(5) In your ebook directory run:

Code:
cscript 00-2mobi.vbs
That's it if you have done everything according to the above procedure. Now you should find an .opf Mobipocket project file and a .mobi Mobipocket ebook file for every html file unless mobigen has a problem with one file or the other.
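To find out which books mobigen had a problem with, one could compare the .html files against the .mobi files afterwards. A rough Python sketch (my own illustration, with a hypothetical function name), assuming each .mobi ends up next to its .html file under the same base name:

```python
from pathlib import Path


def missing_mobi(filenames):
    """Return the sorted base names of .html files without a matching .mobi."""
    names = list(filenames)
    # Compare in lower case, since Windows file names are case-insensitive.
    present = {n.lower() for n in names}
    return sorted(
        Path(n).stem
        for n in names
        if n.lower().endswith(".html")
        and (Path(n).stem + ".mobi").lower() not in present
    )
```

Run in the ebook directory, `missing_mobi(p.name for p in Path(".").iterdir())` lists the books that still need attention.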

Here is the script 00-2mobi.vbs:

Code:
REM 00-2mobi.vbs: Mass conversion of HTML Pages to Mobipocket
REM Version 0.1/03-FEB-2009
REM Released under the respective current version of the GPL by cklammer

Main()
WScript.Quit 0

Sub Main()
	Const ForReading = 1
	Const Separator = " - "

	Dim booklistfile
	Dim book
	Dim bindestrich
	Dim author
	Dim title
	Dim opffile
	Dim opftemplate
	Dim opfcontent
	Dim opftemplatefile
	Dim opffilename

	Dim FSO
	Set FSO = CreateObject("Scripting.FileSystemObject")

	Dim oShell
	Set oShell = WScript.CreateObject("WScript.Shell")

	REM The template is a single line of XML, so one ReadLine fetches all of it
	Set opftemplatefile = FSO.OpenTextFile("00-template.opf", ForReading)
	opftemplate = opftemplatefile.ReadLine
	opftemplatefile.Close

	Set booklistfile = FSO.OpenTextFile("00-booklist.txt", ForReading)
	Do While Not booklistfile.AtEndOfStream
		book = booklistfile.ReadLine
		REM Skip empty lines so that no nameless ".opf" file is created
		If Len(book) > 0 Then
			REM Split "<author> - <title>" at the first separator
			bindestrich = InStr(book, Separator)
			If bindestrich = 0 Then
				author = "Unknown"
				title = book
			Else
				author = Trim(Left(book, bindestrich - 1))
				title = Trim(Mid(book, bindestrich + Len(Separator)))
			End If

			REM Fill in the placeholders: %1 = title, %2 = author, %3 = HTML file
			opfcontent = Replace(opftemplate, "%1", title)
			opfcontent = Replace(opfcontent, "%2", author)
			opfcontent = Replace(opfcontent, "%3", book & ".html")

			REM Write the OPF project file for this book, overwriting any old one
			opffilename = book & ".opf"
			Set opffile = FSO.CreateTextFile(opffilename, True)
			opffile.WriteLine opfcontent
			opffile.Close

			REM Run mobigen on the OPF file and wait until it has finished
			oShell.Run "mobigen " & """" & opffilename & """", 1, True
		End If
	Loop

	booklistfile.Close
	Set FSO = Nothing
	Set oShell = Nothing
End Sub
Copy and paste the above code into Notepad and save the resulting file under the name 00-2mobi.vbs in your ebook directory.
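The one piece of logic in the script worth spelling out is how a book list entry is split into author and title: everything before the first " - " is the author, everything after it is the title, and entries without a separator get the author "Unknown". Expressed as a small Python sketch for illustration (`split_book` is a hypothetical name; the script itself is VBScript):

```python
def split_book(entry: str, sep: str = " - "):
    """Split a '<author> - <title>' entry at the first separator."""
    pos = entry.find(sep)
    if pos == -1:
        # No separator: treat the whole entry as the title.
        return "Unknown", entry
    return entry[:pos].strip(), entry[pos + len(sep):].strip()
```

Note that a title which itself contains " - " stays intact, because only the first separator is used for the split.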

Here is the OPF template file 00-template.opf:

Code:
<?xml version="1.0" encoding="utf-8"?><package unique-identifier="uid"><metadata><dc-metadata xmlns:dc="http://purl.org/metadata/dublin_core" xmlns:oebpackage="http://openebook.org/namespaces/oeb-package/1.0/"><dc:Title>%1</dc:Title><dc:Language>en</dc:Language><dc:Identifier id="uid">0FC99EFF4B</dc:Identifier><dc:Creator>%2</dc:Creator></dc-metadata><x-metadata><output encoding="Windows-1252"></output><EmbeddedCover>00-cover.jpg</EmbeddedCover></x-metadata></metadata><manifest><item id="item1" media-type="text/x-oeb1-document" href="%3"></item></manifest><spine><itemref idref="item1"/></spine><tours></tours><guide></guide></package>
This file is attached.
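One weakness of plain text replacement is that a title or author containing "&" or "<" produces an .opf file that is no longer valid XML. A possible refinement, sketched here in Python purely as an illustration (it is not what the VBScript does, and `fill_template` is a made-up name), escapes the values first and substitutes all three placeholders in a single pass. Whether mobigen copes with escaped characters in the href attribute is something I have not verified.

```python
import re
from xml.sax.saxutils import escape


def fill_template(template: str, title: str, author: str, html_name: str) -> str:
    """Fill %1 (title), %2 (author) and %3 (HTML file) with XML-escaped values."""
    values = {
        "%1": escape(title),
        "%2": escape(author),
        # %3 sits inside an attribute, so double quotes are escaped as well.
        "%3": escape(html_name, {'"': "&quot;"}),
    }
    # One pass over the template: inserted values are never scanned again,
    # so a title containing "%2" cannot be mistaken for a placeholder.
    return re.sub(r"%[123]", lambda m: values[m.group(0)], template)
```

With this approach the caveat about %1, %2 and %3 in file names would no longer matter, since only the placeholders in the template itself are replaced.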

The source file for the example is Obama, Barack Hussein - Inaugural Presidential Address. Unpack the HTML file inside into your ebook directory and rename it to "Obama, Barack Hussein - Inaugural Presidential Address.html".

Have fun and good luck,
cklammer
Attached Thumbnails
00-cover.jpg (187.0 KB)
Attached Files
File Type: opf 00-template.opf (642 Bytes, 961 views)
File Type: txt 00-booklist.txt (56 Bytes, 727 views)
File Type: opf Obama, Barack Hussein - Inaugural Presidential Address.opf (747 Bytes, 862 views)
File Type: mobi Obama, Barack Hussein - Inaugural Presidential Address.mobi (89.1 KB, 829 views)

Last edited by cklammer; 03-03-2009 at 04:27 AM. Reason: I fucked up. not enough code tags