Extract html from epub


Waltarro
08-27-2009, 12:37 AM
I got a little tired of manually extracting the html from epub
files when I wanted to just read the book in a browser. Just messing
around with bash I came up with a simple script to do the job.

It's pretty crude, and I know I should have read the metadata.opf
(and probably would have if I had done this in Java or Python). Anyway,
I thought I would share it nonetheless. It works on Linux and might work
on a Mac with a few tweaks. Just pass in the epub file as the first
parameter.


#!/bin/bash

# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

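# The epub to convert is the first argument (quoted so names with spaces work)
bookname="$1"
# An epub is a zip archive; unpack it into a scratch directory
unzip "$1" -d /tmp/epub2html > /dev/null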

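# Locate the first chapter (<prefix>_1.html) to learn the prefix shared by the chapter files
str0=$(find /tmp/epub2html/content/ -regex '.*_1\.html')

# Drop the 23-character "/tmp/epub2html/content/" prefix, then the trailing "1.html"
substr=${str0:23}
substr=${substr%1.html}

# Count the chapter files; the count is used as the upper bound of the loop
files=$(ls /tmp/epub2html/content/"$substr"*.html | wc -l)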

# Append each numbered chapter, in order, to a .html file named after the epub
for x in $(seq 0 "$files"); do
    filepart="/tmp/epub2html/content/$substr$x.html"

    if [ -e "$filepart" ]; then
        cat "$filepart" >> "${bookname%.epub}.html"
    fi
done

# Copy over the images if you want them
mkdir -p resources

# cp -t is a GNU option; errors are silenced because a book may have no .jpg or .png files
cp /tmp/epub2html/content/resources/*.jpg /tmp/epub2html/content/resources/*.png -t ./resources 2> /dev/null

rm -R /tmp/epub2html
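
For anyone who would rather take the reading order from the OPF file than guess it from file names, here is a rough sketch of how that could look in shell. It is not part of the script above. It assumes the epub has already been unzipped, that attributes are double-quoted, that tags carry no namespace prefix (plain item and itemref), and that hrefs are not percent-encoded. The function names attr and epub_spine_files are made up for this example. The OPF can have any name, so it is looked up through META-INF/container.xml.

```shell
#!/bin/bash
# Hypothetical helper: print the value of attribute $1 found in the tag text $2.
attr() {
    printf '%s\n' "$2" | sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

# Hypothetical helper: print the spine documents of an unpacked epub, one path
# per line, in reading order. $1 is the directory the epub was unzipped into.
epub_spine_files() {
    local root="$1"
    local opf opf_dir tags ref idref item href

    # META-INF/container.xml points at the OPF through full-path="..."
    opf=$(attr full-path "$(tr '\r\n\t' '   ' < "$root/META-INF/container.xml")")
    [ -n "$opf" ] || return 1
    opf_dir=$(dirname "$root/$opf")

    # Put one tag per line so attribute order inside a tag does not matter
    tags=$(tr '\r\n\t' '   ' < "$root/$opf" | tr '<' '\n')

    # Each <itemref idref="X"> in the spine names a manifest <item id="X" href="...">
    printf '%s\n' "$tags" | grep '^itemref ' | while IFS= read -r ref; do
        idref=$(attr idref "$ref")
        item=$(printf '%s\n' "$tags" | grep '^item ' | grep -F " id=\"$idref\"" | head -n 1)
        href=$(attr href "$item")
        if [ -n "$href" ]; then
            # hrefs are relative to the directory that holds the OPF
            printf '%s\n' "$opf_dir/$href"
        fi
    done
}

# Example call after unzipping into /tmp/epub2html:
# epub_spine_files /tmp/epub2html | while IFS= read -r f; do cat "$f"; done > book.html
```

A real XML parser would be more robust than sed and grep, which is probably why this is easier in Java or Python.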

frabjous
08-27-2009, 12:44 AM
Thanks!

You might also want to check out jellby's script here:
http://www.mobileread.com/forums/showthread.php?t=51267

Waltarro
08-27-2009, 12:55 AM
You're right, I remember reading the post but saw the reference
to javascript so didn't think much of it... On the plus side I did
learn some new things about bash scripting so it wasn't a waste.

Thanks

Jellby
08-27-2009, 05:21 AM
The javascript is only used to have some kind of "reader" in the browser and to override the epub's CSS. If you only want to extract the XHTML files, you don't need it.
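
If extracting the XHTML files is all that is wanted, something along these lines might be enough. This is only a sketch, not the script from the linked thread. The function name extract_xhtml and the file names in the example call are made up, and it assumes the Info-ZIP unzip that most Linux systems ship.

```shell
#!/bin/bash
# Hypothetical helper: unpack only the (X)HTML documents of an epub.
# $1 is the epub file, $2 is the directory to extract into.
extract_xhtml() {
    local epub="$1" dest="$2"
    mkdir -p "$dest"

    # The patterns are quoted so that unzip, not the shell, expands them.
    # -o overwrites without prompting and -q keeps unzip quiet. unzip may warn
    # and return a nonzero status when a pattern matches nothing, so its
    # status is not treated as fatal here.
    unzip -o -q "$epub" '*.html' '*.htm' '*.xhtml' -d "$dest" 2> /dev/null

    # List what was extracted
    find "$dest" -type f | sort
}

# Example call (file names are made up):
# extract_xhtml mybook.epub ./mybook_html
```

The directory layout inside the epub is preserved, so relative links between the extracted files keep working, but images and stylesheets are left behind.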