View Full Version : splitting html files?


NASCARaddicted
01-20-2013, 05:32 AM
Hello, I hope you people can help me.

I want to convert an HTML file into an epub manually, without a converter like calibre (I love calibre, but I want to learn how to do the conversion myself).

I know it is recommended to split the HTML file into multiple parts (especially for older, slower e-readers). I could do it with cut and paste, but that becomes tedious with big files. Is there a program that does the splitting automatically? I want to split the HTML file at a certain tag (like <div class="xxx"> or <h2>).

I already found a small program called HTML Splitter (from around 2004). Basically, this program does what I want, but there is a problem. At the end of each part it adds an unwanted "br". Also, the closing tags "body" and "html" (and the unwanted br tag) are written in upper case. In XHTML they have to be lower case, so of course the resulting HTML parts are not valid XHTML.

Is there another program that does the same thing? Just splitting an XHTML file into multiple XHTML files at a certain tag?

Thanks in advance.

mrmikel
01-20-2013, 06:55 AM
Why not just use Sigil?

Press Ctrl+Enter at the end of each chapter, just before the following <h2> tag. It is also possible to do this through search and replace, adding <hr class="sigil_split_marker" /> at each split point, and then choosing Edit, Split At Markers.
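If the markers are to be added outside Sigil, the search and replace could look roughly like the Python sketch below. It is only an illustration: the function name add_split_markers is made up here, the regex assumes lower-case opening tags as in XHTML, and it skips the first <h2> on the assumption that nothing needs splitting off before the first chapter.

```python
import re

SPLIT_MARKER = '<hr class="sigil_split_marker" />'

def add_split_markers(html, tag="h2"):
    """Insert Sigil's split marker before every opening <tag> except the first.

    Hypothetical helper: a plain regex pass, not a real HTML parser.
    """
    pattern = re.compile(r"<%s\b" % re.escape(tag))
    seen = 0

    def repl(match):
        nonlocal seen
        seen += 1
        if seen == 1:
            # Leave the first heading alone; there is no chapter before it.
            return match.group(0)
        return SPLIT_MARKER + "\n" + match.group(0)

    return pattern.sub(repl, html)
```

Run it over a copy of the file's text, save the result, and then let Sigil split at the markers.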

Either way, work on a saved copy.

NASCARaddicted
01-20-2013, 07:38 AM
Maybe I missed something, but as far as I know, in Sigil you can only save a file as an epub? I want to be able to save it as HTML.

DiapDealer
01-20-2013, 08:15 AM
An epub is just a zipfile full of html files (among other things). Use Sigil to split the html file the way you want it, and then unzip the epub and snag the html files. You may have to fix some links afterward. I have to say, though, that that seems like a very long driveway to a small and rather unimpressive house.
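For the "unzip the epub and snag the html files" step, a small Python sketch using the standard zipfile module might look like this. The function name extract_html and the flat output folder are assumptions made for the example; files with the same name in different folders of the epub would overwrite each other.

```python
import zipfile
from pathlib import Path

def extract_html(epub_path, out_dir):
    """Copy every .html/.xhtml/.htm file out of an epub into out_dir.

    Hypothetical helper: flattens the folder structure inside the epub.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                target = out / Path(name).name
                target.write_bytes(zf.read(name))
                written.append(target)
    return written
```

For example, extract_html("mybook.epub", "parts") would leave the split files in a "parts" folder next to the script.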

You'll spend a lot of time looking for tools that will "automatically" help you construct an epub by hand. ;)

mrmikel
01-20-2013, 08:16 AM
That is true, you can only save as epub in Sigil. But epub is nothing more than a collection of html files and their associated images all zipped together.

In Sigil if you right click on any of these files you can select open with and open in any other editor you like.

Or you can use a zip program to open the epub and work with the files in any program you like... but when you zip them back up, you need to make sure they go in a certain order, with certain files stored uncompressed... which ones escapes me now. There is a Tweak Epub feature built into calibre that takes care of this.
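On the zipping order: the epub container format wants the file named mimetype to be the first entry in the archive and to be stored without compression; everything else can be compressed normally. A minimal Python sketch of re-zipping an unpacked epub folder under that rule follows. The name zip_epub is invented for the example, and it assumes the output file is written somewhere outside the folder being zipped.

```python
import zipfile
from pathlib import Path

def zip_epub(src_dir, epub_path):
    """Zip an unpacked epub folder, writing 'mimetype' first and uncompressed.

    Hypothetical helper: epub_path should not be inside src_dir.
    """
    src = Path(src_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype file goes first and is stored, not deflated.
        zf.write(src / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src).as_posix()
            if path.is_file() and rel != "mimetype":
                zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)
```

It is still worth running the result through an epub validator, since this only handles the packaging order and not the contents.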

Sorry to repeat... DiapDealer got in first!

If Sigil makes things too simple, you can stay in code view in Sigil and muck about in the html all you like. For me, I work in both views - code view to tweak and book view to preview. It is easier for me to join broken sentences in book view than code view.

meme
01-20-2013, 11:42 AM
You can also right click on any file or files in Sigil and use Save As to export them if you want to avoid unzipping.

dgatwood
01-20-2013, 03:00 PM
If you have a Perl interpreter, you could do something like this:

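#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl split.pl mybook.html
my $filename = shift @ARGV or die "Usage: $0 file.html\n";

# Slurp the whole file into one string.
open(my $in, '<', $filename) or die "Cannot open $filename: $!\n";
my $data = do { local $/; <$in> };
close($in);

# Split on the marker. The marker text itself is dropped from the output;
# a lookahead such as /(?=<h2)/ keeps the tag at the start of each part.
my @parts = split(/<splitmarker>/, $data);

# Write each part to outfile_1.html, outfile_2.html, ...
my $count = 1;
for my $part (@parts) {
    open(my $out, '>', "outfile_$count.html") or die "Cannot write outfile_$count.html: $!\n";
    print {$out} $part;
    close($out);
    $count++;
}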



Save it as split.pl, change "<splitmarker>" to match what you're splitting on, change the output filename if you want (currently outfile_1.html, outfile_2.html, ... outfile_n.html), and then run "perl split.pl mybook.html" or whatever.

You'll want to then go back and add the starting and ending <html> tags, <head> tags, etc. from the first file to each of the other files.
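That last step (copying the opening and closing tags from the first file into every part) can also be folded into the split itself. Here is a rough Python sketch of the idea, not a drop-in replacement for the Perl script: the name split_and_wrap is made up, it assumes a single <body> element, and every part ends up with the same <head> (and therefore the same <title>) as the original file.

```python
import re

def split_and_wrap(html, tag="h2"):
    """Split the body at each opening <tag> and wrap every part in the
    original file's header (up to <body>) and footer (from </body>).

    Hypothetical helper: regex based, so it assumes tidy markup.
    """
    open_match = re.search(r"<body\b[^>]*>", html, re.IGNORECASE)
    close_matches = list(re.finditer(r"</body\s*>", html, re.IGNORECASE))
    if open_match is None or not close_matches:
        raise ValueError("no <body>...</body> found")
    close_start = close_matches[-1].start()

    header = html[:open_match.end()]
    footer = html[close_start:]
    body = html[open_match.end():close_start]

    # The lookahead keeps the heading tag at the start of each piece.
    pieces = re.split(r"(?=<%s\b)" % re.escape(tag), body)
    return [header + "\n" + p.strip() + "\n" + footer for p in pieces if p.strip()]
```

Each returned string is a complete document that can be written to its own file, for example outfile_1.html, outfile_2.html and so on.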

neufsix
01-21-2013, 08:37 PM
On linux you can use csplit.

Jellby
01-22-2013, 04:13 AM
csplit alone will not output correct HTML files, as they will be missing the header and the final closing tags. But I use csplit for all my books. This is what I do:

1. Put the whole book (at least the main part; the title page, notes, etc. can be done separately) in a single XHTML file. Format as desired.

2. Add the head stuff before each chapter, i.e. something like:

</body>
</html>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:ops="http://www.idpf.org/2007/ops" xml:lang="en">
<head>
<title>Chapter IV</title>
<link href="css/style.css" type="text/css" rel="stylesheet" />
</head>
<body>

3. Now use csplit:

csplit book.xhtml /encoding/ '{*}'

This splits book.xhtml (substitute your own file name) at every ('{*}') appearance of the string "encoding", which is uncommon enough that it usually causes no problems. Then rename and move the resulting files (xx00, xx01, ...) to their final location. This part can be done with a script.
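The renaming script could be as simple as the Python sketch below. The function name rename_pieces, the chapterNN.xhtml naming scheme and the destination folder are all assumptions for the sake of the example. It skips empty pieces, since csplit can leave a zero-length first file when the very first line already matches the pattern.

```python
import re
import shutil
from pathlib import Path

def rename_pieces(src_dir, dest_dir, prefix="xx", template="chapter{:02d}.xhtml"):
    """Move csplit output (xx00, xx01, ...) to dest_dir under nicer names.

    Hypothetical helper: empty pieces are left where they are and not numbered.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    name_re = re.compile(re.escape(prefix) + r"(\d+)")

    pieces = []
    for path in src.iterdir():
        match = name_re.fullmatch(path.name)
        if path.is_file() and match:
            pieces.append((int(match.group(1)), path))

    moved = []
    number = 1
    # Sort by the numeric suffix so xx100 comes after xx99.
    for _, path in sorted(pieces):
        if path.stat().st_size == 0:
            continue
        target = dest / template.format(number)
        shutil.move(str(path), str(target))
        moved.append(target)
        number += 1
    return moved
```

For example, rename_pieces(".", "OEBPS/Text") would turn xx01, xx02, ... into OEBPS/Text/chapter01.xhtml, chapter02.xhtml and so on.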