Splitting HTML files?

NASCARaddicted · 01-20-2013, 06:32 AM

Hello, I hope you people can help me.

I want to convert an HTML file into an EPUB manually, without a converter like calibre (I love calibre, but I want to learn how to do the conversion myself).

I know it is recommended to split the HTML file into multiple parts (especially for older, slower ereaders). I could do it with cut and paste, but that becomes tedious with big files. Is there a program that does the splitting automatically? I want to split the HTML file at a certain tag (like <div class="xxx"> or <h2>).

I already found a small program called HTML Splitter (from around 2004). It basically does what I want, but there is a problem: it adds an unwanted <br> at the end of each part. Also, the closing body and html tags (and the unwanted br tag) are written in upper case. In XHTML they have to be lower case, so the resulting parts are not valid XHTML.

Is there another program that does the same thing, i.e. just splits an XHTML file into multiple XHTML files at a certain tag?

Thanks in advance.

mrmikel · 01-20-2013, 07:55 AM

Why not just use Sigil?

Press Ctrl+Enter at the end of each chapter, just before the following <h2> tag. It is also possible to do this with search and replace, adding <hr class="sigil_split_marker" /> before each heading and then choosing Edit, Split At Markers.

Either way, work on a saved copy.
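If the search-and-replace step gets repetitive, it can also be scripted before the file goes into Sigil. A minimal Python sketch, assuming the chapters start at plain <h2> tags; the function name add_split_markers is made up for illustration, only the marker string comes from the post above:

```python
import re

# The marker Sigil looks for when splitting at markers.
MARKER = '<hr class="sigil_split_marker" />'


def add_split_markers(html, tag="h2"):
    """Insert the split marker on its own line in front of every opening <tag>."""
    # Zero-width lookahead: nothing is removed, the marker is only inserted.
    pattern = r"(?=<%s\b)" % re.escape(tag)
    return re.sub(pattern, MARKER + "\n", html)
```

Note that this also puts a marker in front of the very first heading, so whatever precedes it ends up in its own file.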

NASCARaddicted · 01-20-2013, 08:38 AM

Maybe I missed something, but as far as I know, Sigil can only save a file as EPUB? I want to be able to save it as HTML.

DiapDealer · 01-20-2013, 09:15 AM

An epub is just a zipfile full of html files (among other things). Use Sigil to split the html file the way you want it, and then unzip the epub and snag the html files. You may have to fix some links afterward. I have to say, though, that that seems like a very long driveway to a small and rather unimpressive house.
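The "unzip and snag" step can be done with any zip tool. As a rough Python sketch (the function name extract_html and the list of extensions are my own assumptions, not anything Sigil prescribes):

```python
import zipfile
from pathlib import Path


def extract_html(epub_path, dest_dir):
    """Extract only the (X)HTML members of an EPUB into dest_dir; return their names."""
    wanted = (".html", ".htm", ".xhtml")
    with zipfile.ZipFile(epub_path) as zf:
        names = [n for n in zf.namelist() if n.lower().endswith(wanted)]
        # The folder layout inside the EPUB is kept under dest_dir.
        zf.extractall(dest_dir, members=names)
    return names
```

Calling extract_html("mybook.epub", "out") would leave the split files under out/, in whatever subfolders the EPUB used for them.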

You'll spend a lot of time looking for tools that will "automatically" help you construct an epub by hand.

mrmikel · 01-20-2013, 09:16 AM

That is true, you can only save as epub in Sigil. But epub is nothing more than a collection of html files and their associated images all zipped together.

In Sigil, if you right-click on any of these files you can select Open With and open it in any other editor you like.

Or you can use a zip program to open the epub and work with the files in any program you like... but you need to make sure they are zipped back up in a certain order, with certain files not compressed... which ones escapes me now. calibre has a Tweak ePub tool built in that takes care of this.
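For reference, the file in question is mimetype: it has to be the first entry in the archive and must be stored without compression; everything else can be compressed normally. A minimal Python sketch of zipping an unpacked EPUB folder back up that way (build_epub is a made-up name, and the output file is assumed to live outside the source folder):

```python
import zipfile
from pathlib import Path


def build_epub(src_dir, epub_path):
    """Zip an unpacked EPUB folder with 'mimetype' first and uncompressed."""
    src = Path(src_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        # First entry: mimetype, stored as-is (no compression).
        zf.write(src / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        # All remaining files, compressed, with paths relative to the folder.
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src).as_posix()
            if path.is_file() and rel != "mimetype":
                zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)
```

This only handles the packaging; it does not check that the OPF, NCX, and so on inside the folder are valid.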

Sorry to repeat... DiapDealer got in first!

If Sigil makes things too simple, you can stay in code view in Sigil and muck about in the html all you like. For me, I work in both views - code view to tweak and book view to preview. It is easier for me to join broken sentences in book view than code view.

meme · 01-20-2013, 12:42 PM

You can also right click on any file or files in Sigil and use Save As to export them if you want to avoid unzipping.

dgatwood · 01-20-2013, 04:00 PM

If you have a Perl interpreter, you could do something like this:

Code:

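#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl split.pl mybook.html
my $filename = shift @ARGV or die "Usage: $0 file.html\n";

# Slurp the whole file into one string.
open(my $in, '<', $filename) or die "Cannot open $filename: $!";
my $data = do { local $/; <$in> };
close($in);

# split() throws the marker itself away; use a lookahead such as
# /(?=<h2)/ if the tag should stay at the start of each part.
my @parts = split(/<splitmarker>/, $data);

my $count = 1;
for my $part (@parts) {
    open(my $out, '>', "outfile_$count.html") or die "Cannot write outfile_$count.html: $!";
    print {$out} $part;
    close($out);
    $count++;
}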

Save it as split.pl, change "<splitmarker>" to match what you're splitting on, change the output filename if you want (currently outfile_1.html, outfile_2.html, ... outfile_n.html), and then run "perl split.pl mybook.html" or whatever.

You'll want to then go back and add the starting and ending <html> tags, <head> tags, etc. from the first file to each of the other files.
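That last step (copying the header and closing tags into every part) can be folded into the split itself. A rough Python sketch of the idea, not a drop-in replacement for the Perl script: it assumes a single well-formed file with one <body>, chapters that start at a top-level <h2>, and Python 3.7 or later (for splitting on a zero-width match). The name split_xhtml is made up.

```python
import re


def split_xhtml(text, tag="h2"):
    """Split one XHTML document into complete documents, one per <tag> section."""
    body_open = re.search(r"<body[^>]*>", text, re.IGNORECASE)
    body_close = re.search(r"</body\s*>", text, re.IGNORECASE)
    if not body_open or not body_close:
        raise ValueError("no <body>...</body> found")
    header = text[:body_open.end()]     # XML declaration, doctype, <head>, <body>
    footer = text[body_close.start():]  # </body> and </html>
    body = text[body_open.end():body_close.start()]
    # Zero-width lookahead: cut just before each opening tag and keep the tag.
    chunks = re.split(r"(?=<%s\b)" % re.escape(tag), body)
    # Skip whitespace-only chunks and wrap the rest in the shared header/footer.
    return [header + chunk + footer for chunk in chunks if chunk.strip()]
```

Each returned string can then be written to its own file. Every part gets the same <title> as the original, so that still needs editing by hand if it matters.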

neufsix · 01-21-2013, 09:37 PM

On Linux you can use csplit.

Jellby · 01-22-2013, 05:13 AM

csplit alone will not output correct HTML files, as they will be missing the header and the final closing tags. But I use csplit for all my books; this is what I do:

1. Put the whole book (at least the main part; title page, notes, etc. can be done separately) in a single XHTML file. Format as desired.

2. Add the closing tags of the previous chapter plus the head stuff of the next one before each chapter, i.e. something like:

Code:

</body>
</html>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:ops="http://www.idpf.org/2007/ops" xml:lang="en">
<head>
  <title>Chapter IV</title>
  <link href="css/style.css" type="text/css" rel="stylesheet" />
</head>
<body>

3. Now use csplit:

Code:

csplit book.xhtml /encoding/ '{*}'

Here book.xhtml stands for whatever the single-file book is called. This splits at every ('{*}') occurrence of the string "encoding", which is uncommon enough in running text to usually give no problems. Then rename and move the resulting files (xx00, xx01, ...) to their final location. This part can be done with a script.
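The rename-and-move script could look roughly like this in Python. It is only a sketch: the target naming (chapter_01.xhtml and so on) and the function name rename_parts are assumptions, and empty pieces are deleted on purpose, since the first piece can come out empty when the pattern sits on the very first line.

```python
import shutil
from pathlib import Path


def rename_parts(src_dir, dest_dir, prefix="chapter"):
    """Move csplit pieces (xx00, xx01, ...) to dest_dir as prefix_01.xhtml, ..."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    pieces = [p for p in Path(src_dir).iterdir()
              if p.name.startswith("xx") and p.name[2:].isdigit()]
    moved = []
    number = 1
    # Sort numerically so that xx100 comes after xx99, not after xx10.
    for piece in sorted(pieces, key=lambda p: int(p.name[2:])):
        if piece.stat().st_size == 0:
            piece.unlink()  # drop empty pieces instead of moving them
            continue
        target = dest / f"{prefix}_{number:02d}.xhtml"
        shutil.move(str(piece), str(target))
        moved.append(target.name)
        number += 1
    return moved
```

For example, rename_parts(".", "OEBPS/Text") would move the pieces from the current folder; the destination path here is just a placeholder.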

