View Full Version : How to automatically split (x)html in epub?


osnova
03-08-2012, 02:13 PM
Let's say I have created an epub by hand with only one large (x)html file for the text (so, not up to epub spec). This file is very large, and each chapter/section starts with an <h2> or some other consistent tag. Furthermore, the file has thousands of internal hyperlinks.

Question: Is there a tool that would split this (x)html into many files (one per chapter/section) and modify the internal hyperlinks accordingly? I don't want any other changes to the tags, though.

mmat1
03-08-2012, 02:35 PM
Let's say I have created an epub by hand with only one large (x)html file for the text (so, not up to epub spec).

You can do it with sigil.

First exchange "<h2"
to "<hr class="sigilChapterBreak" /><h2"

Then press F6

Done.
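
For anyone who prefers to insert the marker before opening the file in Sigil, the same replacement can be sketched on the command line. This is only a sketch: the function name mark_chapter_breaks is made up for illustration, and it assumes every `<h2` in the file really starts a chapter.

```shell
#!/bin/bash
# Hypothetical helper: put Sigil's chapter-break marker in front of every <h2.
# Reads the named files, or standard input if none are given, and writes to stdout.
mark_chapter_breaks() {
    sed 's|<h2|<hr class="sigilChapterBreak" /><h2|g' "$@"
}

# Example (file names are placeholders):
#   mark_chapter_breaks big_file.xhtml > big_file_marked.xhtml
```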

btw. "(so, not up to epub spec)" What's this ?? :). How long is it ??

DiapDealer
03-08-2012, 02:42 PM
You can do that with Sigil... however... I would not claim that it will absolutely leave all other tags (not to mention the formatting of the xhtml) alone. Turning off HTMLTidy will minimize the changes made to your code, but quite simply put... stuff's going to get changed (in addition to that which is necessary to accommodate the splitting/link-maintaining).

osnova
03-08-2012, 04:30 PM
You can do it with sigil.

First exchange "<h2"
to "<hr class="sigilChapterBreak" /><h2"

Then press F6

Done.

Thank you so much. I'll try it. The last time I looked at Sigil (about a year ago), it choked on such large files. I usually do everything by hand (using emeditor).

btw. "(so, not up to epub spec)" What's this ?? :). How long is it ??

If you look at the link to the OSNOVA List below, you'll see that I tend to make huge book collections and books. For example, a Bible commentary that has 9 huge volumes plus the Bible itself in one file (http://www.amazon.com/dp/B0073GPIFQ/). Or the works of Jonathan Edwards (http://www.amazon.com/Collection-Jonathan-Edwards-OSNOVA-ebook/dp/B004XEKE4Q/). I'd like to convert all my mobi files to epubs as well.

osnova
03-08-2012, 04:35 PM
You can do that with Sigil... however... I would not claim that it will absolutely leave all other tags (not to mention the formatting of the xhtml) alone.

I just want to avoid the kind of radical changes that Calibre makes, for example.

mmat1
03-08-2012, 05:55 PM
The last time I looked at Sigil (was about a year ago), it coughed on such large files. I usually do everything by hand (using emeditor).


Fine. DiapDealer is right, Sigil will make some changes. But that's nothing compared to the "formatting" calibre creates.

The epub will be a bit smaller than the mobi. I made one (just out of curiosity) that is 53 MB. It works, at least on a PocketBook mobile device. Up to 20 MB does not seem to be a critical size.

DiapDealer
03-08-2012, 06:42 PM
Oh yes. There'll be no drastic changes like you'd get if you converted with calibre. I didn't mean to imply that. I just didn't know how much of a perfectionist I might be dealing with, so I erred on the side of caution. ;)

Sigil can still be quite sluggish when dealing with very large, single html files (more so in Book View than Code View, but still...), but the more you split it up... the more responsive it becomes. Depending on how huge your file is, it could still be quite painful—and possibly lock up. But I certainly wouldn't be afraid to try it. :)

osnova
03-08-2012, 06:54 PM
I made one (just for curiosity), which has 53Mb

Did you use Sigil for this one?

mmat1
03-09-2012, 03:45 AM
Did you use Sigil for this one?

Yes, it gets slow, but it works.

osnova
03-09-2012, 02:18 PM
Just reporting that Sigil worked as you described even with a large file (it took a while though). I wish it were using multithreading because I have many CPU cores and only one (?) was taken up by the process. Anyway, thank you.

JSWolf
03-10-2012, 05:04 PM
Run Sigil without loading any files at all. Turn off Tidy, then load your ePub and do all the splitting. That will make the fewest harmful changes.

SBT
03-11-2012, 05:05 PM
Under unix-type operating systems (incl. OSX), you could use the csplit command, e.g.
csplit -f "chapters/" -b "%2.2d.xhtml" big_file.xhtml "/<h2/" "{*}"
That'll split your file into chapters/00.xhtml, chapters/01.xhtml, ...
However, everything before the first <h2> tag ends up in 00.xhtml, and the other files lack the enclosing <html><head>...</body></html> tags. Of course, a few shell commands can fix that, but I'll leave that as an exercise to the reader ;)
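
One possible way to do that fix-up is sketched below. It is not a tested part of the command above: the helper names extract_head and wrap_chunk are invented for illustration, and it assumes the `<body ...>`, `</body>` and `</html>` tags each sit on a line of their own.

```shell
#!/bin/bash
# Sketch only: assumes <body ...>, </body> and </html> each sit on their own line.

# Hypothetical helper: print the document head, i.e. every line up to and
# including the first line that contains <body.
extract_head() {
    sed '/<body/q' "$1"
}

# Hypothetical helper: wrap one headless chunk. Prints the saved head, then the
# chunk without any closing tags it may already carry (the last chunk does),
# then fresh closing tags.
wrap_chunk() {
    local head_file=$1 chunk=$2
    cat "$head_file"
    grep -v -e '</body>' -e '</html>' "$chunk" || true
    printf '%s\n' '</body>' '</html>'
}
```

The head would be saved once from chapters/00.xhtml with extract_head, and wrap_chunk would then be run on 01.xhtml onwards, writing each result to a new file. 00.xhtml already has the head, so it only needs the two closing tags appended.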

Toxaris
03-12-2012, 02:19 AM
But this will not help you if you have internal links...

SBT
03-12-2012, 05:20 AM
Well, that's another trivial scripting exercise left for the reader, then :D
( ~10 lines will do it)

SBT
03-12-2012, 07:37 AM
This seems to do the job:

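#!/bin/bash
# Pass the big XHTML file as the first argument. Assumes GNU sed, grep and csplit.
mkdir -p chapters
# Close the document before every <h2 line, re-emit the head (line 1 through <body)
# in front of each <h2, then cut after each </html> so every chapter is a complete file.
sed '/<h2/s/^/<\/body>\n<\/html>\n/' "$1" | sed -n '1,/<body/{1h;1!H};/<h2/{x;p;x};p' | csplit -f "chapters/" -b "%2.2d.xhtml" - "/<\/html>/+1" "{$(( $(grep -c '<h2' "$1") - 1 ))}"
cd chapters || exit 1
# For every fragment link (href="#target"), prefix the name of the other chapter file
# that defines that id or name; links whose target is in the same file stay as they are.
for f in ??.xhtml
do for t in $(grep -ho "href=.#[^\"']\+" "$f"|cut -c8-180)
do sed -i "s/\([\"']\)#${t}['\"]/\1$(grep -l "\(name\|id\)=['\"]${t}['\"]" ??.xhtml|grep -v "${f}")#${t}\1/" "$f"
done
done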
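
A quick sanity check after the links have been rewritten might look like the sketch below. The function name dangling_links is hypothetical, and it assumes attribute values are double-quoted. It lists every same-file link (href="#target") in a chapter whose target id or name is not defined in that chapter, so empty output for every chapter suggests that no link was left pointing at a target that moved to another file.

```shell
#!/bin/bash
# Hypothetical helper: list fragment-only links in file $1 whose target
# id="..." or name="..." does not appear in $1. Assumes double-quoted attributes.
dangling_links() {
    local f=$1 t
    grep -o 'href="#[^"]*"' "$f" | sed 's/^href="#//; s/"$//' | sort -u |
    while IFS= read -r t; do
        # -F treats the target as a fixed string, so dots etc. are not regex syntax.
        grep -q -F -e "id=\"$t\"" -e "name=\"$t\"" "$f" || printf '%s\n' "$t"
    done
}

# Example: check every chapter produced by the split.
#   for f in chapters/??.xhtml; do dangling_links "$f"; done
```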

osnova
03-13-2012, 06:00 PM
Thank you, SBT. I'll try your approach as well.