Prevent pagebreak between two html files


thydere
06-08-2011, 09:26 AM
Hello :),

since my Sony ereader seems to have problems with large HTML files, I've had to break them down into smaller files.
Which brings me to the problem that ADE puts a hard page break between two text paragraphs located in two consecutive HTML files (which is fine if each file contains a different chapter, but not if the files are split purely for technical reasons).

An epub with the following opf entries:
<item id="section-1_part1" href="section-1_part1.html" media-type="application/xhtml+xml"/>
<item id="section-1_part2" href="section-1_part2.html" media-type="application/xhtml+xml"/>
[...]
<itemref idref="section-1_part1"/>
<itemref idref="section-1_part2"/>
has a hard page-break between the last page of section-1_part1.html and the first page of section-1_part2.html.

Is there a solution or a hack to prevent page breaks between consecutive HTML files, so the text flows as if the two files were one?

HarryT
06-08-2011, 09:40 AM
I don't think there's any way of avoiding this, unfortunately. Could you not combine the two HTML files into one to avoid it, or have the break occur at a different place where it wouldn't matter?

thydere
06-08-2011, 10:23 AM
Combining the HTML files would bring back the aforementioned problem of not being able to open the epub on the Sony reader - or on any other reader that imposes a file-size limit on epub content.

And since I ran into this problem while writing an epub generation library (my intention is to glue a document parser - e.g. Markdown - to an epub generator backend, so an article/book/document can be written in a simple text-based document language), you can understand that hand-picking the page breaks, while being the only reasonable solution, might not be a desirable or even viable option for an automated process. :(

Seems I have to live with that inconvenience...

Anyway, thanks for the info Harry.

Toxaris
06-08-2011, 03:05 PM
The limit for one XHTML file is around 300 KB uncompressed; let's say 265 KB to be safe. If you keep your files under that limit, you should be fine. Most people here split the files at each chapter. There are books with chapters larger than 300 KB, but not that many.

thydere
06-09-2011, 02:37 AM
Splitting the content at chapter boundaries is already done, since that naturally mimics the design most books follow. I just wanted a contingency plan for those documents that have larger - or no - chapters, which currently is to cut in front of any element whose size, combined with the size of the preceding elements, exceeds 256 KB (thanks for the size info - when I finished the epub part yesterday I initially set the limit to 128 KB, since I noticed it not working with larger files, and until now hadn't taken the time to pin down the exact size).
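In code, that contingency plan is roughly the following (a rough sketch only - `split_blocks` and the 256 KB constant are placeholder names of my own, not from any existing library): greedily pack serialized block elements into chunks, starting a new file whenever the next element would push the current chunk past the budget.

```python
SIZE_LIMIT = 256 * 1024  # byte budget per output file, per the limit above

def split_blocks(blocks, limit=SIZE_LIMIT):
    """blocks: list of serialized HTML block elements (strings).

    Returns a list of chunk strings, each at most `limit` bytes
    (unless a single block alone already exceeds the limit).
    """
    chunks, current, current_size = [], [], 0
    for block in blocks:
        size = len(block.encode("utf-8"))
        # flush the current chunk if this block would overflow it
        if current and current_size + size > limit:
            chunks.append("".join(current))
            current, current_size = [], 0
        current.append(block)
        current_size += size
    if current:
        chunks.append("".join(current))
    return chunks
```

Note that a single oversized block still ends up in a chunk of its own; handling that case would require splitting inside an element, which is exactly the harder problem discussed below.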

That being said, please don't get me wrong - it was never my intention to always store content in one big file. Apart from the obvious chapter page break, it's also good practice for technical reasons: navigating (i.e. jumping directly to specific points) imposes fewer constraints on the reader's hardware if the navigation points are located in smaller files. Which is a good enough reason for me.

But as always there's an exception to the rule: the book Flowers for Algernon by Daniel Keyes contains no page breaks at all, since the book is organized as the protagonist's diary (I only own the hardcopy, though I'd be interested in how they'd manage that in a digital version - if they'd do it that way at all).

My intention was to cover those cases as well - if only just on principle alone ;).

HarryT
06-09-2011, 04:52 AM
I guess that all you can really do in that case is look for suitable points at which to split the file. You could split immediately after an image, or immediately before <Hx> tags, for example.
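Something along these lines, as a rough sketch (the `HeadingFinder` class name is just illustrative, using Python's stdlib html.parser): record the position of each heading tag as a candidate split point.

```python
from html.parser import HTMLParser

class HeadingFinder(HTMLParser):
    """Collect the positions of <h1>-<h6> start tags as split candidates."""

    def __init__(self):
        super().__init__()
        self.split_points = []  # (line, column) of each heading start tag

    def handle_starttag(self, tag, attrs):
        if tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self.split_points.append(self.getpos())

finder = HeadingFinder()
finder.feed("<p>intro</p>\n<h2>Chapter 1</h2><p>text</p>")
print(finder.split_points)  # e.g. [(2, 0)]
```

The splitter would then cut the source at the nearest recorded position below the size budget, instead of mid-paragraph.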

thydere
06-10-2011, 07:25 AM
Along the way I'll probably have to semantically analyze the HTML content and do something like that, though I'll drop the issue for the moment until it really becomes a problem and I have more real-world test cases on which to decide how to proceed.

Fortunately I do not want to create a general-purpose epub creation program, but a backend library that is intended to be glued to a frontend document parser. The difference is that while a converter like Calibre or Stanza has to recreate/guess the document structure (with a little help from the user in Calibre's case), I expect the already-created structure together with the sectionized content. The post-processing work from that point is relatively simple: just create the html/toc/stylesheet/image/whatchamacallit files making up the OEBPS part of the epub from the document structure.
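One fixed detail of that packaging step, sketched with Python's stdlib (the function and file names are illustrative): the epub container is a ZIP archive whose first entry must be a file named "mimetype", stored uncompressed, containing "application/epub+zip".

```python
import zipfile

def write_epub(path_or_file, files):
    """files: dict mapping archive paths to bytes/str content."""
    with zipfile.ZipFile(path_or_file, "w") as zf:
        # the mimetype entry must come first and must not be compressed
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

Readers that enforce this check only the first few bytes of the archive, which is why an otherwise valid zip with the mimetype in the wrong place can be rejected.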

The big work lies mostly with the frontend and the processing pipeline in the middle. It takes a text document, runs it through the appropriate parser (Markdown in my case, but that's relatively exchangeable as long as there's HTML plus processing instructions at the end), then parses the resulting HTML looking for XInclude / XML preprocessing directives which describe the further processing of the document (including external sections into the text, resizing images to fit the proper resolution, creating images/graphs from inline definitions, including references, running some external program and including the result, cooking coffee, whatever). This process (hopefully) generates a wealth of information about the content of the files, which essentially becomes the structural metadata used by the epub backend to create the ebook - and gives it pointers on where exactly to cut the text into pieces.
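Stripped of the specifics, that pipeline is just function composition - each stage maps document text to document text, applied in order (the stages below are trivial stand-ins, not the real parser or preprocessors):

```python
def run_pipeline(text, stages):
    """Apply each stage function to the text in order."""
    for stage in stages:
        text = stage(text)
    return text

stages = [
    lambda t: t.replace("*bold*", "<b>bold</b>"),  # stand-in for the Markdown parser
    lambda t: t,                                   # stand-in for the XInclude/preprocessing pass
]
print(run_pipeline("some *bold* text", stages))  # -> some <b>bold</b> text
```

In the real library each stage would also emit structural metadata alongside the text, rather than returning a bare string.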

What that means is that I try to solve the issue by declaring it to be the problem of the person writing the front end parser (uhm... which will actually be me again - I knew there was a hole in my theory :smack:).

Which doesn't mean that HTML files cannot be preprocessed and used as input - in fact, for my first prototype I used a simple XHTML frontend that works similarly to what you proposed (creating a content tree by parsing for the hx elements, copying over the dc elements and adapting those that differ in their epub form, ...), tried it on some of the HTML-ized ebooks in my collection and got some nice results out of it.
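A minimal sketch of that hx parsing step, assuming Python's stdlib html.parser (`OutlineBuilder` is an invented name): collect the headings into a flat (level, title) outline, from which the nested content tree can then be built.

```python
from html.parser import HTMLParser

class OutlineBuilder(HTMLParser):
    """Collect <h1>-<h6> headings as a flat (level, title) outline."""

    def __init__(self):
        super().__init__()
        self.outline = []   # list of (level, title) tuples
        self._level = None  # heading level currently open, if any

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self._level = int(tag[1])
            self.outline.append((self._level, ""))

    def handle_data(self, data):
        if self._level is not None:
            level, title = self.outline[-1]
            self.outline[-1] = (level, title + data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self._level = None

b = OutlineBuilder()
b.feed("<h1>Book</h1><h2>Chapter 1</h2><p>text</p>")
print(b.outline)  # -> [(1, 'Book'), (2, 'Chapter 1')]
```

Nesting the flat list into a tree is then a matter of comparing each entry's level with its predecessor's.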


Once again, thank you both for your input :).