Any way to force page breaks when converting HTML to EPUB - Page 2

llasram · 01-21-2009, 02:24 PM

Quote:

Originally Posted by akash

I have no idea how the reader manages pagination of the text. I know that it's possible to insert a page break in an RTF and the Reader will break the page accordingly for a Calibre conversion to ePub.

Do you mean "the screenful of text I see on the Reader, and navigate to the next one via pressing the 'forward' and 'backward' buttons"? Or do you mean "the 'pages' the Reader indicates by the little page numbers in the margins"?

If you mean the former, then there's nothing to be done really, other than use a fixed-page-oriented format like PDF instead. If you mean the latter, that's where the Adobe page-map stuff will help you out[*]. You'll need to do some custom modification of your markup and/or conversion process, but I'd be willing to at least help get you started.

* Although googling for more info on it revealed that it's given some IDPF people a case of the hissy fits and they want it to die die die. And I can't really blame them, as it seems that NCX already supports almost exactly the same information.
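
To make the page-map idea concrete, here is a minimal Python sketch that writes one. It is an illustration, not a tested recipe: the `page-map`/`page` element names, the `name` and `href` attributes, and the namespace are as I recall them from Adobe's best-practices guide and should be checked against it. The file name `chapter1.xhtml` and the `pageN` anchor ids are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace as recalled from Adobe's guide; treat it as an assumption.
OPF_NS = "http://www.idpf.org/2007/opf"


def build_page_map(pages):
    """Return page-map XML (as a string) for (page_name, href) pairs."""
    # Serialize with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", OPF_NS)
    root = ET.Element(f"{{{OPF_NS}}}page-map")
    for name, href in pages:
        ET.SubElement(root, f"{{{OPF_NS}}}page", {"name": str(name), "href": href})
    return ET.tostring(root, encoding="unicode")


# Hypothetical targets: one anchor per printed page.
pages = [(n, f"chapter1.xhtml#page{n}") for n in range(1, 4)]
page_map_xml = build_page_map(pages)
print(page_map_xml)
```

Each `href` has to point at an anchor that really exists in the content document, so the markup needs those anchors added first. As far as I understand the guide, the result is saved as its own XML file inside the EPUB, declared in the OPF manifest and referenced from the `<spine>` element; check the guide for the exact attribute and media type.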

llasram · 01-21-2009, 02:29 PM

Quote:

Originally Posted by kovidgoyal

I suspect that's one part of the OPS spec that's going to change. It's rather ridiculous to not support JavaScript, and in a few years, when portable devices are powerful enough to handle JavaScript, it will make absolutely no sense.

I don't know... Is the "should not" because requiring scripting would limit the devices books could "run" on, or because the IDPF doesn't think scripting is really appropriate for book-like content? Terrible grammar aside, the OPS spec has this bit in section 2.5.1 (General Notes on SVG Usage):

Quote:

OPS supports the full SVG 1.1 Recommendation. The only exception is that since OPS is not targeting interactive content. SVG animation and scripting features are not supported and must not be used by publication authors; a Reading System should not render such content. [italics added]

kovidgoyal · 01-21-2009, 02:37 PM

Quote:

Originally Posted by llasram

I don't know... Is the "should not" because requiring scripting would limit the devices books could "run" on, or because the IDPF doesn't think scripting is really appropriate for book-like content? Terrible grammar aside, the OPS spec has this bit in section 2.5.1 (General Notes on SVG Usage):

Regardless of the IDPF's reasons for doing this, ebooks are going to become truly digital, which means they will become interactive. Game books, alternate storylines, author-provided alternate look and feel: all of this is easy to implement using scripting. As people's conception of the ebook moves further and further from the pbook, I predict that the decision to leave out scripting is going to seem more and more short-sighted.

I suspect the reason for disallowing scripting in the current iteration of the spec is simply to not set the bar too high for viewers. But who knows...

Bierkonig · 01-21-2009, 04:50 PM

Yes, I'm talking about the page numbers AdobeDE uses to delimit the text rather than pages as "screens full of text" (which obviously change when you change font sizes, though the location of page numbers within the document does not).

AdobeDE is turning the HTML OCR from 4 scanned pages into 5 pages of ePub. I want to figure out if there's a way to build a document (from HTML OCR source) where those 4 pages end up as a 4-page document and the page breaks are where the original page breaks were. Currently those page breaks are denoted as <hr> in the HTML output from the OCR.

I'm not using 4- or 5-page documents but rather 2000- and 3000-page reference manuals. I want the ability to go to page 1773 within the document in the reader and read the same sentence that would be at the top of page 1773 of the scanned paper. The pages in those manuals contain too much text to read on a single Reader page unless it were in a 6pt font, so I want the ability to read a few screens full of text for a single page of scanned input, and then, without any blank space, start the next page of scanned input (with the appropriate page number in the right margin).

I know that nearly absolute page break (page content) control is a feature of PDF. But PDF is so inefficient and slow, and ABBYY FineReader's HTML output of the OCR is much, much better in reflowable formatting than the PDF output.

I'm reading the ePub best practices document pages on page map with interest, but think I'm a little bit over my head in terms of implementation.

Thanks very, very much for any further guidance.

llasram · 01-21-2009, 06:03 PM

Quote:

Originally Posted by Bierkonig

I'm reading the ePub best practices document pages on page map with interest, but think I'm a little bit over my head in terms of implementation.

Thanks very, very much for any further guidance.

Do you know any programming languages? (Python...?)

Can you tell from examining the HTML ABBYY FineReader produces how it's indicating the beginning/end of pages? If it has a standard, simple way of doing it, I might write a general-purpose tool for adding the page-map (and/or NCX pageList).
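
For the NCX side, the `pageList` part could be sketched like this. The element and attribute names follow the DAISY NCX schema as I understand it. The `pageN` ids, the `book.xhtml` hrefs and the `first_play_order` parameter are assumptions for illustration. In a real NCX the `playOrder` values have to fit into the sequence already used by the `navMap`, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"


def _q(tag):
    """Qualify a tag name with the NCX namespace."""
    return f"{{{NCX_NS}}}{tag}"


def build_page_list(pages, first_play_order=1):
    """Return an NCX <pageList> element for (page_number, href) pairs."""
    ET.register_namespace("", NCX_NS)
    page_list = ET.Element(_q("pageList"))
    for offset, (number, href) in enumerate(pages):
        target = ET.SubElement(page_list, _q("pageTarget"), {
            "id": f"page{number}",          # hypothetical id scheme
            "type": "normal",
            "value": str(number),
            "playOrder": str(first_play_order + offset),
        })
        label = ET.SubElement(target, _q("navLabel"))
        ET.SubElement(label, _q("text")).text = str(number)
        ET.SubElement(target, _q("content"), {"src": href})
    return page_list


page_list = build_page_list(
    [(1, "book.xhtml#page1"), (2, "book.xhtml#page2")], first_play_order=10
)
print(ET.tostring(page_list, encoding="unicode"))
```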

Bierkonig · 01-21-2009, 07:12 PM

Alas, no programming languages, but I'm getting a little better at adapting found code as a template.

The form of the ABBYY output is very straightforward....

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=WINDOWS-1252">
<meta name="generator" content="ABBYY FineReader 9.0">
<meta name="author" content="">
<meta name="description" content="">
<meta name="keywords" content="">

<title></title>
<style type="text/css">
table.main {}
tr.row {}
td.cell {}
div.block {}
div.paragraph {}
.font0 { font:6.00pt "Arial", sans-serif; }
.font1 { font:40.00pt "Arial", sans-serif; }
.font2 { font:5.00pt "Arial Narrow", sans-serif; }
.font3 { font:6.00pt "Arial Narrow", sans-serif; }
.font4 { font:7.00pt "Arial Narrow", sans-serif; }
.font5 { font:8.00pt "Arial Narrow", sans-serif; }
.font6 { font:11.00pt "Arial Narrow", sans-serif; }
.font7 { font:12.00pt "Arial Narrow", sans-serif; }
.font8 { font:13.00pt "Arial Narrow", sans-serif; }
.font9 { font:15.00pt "Arial Narrow", sans-serif; }
.......
</style>
</head>

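<body>
<p></p>
<p><span class=font9>CHAPTER I</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<hr>
<p></p>
<p><span class=font9>CHAPTER I</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<hr>
<p></p>
<p><span class=font9>CHAPTER I</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<hr>
<p></p>
<p><span class=font9>CHAPTER 2</span></p>
<p><span class=font6>text</span></p>
<p><span class=font3>text</span></p>
<p><span class=font4>text</span></p>
<p><span class=font2>text</span></p>
<hr>
<p><span class=font9>text</span></p>
<p><span class=font8>text</span></p>
<p><span class=font4>text</span></p>
<p><span class=font9>text</span></p>
<hr>

Thus: that would be pages 1-5. Each chapter begins with <p></p>

Each page break is represented by <hr>

That's it.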
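
Given that layout, a rough Python sketch of the first step, turning each <hr> into a page anchor, might look like the following. It assumes every <hr> in the body closes one scanned page (so the page count equals the number of <hr> tags, as in the sample) and that no <hr> is used for anything else. The `add_page_anchors` name and the `pageN` id scheme are made up for the example. It uses plain regular expressions rather than an HTML parser, which is only reasonable because the ABBYY output is this regular.

```python
import re

HR = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)
BODY = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def anchor(page):
    """Empty anchor that a page-map or NCX entry can point at."""
    return f'<a id="page{page}"></a>'


def add_page_anchors(html, first_page=1):
    """Replace <hr> page breaks with anchors; return (new_html, page_numbers)."""
    body = BODY.search(html)
    if body is None:
        return html, []
    # Only look at <hr> tags that come after the opening <body> tag.
    breaks = list(HR.finditer(html, body.end()))
    if not breaks:
        return html, []
    pages = list(range(first_page, first_page + len(breaks)))
    out = [html[:body.end()], anchor(pages[0])]
    pos = body.end()
    # Each <hr> ends one page; all but the last also start the next one.
    for match, next_page in zip(breaks, pages[1:] + [None]):
        out.append(html[pos:match.start()])
        if next_page is not None:
            out.append(anchor(next_page))
        pos = match.end()
    out.append(html[pos:])
    return "".join(out), pages


sample = "<html><body><p>one</p><hr><p>two</p><HR><p>three</p><hr></body></html>"
new_html, pages = add_page_anchors(sample)
print(pages)
print(new_html)
```

Dropping the <hr> itself keeps the text flowing without a visible break, which matches the "without any blank space" requirement. The returned page numbers are what a page-map or NCX pageList generator would need next.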

kovidgoyal · 01-21-2009, 07:15 PM

@llasram If you want to do this, the best way would be to add another option to html2epub, --page-boundaries, that would accept an XPath selector.
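
As a sketch of what selector-driven boundaries could mean in practice: the function below takes an XPath-style selector and tags every matching element with a page id. It is not calibre code, and `--page-boundaries` is only the proposal above, not an option I know to exist. The sketch uses the standard library's ElementTree, which supports only a subset of XPath and needs well-formed XHTML (lxml would be the choice for full XPath). The `mark_page_boundaries` name, the `h:` prefix and the `pageN` ids are assumptions.

```python
import xml.etree.ElementTree as ET

XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}


def mark_page_boundaries(xhtml, selector=".//h:hr", first_page=2):
    """Give each element matched by `selector` an id naming the page it opens.

    first_page defaults to 2 on the assumption that the first boundary
    ends page 1 and therefore opens page 2.
    """
    root = ET.fromstring(xhtml)
    hrefs = []
    matches = root.findall(selector, XHTML_NS)  # document order
    for number, element in enumerate(matches, start=first_page):
        element.set("id", f"page{number}")
        hrefs.append((number, f"#page{number}"))
    return root, hrefs


doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>a</p><hr/><p>b</p><hr/><p>c</p>"
    "</body></html>"
)
root, hrefs = mark_page_boundaries(doc)
print(hrefs)
```

Because the selector is a parameter, the same function would work for OCR output that marks pages with something other than <hr>, for example a `div` with a particular class. Note that a trailing boundary after the last page would still get an id for a page that does not exist.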

setzer · 04-23-2009, 10:41 AM

Hellooooo

I also need some help with "page-map"...
I've read the ePub Best Practices guide and generally understand how it works.
BUT I would like to know where the code has to be put?!
I think it will be its own file, but how should it be named?
What's the file type of this thing?

If anyone knows... I would be very happy for an answer.

thanks ant sorri foa mey bed englisch :P

martin-a · 10-31-2009, 02:51 PM

Sorry to reopen such an old thread... but it seems to be an ongoing topic.

Hope I didn't miss it, but particularly to @Bierkonig: is your interest more in preserving which page particular content is on, i.e. page 586 should have the same content in the paper and eBook, or (maybe also) in having each eBook page display as a single screenful?

In the first case, you only need to replicate the structure (possibly down to sentence accuracy, e.g. for religious texts like the Bible) in some way and then possibly accept that a particular paper page displays as 2 screens.

I just think that structural accuracy for reference and display accuracy are two issues...

Thoughts welcome!

