MobileRead Forums - View Single Post - Any way to force page breaks when converting HTML to EPUB

Bierkonig · 01-21-2009, 03:50 PM

Yes, I'm talking about the page numbers AdobeDE uses to delimit the text, not pages as "screens full of text". Screens obviously change when you change font sizes; the location of the page numbers within the document does not.

AdobeDE is turning the HTML OCR output from 4 scanned pages into 5 pages of ePub. I want to figure out whether there's a way to build a document (from the HTML OCR source) where those 4 pages end up as a 4-page document, with the page breaks where the original page breaks were. Currently those page breaks are denoted by <hr> in the HTML output from the OCR.

I'm not working with 4- or 5-page documents but with 2000- and 3000-page reference manuals. I want to be able to go to page 1773 in the reader and see the same sentence that is at the top of page 1773 of the scanned paper. The pages in those manuals contain too much text to fit on a single Reader screen unless the font were 6pt, so I want to read a few screens full of text for a single page of scanned input and then, without any blank space, start the next page of scanned input (with the appropriate page number in the right margin).

I know that nearly absolute control over page breaks (page content) is a feature of PDF. But PDF is so inefficient and slow, and ABBYY FineReader's HTML output of the OCR reflows much better than its PDF output.

I'm reading the pages on page maps in the ePub best practices document with interest, but I think I'm a little over my head in terms of implementation.
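
For what it's worth, here is a rough sketch of how the <hr> markers might be turned into page-map entries. It rests on several assumptions: that the OCR output is one HTML file (called content.html here, a made-up name), that every <hr> marks the start of the next scanned page, and that the page map is the Adobe-style page-map.xml with one <page name="..." href="..."/> entry per page, which is my reading of the format and should be checked against the best practices document. The function name mark_pages and the page_N anchor ids are invented for illustration.

```python
import re

# Matches <hr>, <HR />, <hr class="..."> and similar, but not e.g. <href>.
HR = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)


def mark_pages(html, first_page=1, href="content.html"):
    """Swap each <hr> for an invisible anchor and build page-map XML.

    Assumption: the text before the first <hr> is page `first_page`,
    and each <hr> starts the next page. Returns (marked_html, page_map).
    """
    # The first page has no <hr> in front of it, so it points at the file itself.
    entries = ['  <page name="%d" href="%s"/>' % (first_page, href)]
    page = first_page

    def to_anchor(_match):
        nonlocal page
        page += 1
        anchor_id = "page_%d" % page
        entries.append('  <page name="%d" href="%s#%s"/>' % (page, href, anchor_id))
        # An empty anchor takes no space, so the text flows on without a gap.
        return '<a id="%s"></a>' % anchor_id

    marked = HR.sub(to_anchor, html)
    page_map = "\n".join(
        ['<page-map xmlns="http://www.idpf.org/2007/opf">'] + entries + ["</page-map>"]
    )
    return marked, page_map


if __name__ == "__main__":
    sample = "<p>end of one page</p><hr><p>top of the next page</p>"
    marked, page_map = mark_pages(sample, first_page=1772)
    print(marked)
    print(page_map)
```

The marked HTML would go into the ePub as the content file, and the page-map text would be saved as page-map.xml and referenced from the OPF. As far as I understand it, that reference is a manifest item plus a page-map attribute on the spine, but that part is not shown here and would need checking too.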

Thanks very much for any further guidance.
