#16 | |
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
|
Quote:
If you mean the former, then there's nothing to be done really, other than use a fixed-page-oriented format like PDF instead. If you mean the latter, that's where the Adobe page-map stuff will help you out[*]. You'll need to do some custom modification of your markup and/or conversion process, but I'd be willing to at least help get you started.
[*] Although googling for more info on it revealed that it's given some IDPF people a case of the hissy fits and they want it to die die die. And I can't really blame them, as it seems that NCX already supports almost exactly the same information.
#17 | ||
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
|
Quote:
Quote:
#18 | |
creator of calibre
Posts: 45,410
Karma: 27757236
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Quote:
I suspect the reason for disallowing scripting in the current iteration of the spec is simply to not set the bar too high for viewers. But who knows...
#19 |
Member
Posts: 22
Karma: 10
Join Date: Dec 2008
Device: Sony PRS-700
|
Yes, I'm talking about the page numbers AdobeDE uses to delimit the text rather than pages as "screens full of text" (which obviously change when you change font sizes, though the location of page numbers within the document does not).

AdobeDE is turning the HTML OCR from 4 scanned pages into 5 pages of ePub. I want to figure out if there's a way to build a document (from HTML OCR source) where those 4 pages end up as a 4-page document and the page breaks are where the original page breaks were. Currently those page breaks are denoted as <hr> in the HTML output from the OCR.

I'm not actually working with 4- or 5-page documents but rather 2000- and 3000-page reference manuals. I want the ability to go to page 1773 within the document in the reader and read the same sentence that would be at the top of page 1773 of the scanned paper. The pages in those manuals contain too much text to read on a single Reader page unless it were at 6pt font, so I want the ability to read a few screens full of text for a single page of scanned input and then, without any blank space, start the next page of scanned input (with the appropriate page number in the right margin).

I know that nearly absolute page-break (page content) control is a feature of PDF. But PDF is so inefficient and slow, and ABBYY FineReader's HTML output of the OCR is much, much better in reflowable formatting than the PDF output. I'm reading the pages on page-map in the ePub best practices document with interest, but I think I'm a little bit over my head in terms of implementation. Thanks very, very much for any further guidance.
#20 | |
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
|
Quote:
Can you tell from examining the HTML ABBYY FineReader produces how it's indicating the beginning/end of pages? If it has a standard, simple way of doing it, I might write a general-purpose tool for adding the page-map (and/or NCX pageList).
#21 |
Member
Posts: 22
Karma: 10
Join Date: Dec 2008
Device: Sony PRS-700
|
Alas, no programming languages, but I'm getting a little better at adapting found code as a template.
The form of the ABBYY output is very straightforward....

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=WINDOWS-1252">
<meta name="generator" content="ABBYY FineReader 9.0">
<meta name="author" content="">
<meta name="description" content="">
<meta name="keywords" content="">
<title></title>
<style type="text/css">
table.main {}
tr.row {}
td.cell {}
div.block {}
div.paragraph {}
.font0 { font:6.00pt "Arial", sans-serif; }
.font1 { font:40.00pt "Arial", sans-serif; }
.font2 { font:5.00pt "Arial Narrow", sans-serif; }
.font3 { font:6.00pt "Arial Narrow", sans-serif; }
.font4 { font:7.00pt "Arial Narrow", sans-serif; }
.font5 { font:8.00pt "Arial Narrow", sans-serif; }
.font6 { font:11.00pt "Arial Narrow", sans-serif; }
.font7 { font:12.00pt "Arial Narrow", sans-serif; }
.font8 { font:13.00pt "Arial Narrow", sans-serif; }
.font9 { font:15.00pt "Arial Narrow", sans-serif; }
.......
</style>
</head>
<body>
<p></p>
<p><span class=font9>CHAPTER I</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<hr>
<p></p>
<p><span class=font9>CHAPTER I</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<hr>
<p></p>
<p><span class=font9>CHAPTER I</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<p><span class=font9>text</span></p>
<hr>
<p></p>
<p><span class=font9>CHAPTER 2</span></p>
<p><span class=font6>text</span></p>
<p><span class=font3>text</span></p>
<p><span class=font4>text</span></p>
<p><span class=font2>text</span></p>
<hr>
<p><span class=font9>text</span></p>
<p><span class=font8>text</span></p>
<p><span class=font4>text</span></p>
<p><span class=font9>text</span></p>
<hr>

Thus: that would be pages 1-5. Each chapter begins with <p></p>; each page break is represented by <hr>. That's it.
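Given the structure described above (each scanned page ends with an `<hr>`), the page boundaries could be turned into link targets with a simple text substitution. This is only a rough sketch, not a tested converter: the function name `add_page_anchors`, the `pageN` id scheme, and the use of an empty `<div>` as the target are all assumptions made for illustration.

```python
import re

# Matches <hr>, <HR>, <hr/> and <hr /> as emitted by typical HTML exporters.
HR = re.compile(r"<hr\s*/?>", re.IGNORECASE)


def add_page_anchors(body_html, first_page=1):
    """Replace <hr> page separators with numbered anchor targets.

    Returns the rewritten markup and the number of pages found.
    The id scheme "pageN" is hypothetical; any unique ids would do.
    """
    chunks = HR.split(body_html)
    # A trailing <hr> leaves one empty chunk at the end; drop only that one,
    # so blank pages in the middle keep the numbering in step with the paper.
    if chunks and not chunks[-1].strip():
        chunks.pop()
    out = []
    for offset, chunk in enumerate(chunks):
        page = first_page + offset
        # Each chunk is one scanned page; put its target right before it.
        out.append('<div id="page%d"></div>%s' % (page, chunk))
    return "".join(out), len(chunks)
```

Run over the `<body>` content of the sample above, this would produce targets `page1` through `page5` with no visible break between pages, which a page-map or NCX pageList entry could then point at.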
#22 |
creator of calibre
Posts: 45,410
Karma: 27757236
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@llasram If you want to do this, the best way would be to add another option to html2epub, --page-boundaries, that would accept an XPath selector.
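To illustrate what a selector-driven option like this might do (the `--page-boundaries` option is only a proposal in this thread, and nothing below is calibre code), here is a sketch using Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. The function name `mark_page_boundaries` and the `pageN` id scheme are hypothetical, and the input is assumed to be well-formed XHTML, so ABBYY's HTML 4 output would need tidying first.

```python
import xml.etree.ElementTree as ET


def mark_page_boundaries(xhtml, xpath=".//hr", start=1):
    """Give every element matched by `xpath` a page id, in document order.

    Returns the serialized document and a list of (page name, fragment)
    pairs that a page-map or NCX pageList generator could consume.
    """
    root = ET.fromstring(xhtml)
    targets = []
    for offset, element in enumerate(root.findall(xpath)):
        page = start + offset
        # Hypothetical id scheme; overwrites any id already on the element.
        element.set("id", "page%d" % page)
        targets.append((str(page), "#page%d" % page))
    return ET.tostring(root, encoding="unicode"), targets
```

When the matched elements are separators such as `<hr>`, each match marks the start of the following page, so a caller would probably pass `start=2` and add an entry for page 1 pointing at the top of the file.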
#23 |
Enthusiast
Posts: 33
Karma: 264
Join Date: Mar 2009
Device: Sony PRS-505, Amazon Kindle2, Palm, iPhone
|
Hellooooo
I also need some help with "page-map"... I've read the ePub Best Practice document and generally I understand how it works. BUT I would like to know where the code has to be put?! I think it will be its own file, but how should it be named? What's the file type of this thing? If anyone knows... I would be very happy for an answer. thanks ant sorri foa mey bed englisch :P
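For what it's worth, my understanding of Adobe's page-map extension is that the map is a separate XML file inside the ePub, listed in the OPF manifest and referenced from the spine. The sketch below builds such a file in Python. The file name `page-map.xml`, the manifest id `map`, and the function name `build_page_map` are assumptions for illustration, and the OPF wiring shown in the comments should be checked against Adobe's documentation before relying on it.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"

# Assumed OPF wiring (verify against the Adobe guide):
#   <item id="map" href="page-map.xml"
#         media-type="application/oebps-page-map+xml"/>   in <manifest>
#   <spine toc="ncx" page-map="map">                       on <spine>


def build_page_map(targets):
    """Build page-map XML from (page name, href) pairs.

    Each href points at a content file plus an anchor,
    for example "book.html#page1773".
    """
    # Serialize the OPF namespace as the default namespace (no prefix).
    ET.register_namespace("", OPF_NS)
    root = ET.Element("{%s}page-map" % OPF_NS)
    for name, href in targets:
        ET.SubElement(root, "{%s}page" % OPF_NS, {"name": name, "href": href})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_page_map([("1", "book.html#page1"), ("2", "book.html#page2")]))
```

The returned string would be saved as its own file next to the OPF; the `name` values are what the reader shows as page numbers, so they can follow the printed book rather than the screen count.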
#24 |
Junior Member
Posts: 2
Karma: 10
Join Date: Oct 2009
Location: New York, US
Device: none
|
Sorry to reopen such an old thread... but it seems to be an ongoing topic.
Hope I didn't miss it, but particularly to @Bierkonig: is your interest more in preserving which page particular content is on, i.e. page 586 should have the same content in the paper and the eBook, or (maybe and) in having each eBook page display as a single screenful? In the first case, you only need to replicate the structure (possibly down to sentence accuracy, e.g. for religious texts like the Bible) in some way and then possibly accept that a particular paper page displays as 2 screens. I just think that structural accuracy for reference and display accuracy are two separate issues... Thoughts welcome!