Quote:
Originally Posted by Ryn
All these books have large and essential indexes that need to make it into the epub, preferably linked.
|
In ebooks, "Real Page Numbers" (RPNs) are a pain and aren't as helpful as they seem...
And Linked Indexes, to do them properly, require a massive amount of manual intervention/cleanup.
Over the years, Hitch + I have written an enormous amount on these two topics. See some of the latest discussion from earlier this year:
(And if you want to know more about RPNs + Indexes in ebooks, read/follow all those links to all the other threads where we cover every pro/con from every angle.)
Quote:
Originally Posted by Ryn
As the indexes were created by hand, and outside of indesign, that program does not know how to make them active. It would be really nice if indesign knew how to link page numbers, which it self-creates, to a list of page numbers in the same document. Sadly, it does not.
|
Even if you created InDesign Indexes, InDesign still doesn't export all the proper page numbers (pageList + page-list) to EPUB.
Back in 2016, David Kudler asked a similar question in
"Getting InDesign to export pagelists to ePub3 (reflowable)".
I pointed to a 2015 article written by Joshua Tallent
"How to Add a page-list to an EPUB" (now dead, so here's an Archive.org backup) + a 2015 article from EPUBSecrets,
"Page List: All the Cool Ebook Developers Are Doing It".
To my knowledge, not much has really changed since.
Many of those methods require you to put some sort of tag/character at the end of each page, then convert that using some outside tool, then manually link all the Index links.
Just a few months ago, I wrote one such method in
"Create index on epub from printed book".
Quote:
Originally Posted by Ryn
[...] This gives me the page numbers, which are carried with all the other header and footer stuff into the word file.
I then did the following:
|
Great, great... I'll give some rolling commentary on your steps.
Quote:
Originally Posted by Ryn
1. Exporting the Word document to ePub with OpenOffice Writer's writer2ePub plugin. This yields an epub with for each page an xhtml document.
|
Okay. A page-per-XHTML-file might be a good way to leave some sort of marker on every page.
Then you could potentially use the method I pointed out above in
"Create index on epub from printed book", then use
Doitsu's "Incremental IDs plugin":
Chapter01:
Code:
<h1>Chapter 1</h1>
<p>This is the beginning.</p>
<p class="pagenum">1</p> <--- Footer exported from the original document.
Convert with Incremental IDs (or Regex):
Code:
<h1>Chapter 1</h1>
<p>This is the beginning.</p>
<span epub:type="pagebreak" id="page1" title="1"/>
Move Top:
Code:
<span epub:type="pagebreak" id="page1" title="1"/>
<h1>Chapter 1</h1>
<p>This is the beginning.</p>
Note: That <span epub:type> format is EPUB3. In EPUB2, you would use slightly different code.
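If you'd rather script the "Convert" + "Move Top" steps instead of doing them by hand, here is a minimal Python sketch. It assumes one page per XHTML file and a footer that looks exactly like the `<p class="pagenum">` example above. The function name `footer_to_pagebreak` is just made up for illustration, and a real export will likely need a looser pattern.

```python
import re

# Matches the exported footer paragraph, e.g. <p class="pagenum">1</p>.
# The class name "pagenum" comes from the example above; adjust it to your export.
PAGENUM_RE = re.compile(r'\s*<p class="pagenum">\s*(\d+)\s*</p>')


def footer_to_pagebreak(page_markup: str) -> str:
    """Replace one page's footer number with an EPUB3 pagebreak span at the top."""
    match = PAGENUM_RE.search(page_markup)
    if match is None:
        return page_markup  # no footer found, leave the page untouched
    number = match.group(1)
    span = f'<span epub:type="pagebreak" id="page{number}" title="{number}"/>'
    # Drop the footer paragraph, then put the span ahead of everything else.
    body = PAGENUM_RE.sub("", page_markup, count=1)
    return span + "\n" + body.lstrip("\n")


page = '<h1>Chapter 1</h1>\n<p>This is the beginning.</p>\n<p class="pagenum">1</p>'
print(footer_to_pagebreak(page))
```

This only handles the markup between the body tags of a single page. Run it per file before merging, so each span stays attached to the page it came from.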
Quote:
2. In Sigil, regexing the page numbers of the book to a self-closing <a> tag with the proper id - like pxxx - and moving them into the topmost paragraph tag of the page.
|
That works too... If you are able to export reliable headers/footers.
Although that's usually a lot of extra cruft you have to clean up and sift through.
Depending on the document, it may be best to trash all the header/footers (or not export them at all), then renumber from scratch using some other tools.
Quote:
3. Then merging all the pages that are not the index into one xhtml file. Doing the same for the pages that are part of the index.
|
Yep, this is great. Let Sigil/Calibre do the hard work for you:
1. You merge the entire book into one or two monolithic XHTML file/s.
Let's call them:
2. You can then add your <a>s around all your page numbers in your Index:
Index (Before):
Code:
<p>Dogs, 1</p>
Index (After Regex):
Code:
<p>Dogs, <a id="index-dog-1" href="../Text/merged.xhtml#page1">1</a></p>
3. Then use Sigil to split the XHTML files into their individual chapters, and all the anchors will be automatically updated to the files/chapters they belong to:
Code:
<p>Dogs, <a id="index-dog-1" href="../Text/Chapter01.xhtml#page1">1</a></p>
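The regex in step 2 can be sketched in Python as well. This is a rough illustration, not the exact expression I'd use in Sigil: it assumes each Index line is a plain paragraph whose only digits are page numbers, it links to the `merged.xhtml` name from the example, and it skips the `id` attributes. The helper name `link_page_numbers` is hypothetical.

```python
import re

# File name taken from the example above; change it to match your merged file.
MERGED_FILE = "../Text/merged.xhtml"


def link_page_numbers(index_line: str, target: str = MERGED_FILE) -> str:
    """Wrap every standalone number in a link to the matching pageN anchor."""
    def make_link(match: re.Match) -> str:
        number = match.group(1)
        return f'<a href="{target}#page{number}">{number}</a>'

    # \b keeps digits glued to letters (like the 1 in "h1") from matching.
    return re.sub(r"\b(\d+)\b", make_link, index_line)


print(link_page_numbers("<p>Dogs, 1, 12</p>"))
```

Anything beyond simple lists of numbers (ranges, notes, cross-references) needs its own handling, so treat this as the easy first pass.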
Quote:
4. Iterating through the index file, using regexes to find page numbers and link them to the proper anchor inside the book document. There are ins and outs to this, that I will gloss over here.
|
Yeah, it's a complicated mess.
Forms like "381–385" vs. "381–5" OR "385n10".
Indexes are extremely information dense and come in many variations, and usually it's not just a simple page number.
I went into some details in the "Create Index [...]" topic above. (Plus definitely in the famous "Real Page Numbers" topics.)
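To give a feel for why these variations matter, here is a small Python sketch that normalizes the three forms mentioned above into (first page, last page, note number). The name `parse_locator` and the rule for expanding a shortened range end are my assumptions for illustration. Real indexes have plenty more variants this does not cover.

```python
import re

# One locator: a page, an optional range end after an en dash or hyphen,
# and an optional note suffix such as "n10".
LOCATOR_RE = re.compile(r"^(\d+)(?:[–-](\d+))?(?:n(\d+))?$")


def parse_locator(text: str):
    """Return (first_page, last_page, note) or None if the form is unknown."""
    match = LOCATOR_RE.match(text.strip())
    if match is None:
        return None
    start_s, end_s, note_s = match.groups()
    start = int(start_s)
    if end_s is None:
        end = start
    elif len(end_s) < len(start_s):
        # Shortened range end: "381–5" borrows the leading "38" from the start.
        end = int(start_s[: len(start_s) - len(end_s)] + end_s)
    else:
        end = int(end_s)
    note = int(note_s) if note_s else None
    return start, end, note


for sample in ("381–385", "381–5", "385n10"):
    print(sample, parse_locator(sample))
```

Once every locator is in this shape, you only need to link to the first page of each one, and unknown forms (the `None` results) become a to-do list for manual cleanup.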
Quote:
5. Finally, splitting the book up into its logical chapters - usually one xhtml file per chapter, same with the notes etc. Thankfully, Sigil knows how to manage the links, once xhtml files are broken up.
|
And once you linkify the Index, just be careful of the ~300KB soft filesize limit for EPUB. Sometimes the Indexes get so large that you have to split them into 2 or more files.
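A quick way to spot files over that limit is to list the sizes inside the EPUB, which is just a zip archive. The sketch below uses Python's standard `zipfile` module. Measuring the uncompressed size is my assumption about how the limit applies, and the function name `oversized_xhtml` plus the in-memory demo archive are made up for illustration.

```python
import io
import zipfile

SOFT_LIMIT = 300 * 1024  # the ~300KB soft limit mentioned above, in bytes


def oversized_xhtml(epub_file, limit: int = SOFT_LIMIT):
    """Return (name, uncompressed size) for each (X)HTML file above the limit."""
    with zipfile.ZipFile(epub_file) as epub:
        return [
            (info.filename, info.file_size)
            for info in epub.infolist()
            if info.filename.lower().endswith((".xhtml", ".html"))
            and info.file_size > limit
        ]


# Demo with a throwaway in-memory archive instead of a real book.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("OEBPS/Text/Index.xhtml", "x" * (SOFT_LIMIT + 1))
    archive.writestr("OEBPS/Text/Chapter01.xhtml", "<p>short</p>")

print(oversized_xhtml(demo))
```

For a real book, pass the path to the .epub file instead of the in-memory buffer. Any Index file that shows up in the list is a candidate for splitting.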
Quote:
6. Sadly, bc of the PDF export, there is a lot of cleaning still to do, as hyphenation is not understood by acrobat's PDF reading and exporting system. Also, there will be headers and/or footers, lots of unwanted hard and soft returns, whitespace, and no styles except (hopefully) italics, bold, super and subscripts.
|
Yep, having a clean source document is the most important step.
I discussed a lot of that back in 2019,
"Workflow for simultaneous EPUB and PDF production?"
Quote:
Also, the index now only links to the page, and not to the proper paragraph or sentence, which might be possible in an ePub, but is now beyond the pale for this project.
|
... yeah, down to the sentence/paragraph-level ain't happening any time soon... Hitch + I have also discussed that one to death.
It would require the Indexer to have access to the actual source files + the perfect mix of skills that very few are even equipped for.
Page numbers and/or Chap.Subchap are about as good as you're going to get.
Side Note: For just a piece of that discussion, look at my 2016
Post #129 from "Sick of Amazon Kindle books without Page Numbers...". I came up with this concept of "Format-Specific" and "Format-Neutral", and I still think it's a genius analysis.