How to export indexes from indesign to epub
Hi all,
I am involved in a project that involves exporting to epub 15-20 books of about 500 pages each. All these books have large and essential indexes that need to make it into the epub, preferably linked. As the indexes were created by hand, and outside of indesign, that program does not know how to make them active. It would be really nice if indesign knew how to link page numbers, which it self-creates, to a list of page numbers in the same document. Sadly, it does not. As one of the source files was not in indesign, but in QuarkXpress, and in a version which I do not have, I have experimented with loading the PDF into Acrobat and exporting it to a Word file. This gives me the page numbers, which are carried with all the other header and footer stuff into the word file. I then did the following:
It is definitely a half-way solution, but one that can be simplified to some extent within Sigil, by using the Saved Searches functionality. The whole index thing took me about two hours, from start to finish, for a 4k entry index. One book down, fourteen (or more) to go. I am posting this as a hack, but in this community it is likely there are minds that can spot weaknesses in my approach. Feel free to shoot holes, and help me out for the remainder of the series. ------ As an edit: for those who come after, looking for a solution, there are valuable comments in the thread. Outstanding ones from my perspective have been the realization that there are indesign plugins that export the page numbers into epub, here: Quote:
https://www.id-extras.com/products/liveindex/ |
Before you get impatient, Jon, and unilaterally declare this as irrelevant to Sigil... don't. I've chosen to leave it here, as there could be Sigil-specific ideas presented for improvement.
|
Quote:
FootnoteLinker, PageList, Incremental IDs. Also check out all other Sigil plugins. |
Quote:
And Linked Indexes, to do them properly, requires a massive amount of manual intervention/cleanup. Over the years, Me + Hitch have written an enormous amount on these two topics. See some of the latest discussion from earlier this year:
(And if you want to know more about RPNs + Indexes in ebooks, read/follow all those links to all the other threads where we cover every pro/con from every angle.) Quote:
Back in 2016, David Kudler asked a similar question in "Getting InDesign to export pagelists to ePub3 (reflowable)". I pointed to a 2015 article written by Joshua Tallent "How to Add a page-list to an EPUB" (now dead, so here's an Archive.org backup) + a 2015 article from EPUBSecrets, "Page List: All the Cool Ebook Developers Are Doing It". To my knowledge, not much has really changed since. Many of those methods require you to put some sort of tag/character at the end of each page, then convert that using some outside tool, then manually link all the Index links. Just a few months ago, I wrote one such method in "Create index on epub from printed book". Quote:
Quote:
Then you could potentially use the method I pointed out above in "Create index on epub from printed book", then use Doitsu's "Incremental IDs plugin": Chapter01: Code:
<h1>Chapter 1</h1>
Code:
<h1>Chapter 1</h1>
Code:
<span epub:type="pagebreak" id="page1" title="1"/>
Quote:
Although that's usually a lot of extra cruft you usually have to clean up and sift through. Depending on the document, it may be best to trash all the header/footers (or not export them at all), then renumber from scratch using some other tools. Quote:
1. You merge the entire book into one or two monolithic XHTML file/s. Let's call them:
2. You can then add your <a>s around all your page numbers in your Index: Index (Before): Code:
<p>Dogs, 1</p>
Code:
<p>Dogs, <a id="index-dog-1" href="../Text/merged.xhtml#page1">1</a></p>
Code:
<p>Dogs, <a id="index-dog-1" href="../Text/Chapter01.xhtml#page1">1</a></p>
Quote:
Forms like "381–385" vs. "381–5" OR "385n10". Indexes are extremely information dense and come in many variations, and usually it's not just a simple page number. I went into some details in the "Create Index [...]" topic above. (Plus definitely in the famous "Real Page Numbers" topics.) Quote:
And once you linkify the Index, just be careful of the ~300KB soft filesize limit for EPUB. Sometimes the Indexes get so large, you have to split them into 2 or more files. Quote:
I discussed a lot of that back in 2019, "Workflow for simultaneous EPUB and PDF production?" Quote:
It would require the Indexer to have access to the actual source files + the perfect mix of skills that very few are even equipped for. Page numbers and/or Chap.Subchap are about as good as you're going to get. Side Note: For just a piece of that discussion, look at my 2016 Post #129 from "Sick of Amazon Kindle books without Page Numbers...". I came up with this concept of "Format-Specific" and "Format-Neutral", and I still think it's a genius analysis. :D |
INDEXES!
"Niagara Falls, slowly I turn, step-by-step..." OH, NEVER MIND. I was trying to put the Abbott & Costello Niagara Falls video in here. If you want to see it, go here: https://www.youtube.com/watch?v=8KpsUlvzbkk Hitch, frequent Index Victim |
Quote:
|
Quote:
Hitch |
Quote:
|
Quote:
First of all, thanks for your extensive reply. Second, I'm not at all interested in having parity between book page numbers and e-book page numbers - RPNs as you call them. I have tried implementing such things from time to time, but have found reader implementation spotty, and can see little added value for either reader or publisher. You mention some alternatives to the steps I have used; I will investigate them further when I have occasion to do so - it's always good to have multiple paths home :) As the rest of this project consists of indesign files, and I generally don't export those to epub page-by-page, I was considering working from the PDFs that indesign outputs. Converting them to word docs with acrobat, and then exporting those to epub using oowriter seems the way to go to easily get the page numbers. It remains a hassle to clean all the cruft out, though. I'd welcome a way to do things more easily through indesign, perhaps using the method you mention where each page gets a special character which Sigil can replace with page-break tags, and then to use the Sigil plugin for serialized ids you mention. As I never use indesign for anything except making epub exports, I would welcome some input as to how to go about this in indesign. It's not my favorite program, although my limited experiences with quark have managed to knock the adobe product off the utmost bottom rank. |
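For anyone who wants to try the "special character at the end of each page" idea outside Sigil, here is a minimal Python sketch. It assumes you typed a marker of your own choosing at every page end before export (the `§§` marker and the function name `add_pagebreaks` are made up for this illustration) and it emits the same span shape quoted earlier in the thread. It is a sketch under those assumptions, not a tested production tool.

```python
import re
from itertools import count

# Hypothetical end-of-page marker; use any string that never occurs in the book.
MARKER = "\u00a7\u00a7"  # "§§"


def add_pagebreaks(xhtml: str, next_page: int = 2, marker: str = MARKER) -> str:
    """Replace each marker with a numbered pagebreak span.

    A marker sits at the END of a printed page, so the first marker
    announces the page that starts right after it (``next_page``).
    """
    numbers = count(next_page)

    def repl(_match: re.Match) -> str:
        n = next(numbers)
        return f'<span epub:type="pagebreak" id="page{n}" title="{n}"/>'

    return re.sub(re.escape(marker), repl, xhtml)


if __name__ == "__main__":
    sample = "<p>End of page one.\u00a7\u00a7 Start of page two.</p>"
    print(add_pagebreaks(sample))
```

Numbering continues across a single call only, so for a book split over several files you would either run it on the merged file or pass the right `next_page` for each file.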
Quote:
Quote:
Remember, a book isn't just pure text, the underlying formatting is just as important. :) PDF is one of the worst input formats there is, and you'll lose much of the original markup + introduce errors and other junk while converting to any other format. It's almost always better to go from:
Source -> EPUB (Directly)
than to do:
Source -> PDF -> Word -> EPUB
where each step in the chain may introduce more issues.
* * *
If InDesign File Is Using Styles
Great. You're going to have an easier job. In InDesign, there's such a thing as Style Mapping:
If InDesign File Is NOT Using Styles
Prepare for pain... :D (This is the more likely scenario, since 99%+ of people who use InDesign/Word/LibreOffice don't know or use Styles when designing documents.) You'll have to manually clean up all the code, and every single book is going to generate wildly different cruft. And boy, oh boy, does InDesign love to generate iBooks-friendly bloat in their CSS.
Side Note: On Styles... I've also written about Why/How Styles are so important, most recently:
I think this is #1 the most important step there can be. Clean input helps EVERY single step down the line. If people designed their documents with Styles+Accessibility in mind first, it would make everyone's life much easier. :) (While steps between programs are different, the Styles concept is similar across all.) Quote:
~100% of the InDesign work I get is... directly formatted... so it's a mess. I've only met one designer who actually used InDesign with proper Styles. Quote:
If not using RPNs, then what's the clickable links you're trying to accomplish in the Index? Are you trying to do a: Code:
Cats, [1], [2], [3]
* * *
But RPNs do serve some purpose, especially for Accessibility reasons (blind readers) + citations, book clubs, etc. And for Linked Indexes, page #s seem to make a lot more sense. Quote:
In your favorite search engine, type: Code:
many-to-one Hitch site:mobileread.com |
Hi Tex. Apologies for my late answer. Somehow, I didn't get notified of your reply, and as I do not visit MR every day, here we are.
I do use indesign for limited things like exporting files to epub, and am familiar with the mechanics and best practices of that route. Of course, it is often way easier to go the direct route than through PDF. Perhaps my question was not as clear as it could have been. What I really wanted to know was: how do you export the page numbers in an indesign export to epub? It's not an option in the export dialogue box, nor is it something I can easily put together using indesign's byzantine search module. Is there another way? It would be nice, seeing as most of the volumes in this project are in indesign. The reason I went the PDF route in the OP was bc that particular volume was created in some quark version I do not possess. |
Quote:
Hitch |
Quote:
Edit: not one index in fact but dozens, in a big project. |
I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.
http://epubsecrets.com/why-i-use-page-list-and-how.php http://epubsecrets.com/page-list-all...e-doing-it.php The link to the script is dead, so I'm listing it from web archive: http://web.archive.org/web/201912181...orohikoscripts You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or Linkedin), which AFAIK includes the PageStaker and EPUBOgrify script. The latter is not so important anyway, because it is a simple change that can be done in Sigil. |
Quote:
You don't. There isn't any easy or magic or automatic way to export the RPNs (Real Page Numbers). You create them manually. You open up the ePUB; you open up the PDF. You find the first page-end. You search for that bit of text--typically, 5-10 characters will do. When you find it, you create the anchor, like P01, P02, etc. Then, after all the anchors are done: then you write a script if you're lucky--or do it manually if you aren't--that links all the index entries that go to page 1 to P01, all the index entries that go to page 2 to P02, and so forth. That's it. Knowing Tex, he has some mad coding that will do some of this more easily than I've described, but that's the fundamental process, right there. And it's entirely possible that there are Sigil or Calibre addins that already do the 2nd part, the linking part, that I don't know about, as my band of Merry Minions use our internal, proprietary clips/programs to do that. That's the basic procedure. Hitch |
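The second half of that procedure (linking index entries to page anchors) can be sketched in Python. This is only an illustration under stated assumptions: every index entry is a single-line `<p>…</p>`, the locators come last in the entry, anchors are named `page1`, `page2`, … in a file called `../Text/merged.xhtml` as in the example earlier in the thread, and ranges or note references ("381–385", "381–5", "385n10") link to their first page. The function names are invented for the example.

```python
import re

# One locator: 12, 381–385, 381–5, 381-5 or 385n10
LOCATOR = r"\d+(?:[–-]\d+|n\d+)?"
# The run of locators that closes an entry, e.g. "1, 381–5, 385n10"
TAIL = re.compile(rf"{LOCATOR}(?:,\s*{LOCATOR})*\s*$")
# Same locator shape, but capturing the first page number
FIRST_PAGE = re.compile(r"(\d+)(?:[–-]\d+|n\d+)?")
ENTRY = re.compile(r"<p>(.*?)</p>")


def linkify_entry(inner: str, target: str = "../Text/merged.xhtml") -> str:
    """Wrap every trailing locator of one index entry in a link."""
    tail = TAIL.search(inner)
    if tail is None:
        return inner  # e.g. "Felines, see Cats" has no page locators

    def wrap(m: re.Match) -> str:
        return f'<a href="{target}#page{m.group(1)}">{m.group(0)}</a>'

    return inner[: tail.start()] + FIRST_PAGE.sub(wrap, tail.group(0))


def linkify_index(xhtml: str, target: str = "../Text/merged.xhtml") -> str:
    """Apply linkify_entry to every single-line <p> entry."""
    return ENTRY.sub(lambda m: f"<p>{linkify_entry(m.group(1), target)}</p>", xhtml)


if __name__ == "__main__":
    print(linkify_index("<p>Dogs, 1, 381–5, 385n10</p>"))
```

Known weak spot: a heading that itself ends in a number directly before the locators ("World War 2, 45") will have that number linked too, so the output still needs a human pass. The `id` attributes shown in the earlier example (for back-links) are left out here.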
Quote:
Hitch |
I have not tried this (and frankly even if it works it still sounds pretty miserable) but I wonder if you could work page-by-page unlocking whatever master page element includes the page number, and then use a plugin such as https://www.rorohiko.com/wordpress/i...ds/textstitch/ to auto-thread the page numbers into the document flow.
|
Quote:
:2thumbsup Hitch |
If you have a PDF of the printed version of the book, you should be able to print it to a postscript file and use python on that postscript file to extract the page numbers plus the first n words and last n words on each page (where n is small, say 3), and save that info to a file. Then use sed or some other stream editor with that info to insert the markers you want in each html file.
Some custom programming in python might be needed but should be reusable for future projects. |
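A rough Python sketch of the "page number plus first/last n words" record. It does not parse postscript; it assumes you already have plain text in which pages are separated by form-feed characters and the page number, when present, is the last non-empty line of the page. Both are assumptions about your extraction step, and the function name is made up.

```python
def page_fingerprints(text: str, n: int = 3) -> list[dict]:
    """Return one record per page: printed label, first n words, last n words.

    Assumes pages are separated by form feeds ("\\f") and that a page
    number, if there is one, is the last non-empty line of its page.
    """
    records = []
    for page in text.split("\f"):
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        label = None
        if lines and lines[-1].isdecimal():
            label = int(lines.pop())  # drop the footer line from the body text
        words = " ".join(lines).split()
        records.append({"label": label, "first": words[:n], "last": words[-n:]})
    return records


if __name__ == "__main__":
    sample = "The quick brown fox jumps\nover the lazy dog\n1\f\fSecond page text here\n3"
    for record in page_fingerprints(sample):
        print(record)
```

Empty pages come back with `label` set to `None` and empty word lists, so they stay in the sequence and can be dealt with afterwards instead of silently shifting the count.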
Quote:
Quote:
Quote:
Quote:
(Typical Hitch, never reading anything I write! :rofl:) Anyway, it's not anything I've ever done in an actual EPUB, just theoretical musings. All the logic is sound though. :D Luckily, I haven't had to do an Index in a very long time. |
Quote:
Quote:
I can see some potential problems with this, as the page numbers are on the bottom of the pages, and some pages are empty, which may confuse the issue, but that might be something I could prompt for. Quote:
-- Food for thought here folks, thanks a lot! |
Yes, a postscript (.ps) file is typically generated by a postscript printer driver (it is what is spooled and sent to a postscript compatible printer) and contains the commands in text that the printer uses to make it actually print. There are many opensource (ghostscript, cups, pdf2ps on Linux) and commercial tools that can do this (Adobe Acrobat Pro). A ps file is very similar to a pdf file that has been unencoded and uncompressed.
Extracting the info you need from one should be relatively easy. As for missing page labels (or numbers) these can easily be filled in and interpolated correctly from known page labels or values using a simple spreadsheet or software as it is a one-to-one mapping. If you can use pdf2ps on your machine it should be easy enough to look at the postscript file in a text editor and look for "showpage" and decide for yourself how hard it would be to extract just what you need. |
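The interpolation of missing page labels does not even need a spreadsheet. A minimal Python sketch, assuming Arabic page numbers that increase by one per physical page (so Roman-numbered front matter would need separate handling); `fill_labels` is a name invented here:

```python
def fill_labels(labels: list) -> list:
    """Fill in missing (None) page labels from the nearest known one.

    Relies on the one-to-one mapping: moving k pages forward or back
    changes the printed number by exactly k.
    """
    known = [(i, v) for i, v in enumerate(labels) if v is not None]
    if not known:
        raise ValueError("need at least one known page label")

    filled = []
    for i, value in enumerate(labels):
        if value is None:
            j, ref = min(known, key=lambda kv: abs(kv[0] - i))
            value = ref + (i - j)
        filled.append(value)
    return filled


if __name__ == "__main__":
    print(fill_labels([None, None, 5, 6, None]))
```

If two known labels disagree about the offset (say, an unnumbered plate section in between), this picks the nearest one, so it is worth spot-checking a few filled values against the PDF.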
Quote:
Anyway, it's not anything I've ever done in an actual EPUB, just theoretical musings. All the logic is sound though. :D
And there it is. :-) That stuff is just bloody tedious. I think it would be fun to write programming or clips, etc., to do it...but HAVING to do it, commercially, is the dog's south end. Quote:
Hitch |
Quote:
Acrobat also exports to txt, rtf, doc, docx etc, so I could imagine writing a python script that analyses such a file. That might give me the strings I could use to iterate through an epub html file and add anchor tags, that I could then link the index entries to. I'd need to account for whitespace, the existence of potential html tags, and other things, probably, but this seems relatively straightforward. With some sophistication - as the indices feature page ranges and note numbers, too - I might be able to automate the whole thing. Seeing as there are thousands of pages - and thousands of index entries per volume - here, it's definitely worth a try. edit: oh, no, scratch that. There are footnotes, a lot of them, clouding the issue in the PDF2xxx output, which I need to disregard, without losing the numbered lists. Also, the page numbers are in the footers. And of course there are also headers, which I should also disregard. At this point, perhaps I am better off just working from the PDF in the first place, which is not so bad all things considered. |
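For the "account for whitespace and potential html tags" part, one possible approach is to turn the last few words of a printed page into a regex that tolerates whitespace and inline tags between the words, and to insert the anchor only when the match is unique. A hedged Python sketch (the function name and the `P02`-style id are illustrative; entities such as `&amp;` and words hyphenated across lines are not handled):

```python
import re

# Whitespace and/or inline tags that may sit between two words in the XHTML
GAP = r"(?:\s|<[^>]+>)+"


def insert_anchor(xhtml: str, last_words: list, page_id: str):
    """Insert an empty anchor right after the last words of a printed page.

    Returns (new_xhtml, True) on success. If the words are not found, or
    are found more than once, the text is returned unchanged with False
    so the page can be handled by hand.
    """
    pattern = re.compile(GAP.join(re.escape(w) for w in last_words))
    matches = list(pattern.finditer(xhtml))
    if len(matches) != 1:
        return xhtml, False
    end = matches[0].end()
    return xhtml[:end] + f'<a id="{page_id}"></a>' + xhtml[end:], True


if __name__ == "__main__":
    html = "<p>over the <i>lazy</i>\n dog. Next page starts here.</p>"
    print(insert_anchor(html, ["the", "lazy", "dog."], "P02"))
```

Refusing to guess on zero or multiple matches is deliberate: with footnotes and running heads polluting the extracted text, a list of pages to fix manually is safer than anchors in the wrong place.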
@Ryn: Do you know that Sigil has a built-in index tool? To use it, all you have to do is generate a plain text file with index entries, e.g. index.txt, and select the following:
|
Quote:
Thing is, these are custom-built indexes with thousands of entries per volume. I doubt I could do remotely as good a job as the original indexers, who likely spent upward of forty hours on each index. Needless to say, these books were not created to make any profit whatsoever, and that also goes for the digital edition we're currently putting together. The foundation which has enlisted my help has as a core value the dissemination of these texts, and keeping them safely available for future generations. I generally dissuade clients from including indexes, but in this case I am willing to make an exception. And I personally resonate with the subject, so my participation is not a chore at all. That being said, I dislike unnecessary monotonous labor as much as most people, if not more, so being smart about it and using tech to my advantage, I'm all for that! |
Quote:
Thanks Becky! Now I just need to write me some python logic to rid myself of the task of manually linking the index to the pages. But that's the fun part :) Laura mentioned another script that might actually serve my purpose even better: LiveIndex, found here: https://www.id-extras.com/products/liveindex/ I'm mentioning it, just in case anyone else ever comes across a similar use case. |
I see you found your solution but for the record there are ps2txt and ps2ascii you can use to display these as well as this useful article I used in the past:
https://www.cs.waikato.ac.nz/~ihw/pa...tract-Text.pdf It prepends a short and sweet extra postscript function to the original postscript which redefines the show methods to give you text output that would be easier to parse. FWIW, I find working with "ps printer device" can extract text electronically that is very hard to get to in other ways without a scanner. |
And you need to worry that you are linking to the top of a "printed" page and not the word itself, which may appear nowhere on the actual screen, as that "printed page" will generally be much longer than the screen holds. So it will just get them close at best.
In many ways, a good search function replaces the need for indexes almost completely. Quote:
|
Quote:
But... Searching is active, and it presupposes you know exactly what you are searching for, whereas you might not always know what you don't know. An Index, otoh, has done this work for you, and then some. A good index will have collated different locations pertaining to the way "angular momentum" pertains to "diesel engines," for example. (Not even sure that that is a thing, but allow me the liberty.) This passive searching allows for a deeper sense of discovery in books that are more encyclopedic in scope. Not relevant to the vast majority of books that reach our devices, I would be the first to agree, but in some cases, very much a desirable addition. |
Quote:
|
I've always felt that traditional indices in ebooks were a bit pointless. Anachronistically so. If print book makers could have made pages automatically turn and words on those pages to glow simply by saying a word aloud, then they'd have done so, and print indices would probably never have become a thing. And we wouldn't now be seeing people being forced by client dollars to try to simulate what a simple search engine can do with hypertext markup and millions of hardcoded links to and fro.
But I digress. ;) P.S. I've heard the "if you don't know what you need to search for" argument before, and I don't quite buy it. People who have no idea what they're looking for typically aren't looking for anything. And even if they were, manually wading through enormous, alphabetized, electronic indices is unlikely to focus their efforts very much. |
Quote:
In a way, I was plumbing new depths of understanding of the world I was born into. Not by the focused act of searching, but by discovery. I think some - not many, but some - books lend themselves to that wideranging form of learning. I think for instance, that people who are strongly motivated to deepen their understanding of their religion are much helped by guidance in the form of an index, or similar means. |
Quote:
Quote:
The fact of the matter is: only print index lovers love electronic indices. *shrug* |
But I'm not here to discourage anyone from their electronic "indexical" pursuits. Just rambling, really. They're no skin off my back. :)
|
Quote:
Another example would be an atlas, which many people don't use to find stuff but rather to educate themselves, to wander, or to heuristically discover new territory. I have yet to meet an atlas I have memorized, and use google earth to much the same effect. Granted, indexes are clunky, abstract, and often quite subjective. Still, in the correct contexts, be it technical works, religious tomes, or dictionaries (basically one immense index), they can be quite useful. The fact that you "never don't know what [you] want to search for in a particular book anymore" really is neither here nor there. |
Quote:
You may not agree with it, but it's just as valid and relevant as your assertion that electronic indices can't always be replaced with a search engine. Quote:
|
Quote:
Index nerd here! I love me some indices and EVEN WHEN it's in an ebook, where, indeedy, it's useless, I like the fact that I can assess the index and see how many references to topic A were worth mentioning in the index; how many references to John Doe and so forth. Quote:
Hitch |
Quote:
Quote:
Quote:
Quote:
Another part of my problem with electronic indices and concordances stems from the fact that their entire reason for being has been changed entirely in the electronic medium shift. They went from being purely reference-based, to purely navigation-based. Navigation aids I don't need. Page-turns and searching suffice. |
Quote:
Quote:
If it's a project I'm working on from scratch, I insist on unlinked indexes. :) Quote:
Quote:
Take this for example: Code:
famous philosophers
Search (in ebooks) also doesn't typically match related words like: "philosophy" or "philosophies" or "philosophical". A good Indexer would be able to pre-categorize + organize the information, throwing out a lot of the "irrelevant hits", while at the same time combining all those "related words" together. And as Hitch said, you could use the index to get a very broad overview of WHAT information is covered in a given book. Even the size of the entries can tell you how "important" an author thinks a topic is. For example, the author may consider Aristotle to be more important than Aquinas (4 vs. 1). Note: Me + Hitch (and others) discussed the pros/cons of Indexes/Search at extreme length in the 2016 "Sick of Amazon Kindle books without Page Numbers" thread. Quote:
Absolutely fantastic title. When I first heard of it, I thought: "Who the heck doesn't know how to read a book?" Well, I didn't know... I didn't know... :D And it completely changed the way I read Non-Fiction + view Indexes. Here's one blog article also discussing the book: "How to Read a Book: The Ultimate Guide by Mortimer Adler" * * * And here's a relevant excerpt of Chapter 4, "The Second Level of Reading: Inspectional Reading": Spoiler:
Even just skimming an Index (or well-designed Table of Contents) can give you lots of helpful information. This is why I mostly don't mind leaving unlinked indexes in ebooks (they don't hurt, and can only help, even in ways that pure search can't accomplish). |