MobileRead Forums

How to export indexes from indesign to epub (https://www.mobileread.com/forums/showthread.php?t=334966)

Ryn 11-20-2020 05:00 AM

How to export indexes from indesign to epub
 
Hi all,

I am working on a project that involves exporting 15-20 books of about 500 pages each to epub.

All these books have large and essential indexes that need to make it into the epub, preferably linked.

As the indexes were created by hand, and outside of indesign, that program does not know how to make them active. It would be really nice if indesign knew how to link page numbers, which it self-creates, to a list of page numbers in the same document. Sadly, it does not.

As one of the source files was not in indesign, but in QuarkXpress, and in a version which I do not have, I have experimented with loading the PDF into Acrobat and exporting it to a Word file. This gives me the page numbers, which are carried with all the other header and footer stuff into the word file.

I then did the following:
  1. Exporting the Word document to ePub with OpenOffice Writer's writer2ePub plugin. This yields an epub with one xhtml document per page.
  2. In Sigil, regexing the page numbers of the book to a self-closing <a> tag with the proper id - like pxxx - and moving them into the topmost paragraph tag of the page.
  3. Then merging all the pages that are not the index into one xhtml file. Doing the same for the pages that are part of the index.
  4. Iterating through the index file, using regexes to find page numbers and link them to the proper anchor inside the book document. There are ins and outs to this, that I will gloss over here.
  5. Finally, splitting the book up into its logical chapters - usually one xhtml file per chapter, same with the notes etc. Thankfully, Sigil knows how to manage the links, once xhtml files are broken up.
  6. Sadly, because of the PDF export, there is a lot of cleaning still to do, as hyphenation is not understood by Acrobat's PDF reading and exporting system. Also, there will be headers and/or footers, lots of unwanted hard and soft returns, whitespace, and no styles except (hopefully) italics, bold, super and subscripts. Also, the index now only links to the page, and not to the proper paragraph or sentence, which might be possible in an ePub, but is now beyond the pale for this project.

It is definitely a half-way solution, but one that can be simplified to some extent within Sigil, by using the Saved Searches functionality.

The whole index thing took me about two hours, from start to finish, for a 4k entry index. One book down, fourteen (or more) to go.

I am posting this as a hack, but in this community it is likely there are minds that can spot weaknesses in my approach. Feel free to shoot holes, and help me out for the remainder of the series.

------
As an edit: for those who come after, looking for a solution, there are valuable comments in the thread.

Outstanding ones from my perspective have been the realization that there are indesign plugins that export the page numbers into epub, here:

Quote:

Originally Posted by BeckyEbook
http://epubsecrets.com/why-i-use-page-list-and-how.php
http://epubsecrets.com/page-list-all...e-doing-it.php

The link to the script is dead, so I'm listing it from web archive:
http://web.archive.org/web/201912181...orohikoscripts

You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or Linkedin), which AFAIK includes the PageStaker and EPUBOgrify script.

Saving the best for last, there is actually a script that links "dead" indexes to their indesign page numbers, and exports them to epub. It's called LiveIndex, and can be bought here. (I have no relationship to this developer.)

https://www.id-extras.com/products/liveindex/

DiapDealer 11-20-2020 10:32 AM

Before you get impatient, Jon, and unilaterally declare this as irrelevant to Sigil... don't. I've chosen to leave it here, as there could be Sigil-specific ideas presented for improvement.

Doitsu 11-20-2020 01:22 PM

Quote:

Originally Posted by Ryn (Post 4059911)
All these books have large and essential indexes that need to make it into the epub, preferably linked.

Depending on the structure and format of your book you might find the following simple plugins that I wrote helpful:

FootnoteLinker

PageList

Incremental IDs

Also check out all other Sigil plugins.

Tex2002ans 11-20-2020 01:36 PM

Quote:

Originally Posted by Ryn (Post 4059911)
All these books have large and essential indexes that need to make it into the epub, preferably linked.

In ebooks, "Real Page Numbers" (RPNs) are a pain and aren't as helpful as they seem...

And Linked Indexes, to do them properly, require a massive amount of manual intervention/cleanup.

Over the years, Me + Hitch have written an enormous amount on these two topics. See some of the latest discussion from earlier this year:

(And if you want to know more about RPNs + Indexes in ebooks, read/follow all those links to all the other threads where we cover every pro/con from every angle.)

Quote:

Originally Posted by Ryn (Post 4059911)
As the indexes were created by hand, and outside of indesign, that program does not know how to make them active. It would be really nice if indesign knew how to link page numbers, which it self-creates, to a list of page numbers in the same document. Sadly, it does not.

Even if you created InDesign Indexes, InDesign still doesn't export all the proper page numbers (pageList + page-list) to EPUB.

Back in 2016, David Kudler asked a similar question in "Getting InDesign to export pagelists to ePub3 (reflowable)".

I pointed to a 2015 article written by Joshua Tallent "How to Add a page-list to an EPUB" (now dead, so here's an Archive.org backup) + a 2015 article from EPUBSecrets, "Page List: All the Cool Ebook Developers Are Doing It".

To my knowledge, not much has really changed since.
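For anyone wondering what a page-list actually looks like in the file: here is a minimal sketch that builds an EPUB3 page-list nav from page markers already present in the XHTML. It assumes markers of the form <span epub:type="pagebreak" id="page1" title="1"/> with the attributes in exactly that order; the function name build_page_list and the (href, text) input shape are made up for illustration. The EPUB2 NCX pageList is not covered.

```python
import re
from html import escape

# Assumption: markers look exactly like
# <span epub:type="pagebreak" id="page1" title="1"/>
PAGEBREAK_RE = re.compile(
    r'<span\s+epub:type="pagebreak"\s+id="([^"]+)"\s+title="([^"]+)"\s*/>'
)

def build_page_list(files):
    """Build a page-list nav from (href, xhtml_text) pairs in reading order."""
    items = []
    for href, text in files:
        for anchor_id, label in PAGEBREAK_RE.findall(text):
            items.append(
                f'    <li><a href="{escape(href)}#{escape(anchor_id)}">{escape(label)}</a></li>'
            )
    return (
        '<nav epub:type="page-list" hidden="">\n  <ol>\n'
        + "\n".join(items)
        + "\n  </ol>\n</nav>"
    )
```

The resulting nav element would go into the EPUB3 navigation document next to the table of contents, with the hrefs adjusted to be relative to that file.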

Many of those methods require you to put some sort of tag/character at the end of each page, then convert that using some outside tool, then manually link all the Index links.

Just a few months ago, I wrote one such method in "Create index on epub from printed book".

Quote:

Originally Posted by Ryn (Post 4059911)
[...] This gives me the page numbers, which are carried with all the other header and footer stuff into the word file.

I then did the following:

Great, great... I'll give some rolling commentary on your steps.

Quote:

Originally Posted by Ryn (Post 4059911)
1. Exporting the Word document to ePub with OpenOffice Writer's writer2ePub plugin. This yields an epub with one xhtml document per page.

Okay. A page-per-XHTML-file might be a good way to leave some sort of marker on every page.

Then you could potentially use the method I pointed out above in "Create index on epub from printed book", then use Doitsu's "Incremental IDs plugin":

Chapter01:

Code:

<h1>Chapter 1</h1>
<p>This is the beginning.</p>
<p class="pagenum">1</p> <--- Footer exported from the original document.

Convert with Incremental IDs (or Regex):

Code:

<h1>Chapter 1</h1>
<p>This is the beginning.</p>
<span epub:type="pagebreak" id="page1" title="1"/>

Move Top:

Code:

<span epub:type="pagebreak" id="page1" title="1"/>
<h1>Chapter 1</h1>
<p>This is the beginning.</p>

Note: That <span epub:type> format is EPUB3. In EPUB2 you would use slightly different code.
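If you'd rather script the convert-and-move steps than regex them by hand, a rough Python sketch could look like this. It assumes each page's body contains exactly one footer of the form <p class="pagenum">N</p>; the function name move_pagenum_to_top is hypothetical, and real exports will almost certainly need a looser pattern.

```python
import re

# Assumption: the exported footer is a paragraph like <p class="pagenum">12</p>.
PAGENUM_RE = re.compile(r'\s*<p class="pagenum">(\d+)</p>')

def move_pagenum_to_top(body):
    """Swap the footer page number for an EPUB3 pagebreak span at the top of the page."""
    match = PAGENUM_RE.search(body)
    if match is None:
        return body  # no footer found; leave the page untouched
    number = match.group(1)
    marker = f'<span epub:type="pagebreak" id="page{number}" title="{number}"/>'
    # Cut the footer out, then put the marker in front of what is left.
    without_footer = body[:match.start()] + body[match.end():]
    return marker + "\n" + without_footer.lstrip("\n")
```

Run it over every per-page file before merging, so each marker ends up at the start of the page it belongs to.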

Quote:

2. In Sigil, regexing the page numbers of the book to a self-closing <a> tag with the proper id - like pxxx - and moving them into the topmost paragraph tag of the page.

That works too... If you are able to export reliable headers/footers.

Although that's usually a lot of extra cruft you usually have to clean up and sift through.

Depending on the document, it may be best to trash all the header/footers (or not export them at all), then renumber from scratch using some other tools.

Quote:

3. Then merging all the pages that are not the index into one xhtml file. Doing the same for the pages that are part of the index.

Yep, this is great. Let Sigil/Calibre do the hard work for you:

1. You merge the entire book into one or two monolithic XHTML file/s.

Let's call them:
  • merged.xhtml
  • Index.xhtml

2. You can then add your <a>s around all your page numbers in your Index:

Index (Before):

Code:

<p>Dogs, 1</p>

Index (After Regex):

Code:

<p>Dogs, <a id="index-dog-1" href="../Text/merged.xhtml#page1">1</a></p>

3. Then use Sigil to split the XHTML files into their individual chapters, and all the anchors will be automatically updated to the files/chapters they belong:

Code:

<p>Dogs, <a id="index-dog-1" href="../Text/Chapter01.xhtml#page1">1</a></p>

Quote:

4. Iterating through the index file, using regexes to find page numbers and link them to the proper anchor inside the book document. There are ins and outs to this, that I will gloss over here.

Yeah, it's a complicated mess.

Forms like "381–385" vs. "381–5" OR "385n10".

Indexes are extremely information dense and come in many variations, and usually it's not just a simple page number.

I went into some details in the "Create Index [...]" topic above. (Plus definitely in the famous "Real Page Numbers" topics.)
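To make those forms concrete, here is a hedged Python sketch that links the locators at the end of an index entry to the start page of each reference. It only understands plain pages, ranges written with a hyphen or en dash (full or abbreviated), and note suffixes like "n10"; it assumes the locators follow the heading's final ", " and that anchors are named page1, page2, ... in merged.xhtml. The names parse_locator and link_entry are invented for this example, and cross-references, roman numerals and the like would still need manual review.

```python
import re

# One locator: a page, an optional range end after a hyphen or en dash,
# and an optional note suffix such as "n10".
LOCATOR_RE = re.compile(r'(\d+)(?:[-\u2013](\d+))?(?:n(\d+))?')
# The run of locators at the end of an entry, after the heading's final ", ".
TAIL_RE = re.compile(
    r'(?<=, )\d+(?:[-\u2013]\d+)?(?:n\d+)?(?:,\s*\d+(?:[-\u2013]\d+)?(?:n\d+)?)*$'
)

def parse_locator(text):
    """Return (start, end, note) for '381', '381-385', '381-5' or '385n10'."""
    start, end, note = LOCATOR_RE.fullmatch(text).groups()
    if end is None:
        end = start
    elif len(end) < len(start):
        # Abbreviated range: '381-5' means 381 to 385.
        end = start[:len(start) - len(end)] + end
    return int(start), int(end), int(note) if note else None

def link_entry(entry, target="../Text/merged.xhtml"):
    """Wrap each trailing locator of one index entry in a link to its start page."""
    tail = TAIL_RE.search(entry)
    if tail is None:
        return entry  # e.g. a 'see also' line; leave it for manual review

    def repl(match):
        start, _end, _note = parse_locator(match.group(0))
        return f'<a href="{target}#page{start}">{match.group(0)}</a>'

    linked = LOCATOR_RE.sub(repl, tail.group(0))
    return entry[:tail.start()] + linked + entry[tail.end():]
```

Here link_entry works on the text inside the <p>, so "Dogs, 1" becomes "Dogs, " followed by a link to #page1. The parsed end page and note number are not used for linking in this sketch, but they are handy for sanity checks (for example, flagging a range whose end is smaller than its start).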

Quote:

5. Finally, splitting the book up into its logical chapters - usually one xhtml file per chapter, same with the notes etc. Thankfully, Sigil knows how to manage the links, once xhtml files are broken up.

:thumbsup:

And once you linkify the Index, just be careful of the ~300KB soft filesize limit for EPUB. Sometimes the Indexes get so large, you have to split them into 2 or more files.
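A quick way to spot files that have grown past that size is to list them straight from the EPUB, which is just a zip archive. This is only a sketch: the function name oversized_documents is made up, and the 300 KB default simply mirrors the rough figure above rather than any hard rule.

```python
import zipfile

def oversized_documents(epub_path, limit=300 * 1024):
    """Return (name, size) for every XHTML/HTML file larger than `limit` bytes (uncompressed)."""
    with zipfile.ZipFile(epub_path) as book:
        return [
            (info.filename, info.file_size)
            for info in book.infolist()
            if info.filename.lower().endswith((".xhtml", ".html", ".htm"))
            and info.file_size > limit
        ]
```

Anything it reports is a candidate for splitting in Sigil.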

Quote:

6. Sadly, because of the PDF export, there is a lot of cleaning still to do, as hyphenation is not understood by Acrobat's PDF reading and exporting system. Also, there will be headers and/or footers, lots of unwanted hard and soft returns, whitespace, and no styles except (hopefully) italics, bold, super and subscripts.

Yep, having a clean source document is the most important step.

I discussed a lot of that back in 2019, "Workflow for simultaneous EPUB and PDF production?"

Quote:

Also, the index now only links to the page, and not to the proper paragraph or sentence, which might be possible in an ePub, but is now beyond the pale for this project.

... yeah, down to the sentence/paragraph-level ain't happening any time soon... Me + Hitch have also discussed that one to death. :D

It would require the Indexer to have access to the actual source files + the perfect mix of skills that very few are even equipped for.

Page numbers and/or Chap.Subchap are about as good as you're going to get.

Side Note: For just a piece of that discussion, look at my 2016 Post #129 from "Sick of Amazon Kindle books without Page Numbers...". I came up with this concept of "Format-Specific" and "Format-Neutral", and I still think it's a genius analysis. :D

Hitch 11-20-2020 03:04 PM

INDEXES!

"Niagara Falls, slowly I turn, step-by-step..."

OH, NEVER MIND. I was trying to put the Abbott & Costello Niagara Falls video in here. If you want to see it, go here: https://www.youtube.com/watch?v=8KpsUlvzbkk



Hitch, frequent Index Victim

DNSB 11-20-2020 03:15 PM

Quote:

Originally Posted by Hitch (Post 4060123)
INDEXES!

"Niagara Falls, slowly I turn, step-by-step..."

OH, NEVER MIND. I was trying to put the Abbott & Costello Niagara Falls video in here. If you want to see it, go here: https://www.youtube.com/watch?v=8KpsUlvzbkk


Hitch 11-20-2020 03:57 PM

Quote:

Originally Posted by DNSB (Post 4060128)

Thank you. I don't know why I seem to have a brain-blank about how to make it work in this forum.

Hitch

JSWolf 11-20-2020 05:34 PM

Quote:

Originally Posted by Hitch (Post 4060123)
INDEXES!

"Niagara Falls, slowly I turn, step-by-step..."

OH, NEVER MIND. I was trying to put the Abbott & Costello Niagara Falls video in here. If you want to see it, go here: https://www.youtube.com/watch?v=8KpsUlvzbkk



Hitch, frequent Index Victim

Three Stooges Niagara Falls.


Ryn 11-21-2020 07:54 AM

Quote:

Originally Posted by Tex2002ans (Post 4060079)
In ebooks, "Real Page Numbers" (RPNs) are a pain and aren't as helpful as they seem...

I hope you don't mind if I'm not quoting and responding to the entire post. I'm all in favor of being complete, but legibility is also a concern.

First of all, thanks for your extensive reply.

Second, I'm not at all interested in having parity between book page numbers and e-book page numbers - RPNs as you call them. I have tried implementing such things from time to time, but have found reader implementation spotty, and can see little added value for either reader or publisher.

You mention some alternatives to the steps I have used; I will investigate them further when I have occasion to do so - it's always good to have multiple paths home :)

As the rest of this project consists of indesign files, and I generally don't export those to epub page-by-page, I was considering working from the PDFs that indesign outputs. Converting them to word docs with acrobat, and then exporting those to epub using oowriter seems the way to go to easily get the page numbers. It remains a hassle to clean all the cruft out, though.

I'd welcome a way to do things more easily through indesign, perhaps using the method you mention where each page gets a special character which Sigil can replace with page-break tags, and then to use the Sigil plugin for serialized ids you mention.

As I never use indesign for anything except making epub exports, I would welcome some input as to how to go about this in indesign. It's not my favorite program, although my limited experiences with quark have managed to knock the adobe product off the utmost bottom rank.

Tex2002ans 11-22-2020 02:25 AM

Quote:

Originally Posted by Ryn (Post 4060340)
First of all, thanks for your extensive reply.

No problem. (I'm slightly famous around here for that. :D)

Quote:

Originally Posted by Ryn (Post 4060340)
As the rest of this project consists of indesign files, and I generally don't export those to epub page-by-page, I was considering working from the PDFs that indesign outputs. Converting them to word docs with acrobat, and then exporting those to epub using oowriter seems the way to go to easily get the page numbers. It remains a hassle to clean all the cruft out, though.

Export to EPUB directly from InDesign, then clean up the HTML from there. This will carry over all the original markup (Headings, Italics, Smallcaps, etc.).

Remember, a book isn't just pure text, the underlying formatting is just as important. :)

PDF is one of the worst input formats there is, and you'll lose much of the original markup + introduce errors and other junk while converting to any other formats.

It's almost always better to go from:

Source -> EPUB (Directly)

than to do:

Source -> PDF -> Word -> EPUB

where each step in the chain may introduce more issues.

* * *

If InDesign File Is Using Styles

Great. You're going to have an easier job.

In InDesign, there's such a thing as Style Mapping:

If InDesign File Is NOT Using Styles

Prepare for pain... :D

(This is the more likely scenario, since 99%+ of people who use InDesign/Word/LibreOffice don't know or use Styles when designing documents.)

You'll have to manually clean up all the code, and every single book is going to generate wildly different cruft. And boy, oh boy, does InDesign love to generate iBooks-friendly bloat in their CSS.

Side Note: On Styles...

I've also written about Why/How Styles are so important, most recently:

I think this is #1 the most important step there can be. Clean input helps EVERY single step down the line.

If people designed their documents with Styles+Accessibility in mind first, it would make everyone's life much easier. :)

(While steps between programs are different, the Styles concept is similar across all.)

Quote:

Originally Posted by Ryn (Post 4060340)
As I never use indesign for anything except making epub exports, I would welcome some input as to how to go about this in indesign. It's not my favorite program, although my limited experiences with quark have managed to knock the adobe product off the utmost bottom rank.

And I also try to get everything out of InDesign ASAP.

~100% of the InDesign work I get is... directly formatted... so it's a mess. I've only met one designer who actually used InDesign with proper Styles.

Quote:

Originally Posted by Ryn (Post 4060340)
Second, I'm not at all interested in having parity between book page numbers and e-book page numbers - RPNs as you call them.

But I'm scratching my head over here...

If you're not using RPNs, then what are the clickable links you're trying to accomplish in the Index?

Are you trying to do a:

Code:

Cats, [1], [2], [3]
Dogs, [1], [2], [3]

(Sigil Index style?)

* * *

But RPNs do serve some purpose, especially for Accessibility reasons (blind readers) + citations, book clubs, etc.

And for Linked Indexes, page #s seem to make a lot more sense.

Quote:

Originally Posted by Ryn (Post 4060340)
I have tried implementing such things from time to time, but have found reader implementation spotty, and can see little added value for either reader or publisher.

On Usability of these "many-to-one"-type links...

In your favorite search engine, type:

Code:

many-to-one Hitch site:mobileread.com

That'll lead you to many threads over the years where Hitch discusses them. This issue is severe in Glossaries, Indexes, and even sometimes Footnotes.

Ryn 12-04-2020 06:51 AM

Hi Tex. Apologies for my late answer. Somehow, I didn't get notified of your reply, and as I do not visit MR every day, here we are.

I do use indesign for limited things like exporting files to epub, and am familiar with the mechanics and best practices of that route. Of course, it is often way easier to go the direct route than through PDF.

Perhaps my question was not as clear as it could have been. What I really wanted to know was: how do you export the page numbers in an indesign export to epub? It's not an option in the export dialogue box, nor is it something I can easily put together using indesign's byzantine search module.

Is there another way? It would be nice, seeing as most of the volumes in this project are in indesign. The reason I went the PDF route in the OP was because that particular volume was created in some quark version I do not possess.

Hitch 12-04-2020 12:25 PM

Quote:

Originally Posted by Ryn (Post 4065229)
Hi Tex. Apologies for my late answer. Somehow, I didn't get notified of your reply, and as I do not visit MR every day, here we are.

I do use indesign for limited things like exporting files to epub, and am familiar with the mechanics and best practices of that route. Of course, it is often way easier to go the direct route than through PDF.

Perhaps my question was not as clear as it could have been. What I really wanted to know was: how do you export the page numbers in an indesign export to epub? It's not an option in the export dialogue box, nor is it something I can easily put together using indesign's byzantine search module.

Is there another way? It would be nice, seeing as most of the volumes in this project are in indesign. The reason I went the PDF route in the OP was because that particular volume was created in some quark version I do not possess.

I'm sorry, I literally feel an idiot asking this, but: do you mean, when you ask about the "page numbers," the page numbers for the actual pages? From INDD to ePUB? Is that what you're asking about?

Hitch

Ryn 12-04-2020 03:15 PM

Quote:

Originally Posted by Hitch (Post 4065334)
I'm sorry, I literally feel an idiot asking this, but: do you mean, when you ask about the "page numbers," the page numbers for the actual pages? From INDD to ePUB? Is that what you're asking about?

Hitch

Yup. For linking an extensive and essential index, created outside of indesign and which indesign does not recognize.

Edit: not one index in fact but dozens, in a big project.

BeckyEbook 12-04-2020 03:41 PM

I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

http://epubsecrets.com/why-i-use-page-list-and-how.php
http://epubsecrets.com/page-list-all...e-doing-it.php

The link to the script is dead, so I'm listing it from web archive:
http://web.archive.org/web/201912181...orohikoscripts

You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or Linkedin), which AFAIK includes the PageStaker and EPUBOgrify script.
The latter is not so important anyway, because it is a simple change that can be done in Sigil.

Hitch 12-04-2020 03:47 PM

Quote:

Originally Posted by Ryn (Post 4065398)
Yup. For linking an extensive and essential index, created outside of indesign and which indesign does not recognize.

Edit: not one index in fact but dozens, in a big project.


You don't. There isn't any easy or magic or automatic way to export the RPNs (Real Page Numbers). You create them manually.

You open up the ePUB; you open up the PDF. You find the first page-end. You search for that bit of text--typically, 5-10 characters will do. When you find it, you create the anchor, like P01, P02, etc.

Then, after all the anchors are done, you write a script if you're lucky (or do it manually if you aren't) that links all the index entries that go to page 1 to P01, all the index entries that go to page 2 to P02, and so forth.

That's it. Knowing Tex, he has some mad coding that will do some of this more easily than I've described, but that's the fundamental process, right there. And it's entirely possible that there are Sigil or Calibre addins that already do the 2nd part, the linking part, that I don't know about, as my band of Merry Minions use our internal, proprietary clips/programs to do that.

That's the basic procedure.

Hitch

Hitch 12-04-2020 03:49 PM

Quote:

Originally Posted by BeckyEbook (Post 4065405)
I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

http://epubsecrets.com/why-i-use-page-list-and-how.php
http://epubsecrets.com/page-list-all...e-doing-it.php

The link to the script is dead, so I'm listing it from web archive:
http://web.archive.org/web/201912181...orohikoscripts

You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or Linkedin), which AFAIK includes the PageStaker and EPUBOgrify script.
The latter is not so important anyway, because it is a simple change that can be done in Sigil.

See, there you go. I knew somebody would have some clips.

Hitch

phillipgessert 12-04-2020 05:04 PM

I have not tried this (and frankly even if it works it still sounds pretty miserable) but I wonder if you could work page-by-page unlocking whatever master page element includes the page number, and then use a plugin such as https://www.rorohiko.com/wordpress/i...ds/textstitch/ to auto-thread the page numbers into the document flow.

Hitch 12-04-2020 05:29 PM

Quote:

Originally Posted by phillipgessert (Post 4065430)
I have not tried this (and frankly even if it works it still sounds pretty miserable) but I wonder if you could work page-by-page unlocking whatever master page element includes the page number, and then use a plugin such as https://www.rorohiko.com/wordpress/i...ds/textstitch/ to auto-thread the page numbers into the document flow.

Oh, Phillip! Who knew you had such masochistic leanings? Quick--what's your safe word?

:2thumbsup

Hitch

KevinH 12-04-2020 05:39 PM

If you have a PDF of the printed version of the book, you should be able to print to a postscript file and use python on that postscript file to extract the page numbers and the first n words and last n words on each page (where n is small, say 3) and save that info to a file. Then use sed or some other stream editor with that info to insert the markers you want in each html file.

Some custom programming in python might be needed but should be reusable for future projects.
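As a rough illustration of the extraction step, here is a sketch that works on a plain-text export rather than on postscript itself. It assumes pages are separated by form-feed characters and that the printed page number, when present, is the last non-empty line of the page; both are assumptions to check against the real export. The name page_fingerprints and the record layout are invented for this example.

```python
def page_fingerprints(text, n=3):
    """For each page, record its printed number plus its first and last n words."""
    pages = text.split("\f")
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty chunk after a trailing form feed
    records = []
    for page in pages:
        lines = [line.strip() for line in page.splitlines() if line.strip()]
        if not lines:
            # Blank page: keep a placeholder so the page sequence stays intact.
            records.append({"label": None, "first": "", "last": ""})
            continue
        # Assumption: a purely numeric last line is the footer page number.
        label = lines[-1] if lines[-1].isdigit() else None
        body = lines[:-1] if label else lines
        words = " ".join(body).split()
        records.append({
            "label": label,
            "first": " ".join(words[:n]),
            "last": " ".join(words[-n:]),
        })
    return records
```

The records can be dumped to a file and then used to search the html for each page's first words and drop a marker there.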

Tex2002ans 12-04-2020 06:39 PM

Quote:

Originally Posted by BeckyEbook (Post 4065405)
I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

Fantastic. Thanks for sharing.

Quote:

Originally Posted by Ryn (Post 4065229)
What I really wanted to know was: how do you export the page numbers in an indesign export to epub? It's not an option in the export dialogue box, nor is it something I can easily put together using indesign's byzantine search module.

Is there another way?

BeckyEbook's links would likely work as well. Those 2018 articles are probably better + more modern than the older articles I linked in my Post #4.

Quote:

Originally Posted by Hitch (Post 4065408)
You don't. There isn't any easy or magic or automatic way to export the RPNs (Real Page Numbers). You create them manually.

Yep, you would think it would be a checkbox in InDesign... especially with how much Adobe talks Accessibility.

Quote:

Originally Posted by Hitch (Post 4065408)
Knowing Tex, he has some mad coding that will do some of this more easily than I've described, but that's the fundamental process, right there.

I see you didn't read all the links in my earlier post #4!

(Typical Hitch, never reading anything I write! :rofl:)

Anyway, it's not anything I've ever done in an actual EPUB, just theoretical musings. All the logic is sound though. :D

Luckily, I haven't had to do an Index in a very long time.

Ryn 12-04-2020 06:54 PM

Quote:

Originally Posted by BeckyEbook (Post 4065405)
I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

http://epubsecrets.com/why-i-use-page-list-and-how.php
http://epubsecrets.com/page-list-all...e-doing-it.php

The link to the script is dead, so I'm listing it from web archive:
http://web.archive.org/web/201912181...orohikoscripts

You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or Linkedin), which AFAIK includes the PageStaker and EPUBOgrify script.
The latter is not so important anyway, because it is a simple change that can be done in Sigil.

Thank you for digging up the web archive link for me. This might be just what I'm looking for.

Quote:

Originally Posted by KevinH
If you have a PDF of the printed version of the book, you should be able to print to a postscript file and use python on that postscript file to extract the page numbers and the first n words and last n words on each page (where n is small, say 3) and save that info to a file. Then use sed or some other stream editor with that info to insert the markers you want in each html file.

Some custom programming in python might be needed but should be reusable for future projects.

This might also be something I'd consider doing, seeing how big the project is, and how much I loathe working from PDFs. By postscript file, do you mean a text file?

I can see some potential problems with this, as the page numbers are on the bottom of the pages, and some pages are empty, which may confuse the issue, but that might be something I could prompt for.

Quote:

Originally Posted by phillipgessert
I have not tried this (and frankly even if it works it still sounds pretty miserable) but I wonder if you could work page-by-page unlocking whatever master page element includes the page number, and then use a plugin such as https://www.rorohiko.com/wordpress/i...ds/textstitch/ to auto-thread the page numbers into the document flow.

Or I could try my hand at programming an indesign plugin for this express purpose. How hard could it be to get a script to recognize the page numbers, and to cross-reference the indexed page numbers to the first word on the relevant page? Famous last words, I'm sure...

--
Food for thought here folks, thanks a lot!

KevinH 12-04-2020 07:37 PM

Yes, a postscript (.ps) file is typically generated by a postscript printer driver (it is what is spooled and sent to a postscript-compatible printer) and contains, as text, the commands that the printer uses to actually print. There are many open-source tools (ghostscript, cups, pdf2ps on Linux) and commercial tools (Adobe Acrobat Pro) that can do this. A ps file is very similar to a pdf file that has been decoded and uncompressed.

Extracting the info you need from one should be relatively easy. As for missing page labels (or numbers) these can easily be filled in and interpolated correctly from known page labels or values using a simple spreadsheet or software as it is a one-to-one mapping.
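The interpolation can be as simple as counting forward and backward from the labels that are known. A minimal sketch, assuming consecutive arabic page numbers (roman-numeral front matter would need separate handling); fill_missing_labels is a hypothetical helper name.

```python
def fill_missing_labels(labels):
    """Fill None gaps in a list of integer page labels from their known neighbours."""
    filled = list(labels)
    # Forward pass: a missing page is one more than the page before it.
    for i in range(1, len(filled)):
        if filled[i] is None and filled[i - 1] is not None:
            filled[i] = filled[i - 1] + 1
    # Backward pass: fill any leading gap by counting down from the first known label.
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None and filled[i + 1] is not None:
            filled[i] = filled[i + 1] - 1
    return filled
```

It does not verify that the known labels agree with each other, so a spot check against the PDF is still worthwhile.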

If you can use pdf2ps on your machine it should be easy enough to look at the postscript file in a text editor and look for "showpage" and decide for yourself how hard it would be to extract just what you need.

Hitch 12-04-2020 08:25 PM

Quote:

Originally Posted by Tex2002ans (Post 4065465)

I see you didn't read all the links in my earlier post #4!

(Typical Hitch, never reading anything I write! :rofl:)

Oh, you are being a [something or other]. You know I do, in fact, read all your stuff. (Go ahead, I defy you to find other human beings that do!). I just don't always remember every single thing. And...well, yes, I may on occasion skim. It's not like you write 300-word Blogger posts, now, is it. You're like the Anti-Twitter. You're the one guy I can always count on to make me look terse, my brother.

Quote:

Anyway, it's not anything I've ever done in an actual EPUB, just theoretical musings. All the logic is sound though. :D

And there it is. :-) That stuff is just bloody tedious. I think it would be fun to write programming or clips, etc., to do it...but HAVING to do it, commercially, is the dog's south end.

Quote:

Luckily, I haven't had to do an Index in a very long time.

You ARE lucky!

Hitch

Ryn 12-05-2020 04:54 AM

Quote:

Originally Posted by KevinH (Post 4065479)
Yes, a postscript (.ps) file is typically generated by a postscript printer driver (it is what is spooled and sent to a postscript-compatible printer) and contains, as text, the commands that the printer uses to actually print. There are many open-source tools (ghostscript, cups, pdf2ps on Linux) and commercial tools (Adobe Acrobat Pro) that can do this. A ps file is very similar to a pdf file that has been decoded and uncompressed.

Extracting the info you need from one should be relatively easy. As for missing page labels (or numbers) these can easily be filled in and interpolated correctly from known page labels or values using a simple spreadsheet or software as it is a one-to-one mapping.

If you can use pdf2ps on your machine it should be easy enough to look at the postscript file in a text editor and look for "showpage" and decide for yourself how hard it would be to extract just what you need.

I have looked at this, but both pdf2ps and acrobat's postscript output are heavy on code, and feature encrypted text, rendering those avenues relatively useless.

Acrobat also exports to txt, rtf, doc, docx etc, so I could imagine writing a python script that analyses such a file.

That might give me the strings I could use to iterate through an epub html file and add anchor tags, that I could then link the index entries to. I'd need to account for whitespace, the existence of potential html tags, and other things, probably, but this seems relatively straightforward.
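A sketch of what that matching could look like, under the assumption that the first few words of each page are known as plain text: the pattern lets whitespace and inline tags sit between the words, and an empty anchor (id like p12) is inserted in front of the first match. The names snippet_pattern and insert_anchor_before are made up; entities, hyphenation differences and words that also occur inside attribute values are not handled.

```python
import re

# Whitespace and/or tags are allowed between two consecutive words.
GAP = r'(?:\s|<[^>]+>)+'

def snippet_pattern(snippet):
    """Compile a pattern matching the snippet's words even when tags or line breaks separate them."""
    words = [re.escape(word) for word in snippet.split()]
    return re.compile(GAP.join(words))

def insert_anchor_before(xhtml, snippet, page, start=0):
    """Insert an empty anchor before the first match of `snippet` at or after `start`.

    Returns (new_xhtml, position_after_anchor), or (xhtml, None) if nothing matched.
    """
    match = snippet_pattern(snippet).search(xhtml, start)
    if match is None:
        return xhtml, None
    anchor = f'<a id="p{page}"></a>'
    new_xhtml = xhtml[:match.start()] + anchor + xhtml[match.start():]
    return new_xhtml, match.start() + len(anchor)
```

Feeding the returned position back in as `start` for the next page keeps the search moving forward through the file, which helps when the same three words occur more than once in the book.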

With some sophistication - as the indices feature page ranges and note numbers, too - I might be able to automate the whole thing.

Seeing as there are thousands of pages - and thousands of index entries per volume - here, it's definitely worth a try.

edit: oh, no, scratch that. There are footnotes, a lot of them, clouding the issue in the PDF2xxx output, which I need to disregard without losing the numbered lists. Also, the page numbers are in the footers. And of course there are also headers, which I should also disregard. At this point, perhaps I am better off just working from the PDF in the first place, which is not so bad all things considered.

Doitsu 12-05-2020 07:16 AM

@Ryn: Do you know that Sigil has a built-in index tool? To use it, all you have to do is generate a plain text file with index entries, e.g. index.txt, and select the following:
  • Tool > Index > Index Editor...
  • Right-click > Open > index.txt > Save
  • Tool > Index > Create Index
Obviously the index entries won't have page numbers, but you might be able to add them later with a custom Python script or a plugin.

Ryn 12-05-2020 08:54 AM

Quote:

Originally Posted by Doitsu (Post 4065606)
@Ryn: Do you know that Sigil has a built-in index tool? To use it, all you have to do is generate a plain text file with index entries, e.g. index.txt, and select the following:
  • Tool > Index > Index Editor...
  • Right-click > Open > index.txt > Save
  • Tool > Index > Create Index
Obviously the index entries won't have page numbers, but you might be able to add them later with a custom Python script or a plugin.

Hi there Doitsu, yeah I did know that. And I have considered using it.

Thing is, these are custom-built indexes with thousands of entries per volume. I doubt I could do remotely as good a job as the original indexers, who likely spent upward of forty hours on each index.

Needless to say, these books were not created to make any profit whatsoever, and that also goes for the digital edition we're currently putting together. The foundation which has enlisted my help has as a core value the dissemination of these texts, and keeping them safely available for future generations.

I generally dissuade clients from including indexes, but in this case I am willing to make an exception. And I personally resonate with the subject, so my participation is not a chore at all.

That being said, I dislike unnecessary monotonous labor as much as most people, if not more, so being smart about it and using tech to my advantage, I'm all for that!

Ryn 12-05-2020 09:26 AM

Quote:

Originally Posted by BeckyEbook (Post 4065405)
I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

http://epubsecrets.com/why-i-use-page-list-and-how.php
http://epubsecrets.com/page-list-all...e-doing-it.php

The link to the script is dead, so I'm listing it from web archive:
http://web.archive.org/web/201912181...orohikoscripts

You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or LinkedIn), which AFAIK includes the PageStaker and EPUBOgrify scripts.
The latter is not so important anyway, because it is a simple change that can be done in Sigil.

This turns out to work like a charm, even when exporting to ePub 2, which I feared might be a problem.

Thanks Becky!

Now I just need to write me some python logic to rid myself of the task of manually linking the index to the pages. But that's the fun part :)

Laura mentioned another script that might actually serve my purpose even better: LiveIndex, found here: https://www.id-extras.com/products/liveindex/

I'm mentioning it, just in case anyone else ever comes across a similar use case.
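For the index-linking step, a rough sketch of what that python logic could look like. Everything specific here is an assumption: that page anchors carry ids of the form `pNNN` (as in the pxxx scheme from the first post), that the book text lives in a single hypothetical `book.xhtml`, and that locators sit at the end of each entry as comma-separated pages, optionally as ranges ("60-62") or with a note suffix ("199n3").

```python
import re

# Assumed locator shape: page, optional "-page" range, optional "nN" note.
LOCATOR = re.compile(r"\b(\d+)(?:-(\d+))?(n\d+)?\b")
# The trailing run of ", locator" items at the end of an entry.
TAIL = re.compile(r"(?:,\s*\d+(?:-\d+)?(?:n\d+)?)+\s*$")

def link_page(page, book_file):
    """Wrap one page number in a link to its (assumed) pNNN anchor."""
    return '<a href="%s#p%s">%s</a>' % (book_file, page, page)

def linkify_entry(entry, book_file="book.xhtml"):
    """Link every page number in the locator tail of one index entry."""
    tail = TAIL.search(entry)
    if tail is None:
        return entry  # no locators found, e.g. a bare main heading

    def repl(m):
        first, last, note = m.group(1), m.group(2), m.group(3) or ""
        out = link_page(first, book_file)
        if last:
            out += "-" + link_page(last, book_file)
        return out + note  # the note suffix stays as plain text

    return entry[:tail.start()] + LOCATOR.sub(repl, tail.group(0))

print(linkify_entry("Aristotle, 1, 60-62, 199n3"))
```

Only the tail of the entry is touched, so digits inside a heading are left alone. A heading that itself ends in a number followed by a comma would still be misread as a locator, so the output needs a check against the printed index.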

KevinH 12-05-2020 10:56 AM

I see you found your solution but for the record there are ps2txt and ps2ascii you can use to display these as well as this useful article I used in the past:

https://www.cs.waikato.ac.nz/~ihw/pa...tract-Text.pdf

It prepends a short and sweet extra postscript function to the original postscript which redefines the show methods to give you text output that would be easier to parse.

FWIW, I find working with "ps printer device" can extract text electronically that is very hard to get to in other ways without a scanner.

KevinH 12-05-2020 11:01 AM

And you need to worry that you are linking to the top of a "printed" page and not the word itself, which may appear nowhere on the actual screen, as that "printed page" will generally be much longer than the screen holds. So it will just get them close at best.

In many ways, a good search function replaces the need for indexes almost completely.


Quote:

Originally Posted by Ryn (Post 4065632)
This turns out to work like a charm, even when exporting to ePub 2, which I feared might be a problem.

Thanks Becky!

Now I just need to write me some python logic to rid myself of the task of manually linking the index to the pages. But that's the fun part :)


Ryn 12-05-2020 11:12 AM

Quote:

Originally Posted by KevinH (Post 4065653)
And you need to worry that you are linking to the top of a "printed" page and not the word itself, which may appear nowhere on the actual screen, as that "printed page" will generally be much longer than the screen holds. So it will just get them close at best.

In many ways, a good search function replaces the need for indexes almost completely.

Well, yes and no. Of course, I always use the exact same argument when attempting to dissuade clients from insisting on index inclusion.

But... While searching is active, it presupposes you know exactly what you are searching for, and you might not always know what you don't know.

An Index, otoh, has done this work for you, and then some. A good index will have collated different locations pertaining to the way "angular momentum" pertains to "diesel engines," for example. (Not even sure that that is a thing, but allow me the liberty.)

This passive searching allows for a deeper sense of discovery in books that are more encyclopedic in scope.

Not relevant to the vast majority of books that reach our devices, I would be the first to agree, but in some cases, very much a desirable addition.

Ryn 12-05-2020 11:16 AM

Quote:

Originally Posted by KevinH (Post 4065650)
I see you found your solution but for the record there are ps2txt and ps2ascii you can use to display these as well as this useful article I used in the past:

https://www.cs.waikato.ac.nz/~ihw/pa...tract-Text.pdf

It prepends a short and sweet extra postscript function to the original postscript which redefines the show methods to give you text output that would be easier to parse.

FWIW, I find working with "ps printer device" can extract text electronically that is very hard to get to in other ways without a scanner.

Thanks for this. As the subject of postscript is entirely new to me, I am saving this for a moment when I have the opportunity to delve deeper into it.

DiapDealer 12-05-2020 11:17 AM

I've always felt that traditional indices in ebooks were a bit pointless. Anachronistically so. If print book makers could have made pages automatically turn and words on those pages to glow simply by saying a word aloud, then they'd have done so, and print indices would probably never have become a thing. And we wouldn't now be seeing people being forced by client dollars to try to simulate what a simple search engine can do with hypertext markup and millions of hardcoded links to and fro.

But I digress. ;)

P.S. I've heard the "if you don't know what you need to search for" argument before, and I don't quite buy it. People who have no idea what they're looking for typically aren't looking for anything. And even if they were, manually wading through enormous, alphabetized, electronic indices is unlikely to focus their efforts very much.

Ryn 12-05-2020 11:49 AM

Quote:

Originally Posted by DiapDealer (Post 4065662)

P.S. I've heard the "if you don't know what you need to search for" argument before, and I don't quite buy it. People who have no idea what they're looking for typically aren't looking for anything. And even if they were, manually wading through enormous, alphabetized, electronic indices is unlikely to focus their efforts very much.

I think this is a limited way of perceiving the breadth with which readers engage with books. From personal experience, I can recall spending days fascinated with the mysteries of an encyclopedia. I wasn't looking for anything in particular, but was simply curious about learning new things.

In a way, I was plumbing new depths of understanding of the world I was born into. Not by the focused act of searching, but by discovery.

I think some - not many, but some - books lend themselves to that wide-ranging form of learning. I think, for instance, that people who are strongly motivated to deepen their understanding of their religion are much helped by guidance in the form of an index, or similar means.

DiapDealer 12-05-2020 12:25 PM

Quote:

Originally Posted by Ryn (Post 4065678)
I think this is a limited way of perceiving the breadth with which readers engage with books.

No. My eyes are wide open.

Quote:

Originally Posted by Ryn (Post 4065678)
From personal experience, I can recall spending days fascinated with the mysteries of an encyclopedia. I wasn't looking for anything in particular, but was simply curious about learning new things.

Same. Most of my childhood was spent in front of a massive set of the World Book Encyclopedia (complete with the Childcraft addons and annual supplements). I read them all from cover to cover (at random) because I was incredibly curious about learning everything. Indices didn't enter into that picture, or that joy I experienced. By the time I was old enough to start needing to use an index, I would have killed for a search engine to focus my efforts. Now that ebooks are here (complete with search engines), I have zero interest in using someone else's curated, compiled points of interest in electronic indexes. I never don't know what I want to search for in a particular book anymore. Because I'm rarely searching for anything in a book I've not already read from cover to cover.

The fact of the matter is: only print index lovers love electronic indices. *shrug*

DiapDealer 12-05-2020 12:29 PM

But I'm not here to discourage anyone from their electronic "indexical" pursuits. Just rambling, really. They're no skin off my back. :)

Ryn 12-05-2020 01:17 PM

Quote:

Originally Posted by DiapDealer (Post 4065702)
No. My eyes are wide open.

Same. Most of my childhood was spent in front of a massive set of the World Book Encyclopedia (complete with the Childcraft addons and annual supplements). I read them all from cover to cover (at random) because I was incredibly curious about learning everything. Indices didn't enter into that picture, or that joy I experienced. By the time I was old enough to start needing to use an index, I would have killed for a search engine to focus my efforts. Now that ebooks are here (complete with search engines), I have zero interest in using someone else's curated, compiled points of interest in electronic indexes. I never don't know what I want to search for in a particular book anymore. Because I'm rarely searching for anything in a book I've not already read from cover to cover.

The fact of the matter is: only print index lovers love electronic indices. *shrug*

I'm just saying that the index, especially in encyclopedic works, can function just as the encyclopediae we both loved so much growing up.

Another example would be an atlas, which many people don't use to find stuff but rather to educate themselves, to wander, or to heuristically discover new territory. I have yet to meet an atlas I have memorized, and use google earth to much the same effect.

Granted, indexes are clunky, abstract, and often quite subjective. Still, in the correct contexts, be it technical works, religious tomes, or dictionaries (basically one immense index), they can be quite useful.

The fact that you "never don't know what [you] want to search for in a particular book anymore" really is neither here nor there.

DiapDealer 12-05-2020 01:37 PM

Quote:

Originally Posted by Ryn (Post 4065728)
The fact that you "never don't know what [you] want to search for in a particular book anymore" really is neither here nor there.

Sure it is. :blink:

You may not agree with it, but it's just as valid and relevant as your assertion that electronic indices can't always be replaced with a search engine.

Quote:

Originally Posted by Ryn (Post 4065728)
Granted, indexes are clunky, abstract, and often quite subjective.

This was the better stopping point. It is my entire premise. The only thing missing is mention of the index's natural electronic successor: the search engine. Not everything needs to survive the medium shift. The joy and wonder of discovery and learning in the electronic era will easily survive the exclusion of what is, in essence, a vestigial print appendage.

Hitch 12-05-2020 02:18 PM

Quote:

Originally Posted by DiapDealer (Post 4065744)
Sure it is. :blink:

You may not agree with it, but it's just as valid and relevant as your assertion that electronic indices can't always be replaced with a search engine.

Hitch hesitantly raises her hand to address the assembly...

Index nerd here! I love me some indices and EVEN WHEN it's in an ebook, where, indeedy, it's useless, I like the fact that I can assess the index and see how many references to topic A were worth mentioning in the index; how many references to John Doe and so forth.


Quote:

This was the better stopping point. It is my entire premise. The only thing missing is mention of the index's natural electronic successor: the search engine. Not everything needs to survive the medium shift. The joy and wonder of discovery and learning in the electronic era will easily survive the exclusion of what is, in essence, a vestigial print appendage.
Yes, but as we all know, that's not an index, it's a concordance. Not the same thing. I would be the first to agree that by and large, including RPN-driven indices is worthless, but a search engine is not necessarily the natural electronic successor to an index; it's the successor to a concordance and as anyone who's tried to find something on Google knows, more isn't always better.

Hitch

DiapDealer 12-05-2020 03:42 PM

Quote:

Originally Posted by Hitch (Post 4065765)
Yes, but as we all know, that's not an index, it's a concordance. Not the same thing.

Quite debatable whether electronic search engine = print concordance. I would argue that the ability to search the entire text of a work has no print-based counterpart. But the ability to search a book's text certainly replaces the necessity for either a concordance or an index, in my opinion (and in my experience, since I've never once looked at, clicked on, or otherwise engaged with an electronic index or concordance). That's why searching is such a game changer in the P2E medium shift.

Quote:

Originally Posted by Hitch (Post 4065765)
I would be the first to agree that by and large, including RPN-driven indices is worthless,

Ding, ding ding.

Quote:

Originally Posted by Hitch (Post 4065765)
but a search engine is not necessarily the natural electronic successor to an index; it's the successor to a concordance

Ok. Perhaps successor was a poor choice of words. Suffice it to say that it is my opinion that a search engine eliminates the need for either (again: in my experience).

Quote:

Originally Posted by Hitch (Post 4065765)
and as anyone who's tried to find something on Google knows, more isn't always better.

Nope. Can't agree there. I'll take way too much information over someone else's notions of what they think I should be focusing on any day of the week. I'm not one who really believes (or succumbs to) information overload. Give me raw, unadulterated, and uncurated search results (whether online, or limited to the text of an ebook) and I'll take it from there--thank you very much.

Another part of my problem with electronic indices and concordances stems from the fact that their entire reason for being has been changed entirely in the electronic medium shift. They went from being purely reference-based, to purely navigation-based. Navigation aids I don't need. Page-turns and searching suffice.

Tex2002ans 12-06-2020 05:01 AM

Quote:

Originally Posted by Hitch (Post 4065485)
Oh, you are being a [something or other]. You know I do, in fact, read all your stuff. (Go ahead, I defy you to find other human beings that do!). I just don't always remember every single thing. And...well, yes, I may on occasion skim. It's not like you write 300-word Blogger posts, now, is it. You're like the Anti-Twitter.

:rofl:

Quote:

Originally Posted by Hitch (Post 4065485)
And there it is. :-) That stuff is just bloody tedious. I think it would be fun to write programming or clips, etc., to do it...but HAVING to do it, commercially, is the dog's south end.

Only had to fix a bajillion of those rotten indexes that someone else created (and that was bad enough!).

If it's a project I'm working on from scratch, I insist on unlinked indexes. :)

Quote:

Originally Posted by Ryn (Post 4065632)
Laura mentioned another script that might actually serve my purpose even better: LiveIndex, found here: https://www.id-extras.com/products/liveindex/

I'm mentioning it, just in case anyone else ever comes across a similar use case.

Nice. I'll add it to my list.

Quote:

Originally Posted by Ryn (Post 4065659)
This passive searching allows for a deeper sense of discovery in books that are more encyclopedic in scope.

Not relevant to the vast majority of books that reach our devices, I would be the first to agree, but in some cases, very much a desirable addition.

Yep. An Index also lets you find more "broad" concepts, not necessarily worded in the raw text itself.

Take this for example:

Code:

famous philosophers
        Aquinas, 10
        Aristotle, 1, 5, 60, 199
        Socrates, 20, 60

You might search for the word "philosopher", then have to sift through 100 (irrelevant) "philosopher" hits. And within the text, "Aquinas", "Aristotle", or "Socrates" might not appear near the word "philosopher" at all.

Search (in ebooks) also doesn't typically match related words like: "philosophy" or "philosophies" or "philosophical".

A good Indexer would be able to pre-categorize + organize the information, throwing out a lot of the "irrelevant hits", while at the same time combining all those "related words" together.

And as Hitch said, you could use the index to get a very broad overview of WHAT information is covered in a given book. Even the size of the entries can tell you how "important" an author thinks a topic is. For example, the author may consider Aristotle to be more important than Aquinas (4 vs. 1).
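As a quick illustration of both points, here is a small sketch using the sample index above. The function name `locator_counts` and the assumption that subentries are indented lines of the form "name, page, page, ..." are mine; real indexes are messier.

```python
index_text = """\
famous philosophers
        Aquinas, 10
        Aristotle, 1, 5, 60, 199
        Socrates, 20, 60
"""

def locator_counts(text):
    """Count the page references listed under each indented subentry."""
    counts = {}
    for line in text.splitlines():
        if not line.startswith((" ", "\t")):
            continue  # main heading: no locators of its own in this sample
        name, *locators = [part.strip() for part in line.split(",")]
        counts[name] = len(locators)
    return counts

counts = locator_counts(index_text)
print(counts)  # {'Aquinas': 1, 'Aristotle': 4, 'Socrates': 2}

# A literal search for "philosopher" only finds words containing that
# exact string, so the related forms are missed.
words = ["philosopher", "philosophy", "philosophies", "philosophical"]
hits = [w for w in words if "philosopher" in w]
print(hits)  # ['philosopher']
```

The counts are only a crude proxy for importance, but they are exactly the kind of overview you get from skimming the index by eye.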

Note: Me + Hitch (and others) discussed the pros/cons of Indexes/Search at extreme length in the 2016 "Sick of Amazon Kindle books without Page Numbers" thread.

Quote:

Originally Posted by DiapDealer (Post 4065785)
Quite debatable whether electronic search engine = print concordance. I would argue that the ability to search the entire text of a work has no print-based counterpart. But the ability to search a book's text certainly replaces the necessity for either a concordance or an index, in my opinion (and in my experience, since I've never once looked at, clicked on, or otherwise engaged with an electronic index or concordance). That's why searching is such a game changer in the P2E medium shift.

Around the time of that famous 2016 thread, I read "How to Read a Book" by Mortimer Adler.

Absolutely fantastic title. When I first heard of it, I thought:

"Who the heck doesn't know how to read a book?"

Well, I didn't know... I didn't know... :D And it completely changed the way I read Non-Fiction + view Indexes.

Here's one blog article also discussing the book:

"How to Read a Book: The Ultimate Guide by Mortimer Adler"

* * *

And here's a relevant excerpt of Chapter 4, "The Second Level of Reading: Inspectional Reading":

Spoiler:
Quote:

Inspectional Reading I: Systematic Skimming or Pre-reading

Let us return to the basic situation to which we have referred before. There is a book or other reading matter, and here is your mind. What is the first thing that you do?

[...] First, you do not know whether you want to read the book. You do not know whether it deserves an analytical reading. But you suspect that it does, or at least that it contains both information and insights that would be valuable to you if you could dig them out.

Second, let us assume - and this is very often the case - that you have only a limited time in which to find all this out.

In this case, what you must do is skim the book, or, as some prefer to say, pre-read it. Skimming or pre-reading is the first sublevel of inspectional reading. Your main aim is to discover whether the book requires a more careful reading. Secondly, skimming can tell you lots of other things about the book, even if you decide not to read it again with more care.

Giving a book this kind of quick once-over is a threshing process that helps you to separate the chaff from the real kernels of nourishment. You may discover that what you get from skimming is all the book is worth to you for the time being. It may never be worth more. But you will know at least what the author's main contention is, as well as what kind of book he has written, so the time you have spent looking through the book will not have been wasted.

[...]

2. STUDY THE TABLE OF CONTENTS to obtain a general sense of the book's structure; use it as you would a road map before taking a trip. It is astonishing how many people never even glance at a book's table of contents unless they wish to look something up in it. In fact, many authors spend a considerable amount of time in creating the table of contents, and it is sad to think their efforts are often wasted.

It used to be a common practice, especially in expository works, but sometimes even in novels and poems, to write very full tables of contents, with the chapters or parts broken down into many subtitles indicative of the topics covered. Milton, for example, wrote more or less lengthy headings, or "Arguments," as he called them, for each book of Paradise Lost. Gibbon published his Decline and Fall of the Roman Empire with an extensive analytical table of contents for each chapter. Such summaries are no longer common, although occasionally you do still come across an analytical table of contents. One reason for the decline of the practice may be that people are not so likely to read tables of contents as they once were. Also, publishers have come to feel that a less revealing table of contents is more seductive than a completely frank and open one. Readers, they feel, will be attracted to a book with more or less mysterious chapter titles - they will want to read the book to find out what the chapters are about. Even so, a table of contents can be valuable, and you should read it carefully before going on to the rest of the book.

[...]

3. CHECK THE INDEX if the book has one - most expository works do. Make a quick estimate of the range of topics covered and of the kinds of books and authors referred to. When you see terms listed that seem crucial, look up at least some of the passages cited. (We will have much more to say about crucial terms in Part Two. Here you must make your judgment of their importance on the basis of your general sense of the book, as obtained from steps 1 and 2.) The passages you read may contain the crux - the point on which the book hinges - or the new departure which is the key to the author's approach and attitude.

As in the case of the table of contents, you might at this point check the index of this book. You will recognize as crucial some terms that have already been discussed. Can you identify, for example, by the number of references under them, any others that also seem important?


Even just skimming an Index (or well-designed Table of Contents) can give you lots of helpful information.

This is why I mostly don't mind leaving unlinked indexes in ebooks (they don't hurt, and can only help, even in ways that pure search can't accomplish).


All times are GMT -4.