How to export indexes from indesign to epub - Page 2

Hitch · 12-04-2020, 02:49 PM

Quote:

Originally Posted by BeckyEbook

I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

http://epubsecrets.com/why-i-use-page-list-and-how.php
http://epubsecrets.com/page-list-all...e-doing-it.php

The link to the script is dead, so I'm listing it from web archive:
http://web.archive.org/web/201912181...orohikoscripts

You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or Linkedin), which AFAIK includes the PageStaker and EPUBOgrify script.
The latter is not so important anyway, because it is a simple change that can be done in Sigil.

See, there you go. I knew somebody would have some clips.

Hitch

phillipgessert · 12-04-2020, 04:04 PM

I have not tried this (and frankly even if it works it still sounds pretty miserable) but I wonder if you could work page-by-page unlocking whatever master page element includes the page number, and then use a plugin such as https://www.rorohiko.com/wordpress/i...ds/textstitch/ to auto-thread the page numbers into the document flow.

Hitch · 12-04-2020, 04:29 PM

Quote:

Originally Posted by phillipgessert

I have not tried this (and frankly even if it works it still sounds pretty miserable) but I wonder if you could work page-by-page unlocking whatever master page element includes the page number, and then use a plugin such as https://www.rorohiko.com/wordpress/i...ds/textstitch/ to auto-thread the page numbers into the document flow.

Oh, Phillip! Who knew you had such masochistic leanings? Quick--what's your safe word?

Hitch

KevinH · 12-04-2020, 04:39 PM

If you have a PDF of the printed version of the book, you should be able to print it to a PostScript file and use Python on that file to extract the page numbers and the first n and last n words on each page (where n is small, say 3), and save that info to a file. Then use sed or some other stream editor with that info to insert the markers you want in each HTML file.

Some custom programming in python might be needed but should be reusable for future projects.
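A minimal sketch of the bookkeeping described above, assuming the text of each page has already been extracted into a list of strings (one per page, in order) and that the printed page number is the last token on the page. Both assumptions, and the names `page_fingerprints` and `n`, are illustrative only and do not come from any tool mentioned in this thread.

```python
import re


def page_fingerprints(pages, n=3):
    """Return one record per page: its number plus the first and last n words.

    Assumes each string in `pages` holds the text of one printed page and that
    the page number, if present, is the final whitespace-separated token.
    """
    records = []
    for position, text in enumerate(pages, start=1):
        words = text.split()
        number = None
        # Treat a trailing all-digit token as the printed page number.
        if words and re.fullmatch(r"\d+", words[-1]):
            number = int(words.pop())
        records.append({
            "position": position,  # physical order in the PDF
            "page": number,        # printed number, or None if not found
            "first": words[:n],
            "last": words[-n:],
        })
    return records
```

The resulting list can be dumped to JSON or CSV and fed to whatever inserts the markers. Blank pages simply produce a record with no number and no words, which can be filled in afterwards.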

Tex2002ans · 12-04-2020, 05:39 PM

Quote:

Originally Posted by BeckyEbook

I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

Fantastic. Thanks for sharing.

Quote:

Originally Posted by Ryn

What I really wanted to know was: how do you export the page numbers in an indesign export to epub? It's not an option in the export dialogue box, nor is it something I can easily put together using indesign's byzantine search module.

Is there another way?

BeckyEbook's links would likely work as well. Those 2018 articles are probably better + more modern than the older articles I linked in my Post #4.

Quote:

Originally Posted by Hitch

You don't. There isn't any easy or magic or automatic way to export the RPNs (Real Page Numbers). You create them manually.

Yep, you would think it would be a checkbox in InDesign... especially with how much Adobe talks Accessibility.

Quote:

Originally Posted by Hitch

Knowing Tex, he has some mad coding that will do some of this more easily than I've described, but that's the fundamental process, right there.

I see you didn't read all the links in my earlier post #4!

(Typical Hitch, never reading anything I write!)

Anyway, it's not anything I've ever done in an actual EPUB, just theoretical musings. All the logic is sound though.

Luckily, I haven't had to do an Index in a very long time.

Ryn · 12-04-2020, 05:54 PM

Quote:

Originally Posted by BeckyEbook

I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

http://epubsecrets.com/why-i-use-page-list-and-how.php
http://epubsecrets.com/page-list-all...e-doing-it.php

The link to the script is dead, so I'm listing it from web archive:
http://web.archive.org/web/201912181...orohikoscripts

You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or Linkedin), which AFAIK includes the PageStaker and EPUBOgrify script.
The latter is not so important anyway, because it is a simple change that can be done in Sigil.

Thank you for digging up the web archive link for me. This might be just what I'm looking for.

Quote:

Originally Posted by KevinH

If you have a PDF of the printed version of the book, you should be able to print it to a PostScript file and use Python on that file to extract the page numbers and the first n and last n words on each page (where n is small, say 3), and save that info to a file. Then use sed or some other stream editor with that info to insert the markers you want in each HTML file.

Some custom programming in python might be needed but should be reusable for future projects.

This might also be something I'd consider doing, seeing how big the project is, and how much I loathe working from PDFs. By postscript file, do you mean a text file?

I can see some potential problems with this, as the page numbers are on the bottom of the pages, and some pages are empty, which may confuse the issue, but that might be something I could prompt for.

Quote:

Originally Posted by phillipgessert

I have not tried this (and frankly even if it works it still sounds pretty miserable) but I wonder if you could work page-by-page unlocking whatever master page element includes the page number, and then use a plugin such as https://www.rorohiko.com/wordpress/i...ds/textstitch/ to auto-thread the page numbers into the document flow.

Or I could try my hand at programming an indesign plugin for this express purpose. How hard could it be to get a script to recognize the page numbers, and to cross-reference the indexed page numbers to the first word on the relevant page? Famous last words, I'm sure...

--
Food for thought here folks, thanks a lot!

KevinH · 12-04-2020, 06:37 PM

Yes, a PostScript (.ps) file is typically generated by a PostScript printer driver (it is what is spooled and sent to a PostScript-compatible printer) and contains, as text, the commands that the printer uses to actually print. There are many open-source tools (Ghostscript, CUPS, pdf2ps on Linux) and commercial tools (Adobe Acrobat Pro) that can do this. A .ps file is very similar to a PDF file that has been decoded and decompressed.

Extracting the info you need from one should be relatively easy. As for missing page labels (or numbers) these can easily be filled in and interpolated correctly from known page labels or values using a simple spreadsheet or software as it is a one-to-one mapping.

If you can use pdf2ps on your machine it should be easy enough to look at the postscript file in a text editor and look for "showpage" and decide for yourself how hard it would be to extract just what you need.
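The interpolation of missing page numbers can be sketched in a few lines. This assumes the extraction step produced one entry per physical page, with `None` wherever no number was found (blank pages, chapter openers), and that numbering rises by one per physical page. The function name `fill_page_numbers` is made up for illustration, and Roman-numeral front matter is not handled.

```python
def fill_page_numbers(numbers):
    """Fill None gaps in a list of printed page numbers, one entry per physical page.

    Assumes numbering rises by one per physical page, so a missing value is
    the nearest known value plus the distance to it.
    """
    filled = list(numbers)
    known = [(i, v) for i, v in enumerate(numbers) if v is not None]
    if not known:
        raise ValueError("need at least one known page number")
    for i, value in enumerate(filled):
        if value is None:
            # Pick the closest known page and offset from it.
            anchor_index, anchor_value = min(known, key=lambda kv: abs(kv[0] - i))
            filled[i] = anchor_value + (i - anchor_index)
    return filled
```

For example, `fill_page_numbers([1, None, 3, None, None, 6])` fills the three gaps with 2, 4 and 5.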

Hitch · 12-04-2020, 07:25 PM

Quote:

Originally Posted by Tex2002ans

I see you didn't read all the links in my earlier post #4!

(Typical Hitch, never reading anything I write!)

Oh, you are being a [something or other]. You know I do, in fact, read all your stuff. (Go ahead, I defy you to find other human beings that do!). I just don't always remember every single thing. And...well, yes, I may on occasion skim. It's not like you write 300-word Blogger posts, now, is it. You're like the Anti-Twitter. You're the one guy I can always count on to make me look terse, my brother.

Anyway, it's not anything I've ever done in an actual EPUB, just theoretical musings. All the logic is sound though.


And there it is. :-) That stuff is just bloody tedious. I think it would be fun to write programming or clips, etc., to do it...but HAVING to do it, commercially, is the dog's south end.

Quote:

Luckily, I haven't had to do an Index in a very long time.

You ARE lucky!

Hitch

Ryn · 12-05-2020, 03:54 AM

Quote:

Originally Posted by KevinH

Yes, a PostScript (.ps) file is typically generated by a PostScript printer driver (it is what is spooled and sent to a PostScript-compatible printer) and contains, as text, the commands that the printer uses to actually print. There are many open-source tools (Ghostscript, CUPS, pdf2ps on Linux) and commercial tools (Adobe Acrobat Pro) that can do this. A .ps file is very similar to a PDF file that has been decoded and decompressed.

Extracting the info you need from one should be relatively easy. As for missing page labels (or numbers) these can easily be filled in and interpolated correctly from known page labels or values using a simple spreadsheet or software as it is a one-to-one mapping.

If you can use pdf2ps on your machine it should be easy enough to look at the postscript file in a text editor and look for "showpage" and decide for yourself how hard it would be to extract just what you need.

I have looked at this, but both pdf2ps and acrobat's postscript output are heavy on code, and feature encrypted text, rendering those avenues relatively useless.

Acrobat also exports to txt, rtf, doc, docx etc, so I could imagine writing a python script that analyses such a file.

That might give me the strings I could use to iterate through an epub html file and add anchor tags, that I could then link the index entries to. I'd need to account for whitespace, the existence of potential html tags, and other things, probably, but this seems relatively straightforward.
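A rough sketch of that matching step, assuming the first few words of each printed page are already known and that the EPUB text may have extra whitespace or inline tags (`<i>`, `</span>` and so on) between those words. The function name and the `page_N` id scheme are hypothetical. It does not handle HTML entities, words split by a tag mid-word, or partial-word matches.

```python
import re

# Separator between words: whitespace and/or inline tags such as <i> or </span>.
GAP = r"(?:\s|<[^>]+>)+"


def insert_page_anchor(html, first_words, page, start=0):
    """Insert an empty anchor before the first occurrence of `first_words`.

    Searching begins at offset `start`. Returns (new_html, position_after_anchor),
    or (html, None) if the words are not found.
    """
    pattern = re.compile(GAP.join(re.escape(w) for w in first_words))
    match = pattern.search(html, start)
    if match is None:
        return html, None
    anchor = f'<a id="page_{page}"></a>'
    new_html = html[:match.start()] + anchor + html[match.start():]
    return new_html, match.start() + len(anchor)
```

Passing the returned position back in as `start` for the next page keeps the search moving forward through the file, which reduces false matches on phrases that recur.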

With some sophistication - as the indices feature page ranges and note numbers, too - I might be able to automate the whole thing.

Seeing as there are thousands of pages - and thousands of index entries per volume - here, it's definitely worth a try.
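Handling page ranges and note numbers mostly comes down to parsing the locator part of each index entry. The sketch below assumes locators are comma-separated and look like `12`, `34-37` (hyphen or en dash) or `56n3` for note 3 on page 56. That notation is a guess, not taken from the actual books, and abbreviated ranges such as `134-7` are not expanded.

```python
import re

# One locator: a page, an optional range end, and an optional note number.
LOCATOR = re.compile(r"(\d+)(?:\s*[-\u2013]\s*(\d+))?(?:n(\d+))?")


def parse_locators(text):
    """Parse a locator string into (first_page, last_page, note) tuples."""
    results = []
    for part in text.split(","):
        match = LOCATOR.fullmatch(part.strip())
        if match is None:
            continue  # leave anything unrecognized for manual review
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        note = int(match.group(3)) if match.group(3) else None
        results.append((first, last, note))
    return results
```

Anything that does not fit the pattern (cross-references such as "see also") is skipped, so it can be collected and checked by hand.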

edit: oh, no, scratch that. There are footnotes, a lot of them, clouding the issue in the PDF2xxx output, which I need to disregard without losing the numbered lists. Also, the page numbers are in the footers. And of course there are also headers, which I should also disregard. At this point, perhaps I am better off just working from the PDF in the first place, which is not so bad all things considered.

Doitsu · 12-05-2020, 06:16 AM

@Ryn: Do you know that Sigil has a built-in index tool? To use it, all you have to do is generate a plain text file with index entries, e.g. index.txt, and select the following:

Tool > Index > Index Editor...
Right-click > Open > index.txt > Save
Tool > Index > Create Index

Obviously the index entries won't have page numbers, but you might be able to add them later with a custom Python script or a plugin.

Ryn · 12-05-2020, 07:54 AM

Quote:

Originally Posted by Doitsu

@Ryn: Do you know that Sigil has a built-in index tool? To use it, all you have to do is generate a plain text file with index entries, e.g. index.txt, and select the following:

Tool > Index > Index Editor...
Right-click > Open > index.txt > Save
Tool > Index > Create Index

Obviously the index entries won't have page numbers, but you might be able to add them later with a custom Python script or a plugin.

Hi there Doitsu, yeah I did know that. And I have considered using it.

Thing is, these are custom-built indexes with thousands of entries per volume. I doubt I could do remotely as good a job as the original indexers, who likely spent upward of forty hours on each index.

Needless to say, these books were not created to make any profit whatsoever, and that also goes for the digital edition we're currently putting together. The foundation which has enlisted my help has as a core value the dissemination of these texts, and keeping them safely available for future generations.

I generally dissuade clients from including indexes, but in this case I am willing to make an exception. And I personally resonate with the subject, so my participation is not a chore at all.

That being said, I dislike unnecessary monotonous labor as much as most people, if not more, so being smart about it and using tech to my advantage, I'm all for that!

Ryn · 12-05-2020, 08:26 AM

Quote:

Originally Posted by BeckyEbook

I'm not sure if this is exactly what you need, but I will post a few links that may lead you to come up with your own solution.

http://epubsecrets.com/why-i-use-page-list-and-how.php
http://epubsecrets.com/page-list-all...e-doing-it.php

The link to the script is dead, so I'm listing it from web archive:
http://web.archive.org/web/201912181...orohikoscripts

You can write directly to Laura, but I have a feeling you'd better check out the "EPUB Accessibility Using InDesign" video tutorial (available from Lynda.com or Linkedin), which AFAIK includes the PageStaker and EPUBOgrify script.
The latter is not so important anyway, because it is a simple change that can be done in Sigil.

This turns out to work like a charm, even when exporting to ePub 2, which I feared might be a problem.

Thanks Becky!

Now I just need to write me some python logic to rid myself of the task of manually linking the index to the pages. But that's the fun part.

Laura mentioned another script that might actually serve my purpose even better: LiveIndex, found here: https://www.id-extras.com/products/liveindex/

I'm mentioning it, just in case anyone else ever comes across a similar use case.
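For the linking step, a minimal sketch, assuming the index entries are available as plain text and that a mapping from printed page number to target (file name plus anchor id) has already been built, for instance from the exported page-break markers. The function name, the mapping, and the `page_N` ids in the example are all hypothetical.

```python
import re


def link_page_numbers(entry_text, targets):
    """Wrap each page number in `entry_text` that has a known target in a link.

    `targets` maps printed page numbers (int) to hrefs. Assumes `entry_text`
    is plain text, so every run of digits is treated as a page number.
    """
    def replace(match):
        href = targets.get(int(match.group()))
        if href is None:
            return match.group()  # no anchor for this page; leave as plain text
        return f'<a href="{href}">{match.group()}</a>'

    return re.sub(r"\d+", replace, entry_text)
```

Called as `link_page_numbers("angular momentum, 12, 34-37", {12: "ch1.xhtml#page_12"})`, only the `12` becomes a link. Digits inside a heading (a year, say) would be linked too if they happen to match a known page, so the output still needs a spot check.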

KevinH · 12-05-2020, 09:56 AM

I see you found your solution but for the record there are ps2txt and ps2ascii you can use to display these as well as this useful article I used in the past:

https://www.cs.waikato.ac.nz/~ihw/pa...tract-Text.pdf

It prepends a short and sweet extra postscript function to the original postscript which redefines the show methods to give you text output that would be easier to parse.

FWIW, I find that working with a "ps printer device" can extract text electronically that is very hard to get at in other ways without a scanner.

KevinH · 12-05-2020, 10:01 AM

And you need to worry that you are linking to the top of a "printed" page and not to the word itself, which may appear nowhere on the actual screen, as that "printed page" will generally be much longer than the screen holds. So it will just get them close at best.

In many ways, a good search function replaces the need for indexes almost completely.

Quote:

Originally Posted by Ryn

This turns out to work like a charm, even when exporting to ePub 2, which I feared might be a problem.

Thanks Becky!

Now I just need to write me some python logic to rid myself of the task of manually linking the index to the pages. But that's the fun part.

Ryn · 12-05-2020, 10:12 AM

Quote:

Originally Posted by KevinH

And you need to worry that you are linking to the top of a "printed" page and not to the word itself, which may appear nowhere on the actual screen, as that "printed page" will generally be much longer than the screen holds. So it will just get them close at best.

In many ways, a good search function replaces the need for indexes almost completely.

Well, yes and no. Of course, I always use the exact same argument when attempting to dissuade clients from insisting on index inclusion.

But... searching is active: it presupposes you know exactly what you are searching for, whereas you might not always know what you don't know.

An Index, otoh, has done this work for you, and then some. A good index will have collated different locations pertaining to the way "angular momentum" pertains to "diesel engines," for example. (Not even sure that that is a thing, but allow me the liberty.)

This passive searching allows for a deeper sense of discovery in books that are more encyclopedic in scope.

Not relevant to the vast majority of books that reach our devices, I would be the first to agree, but in some cases, very much a desirable addition.


