12-05-2020, 03:54 AM   #24
Ryn
Quote:
Originally Posted by KevinH
Yes, a PostScript (.ps) file is typically generated by a PostScript printer driver (it is what is spooled and sent to a PostScript-compatible printer) and contains, as text, the commands the printer uses to actually print. Many open-source tools (Ghostscript, CUPS, pdf2ps on Linux) and commercial ones (Adobe Acrobat Pro) can do this. A .ps file is very similar to a PDF file that has been decoded and uncompressed.

Extracting the info you need from one should be relatively easy. As for missing page labels (or numbers) these can easily be filled in and interpolated correctly from known page labels or values using a simple spreadsheet or software as it is a one-to-one mapping.

If you can use pdf2ps on your machine, it should be easy enough to open the PostScript file in a text editor, look for "showpage", and decide for yourself how hard it would be to extract just what you need.

I have looked at this, but the PostScript output from both pdf2ps and Acrobat is heavy on code and features encrypted text, which makes those avenues relatively useless.

Acrobat also exports to txt, rtf, doc, docx, etc., so I could imagine writing a Python script that analyses such a file.

That might give me the strings I could use to search through an EPUB HTML file and add anchor tags that I could then link the index entries to. I'd need to account for whitespace, intervening HTML tags, and probably other things, but this seems relatively straightforward.
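
A minimal sketch of that anchor-insertion step, assuming the strings pulled from the export are short runs of words that mark where a print page begins. The function name `add_anchor` and the `id` value are made up for illustration. It tolerates whitespace differences and inline tags between words, but not tags inside a word, HTML entities, or accidental matches inside attribute values.

```python
import re

# Gap allowed between two words: any mix of whitespace and tags, e.g. "</i>\n <b>".
_GAP = r"(?:\s|<[^>]+>)+"


def add_anchor(html, phrase, anchor_id):
    """Insert an empty anchor before the first occurrence of `phrase`.

    Words of `phrase` may be separated in `html` by whitespace and/or tags.
    Returns `html` unchanged if the phrase is empty or not found.
    """
    words = phrase.split()
    if not words:
        return html
    # Escape each word so punctuation in the text is matched literally.
    pattern = _GAP.join(re.escape(word) for word in words)
    match = re.search(pattern, html)
    if match is None:
        return html
    anchor = f'<a id="{anchor_id}"></a>'
    return html[:match.start()] + anchor + html[match.start():]


if __name__ == "__main__":
    sample = "<p>The <i>quick</i>\n brown fox</p>"
    print(add_anchor(sample, "quick brown", "page12"))
```

The index entries could then link to `#page12`. Searching only the first occurrence is a simplification; with thousands of pages, one would probably search forward from the previous page's anchor instead.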

With some sophistication - as the indices feature page ranges and note numbers, too - I might be able to automate the whole thing.
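
As a rough illustration of the page-range and note-number handling, here is a sketch of a locator parser. The notation is an assumption, not taken from the actual indices: comma-separated locators such as `12`, `34-36` (hyphen or en dash) and `101n4` for note 4 on page 101. The names `Locator` and `parse_locators` are hypothetical, and abbreviated ranges like `123-5` are not expanded.

```python
import re
from typing import NamedTuple, Optional


class Locator(NamedTuple):
    first_page: int
    last_page: int
    note: Optional[int]  # None when the locator has no note number


# Assumed notation: page, optional "-page" range, optional "n<number>" note.
_LOCATOR = re.compile(r"^(\d+)(?:[-\u2013](\d+))?(?:n\.?\s*(\d+))?$")


def parse_locators(text):
    """Parse a comma-separated locator string into a list of Locator tuples."""
    locators = []
    for part in text.split(","):
        part = part.strip()
        match = _LOCATOR.match(part)
        if match is None:
            # Fail loudly so that no index entry is dropped silently.
            raise ValueError(f"unrecognised locator: {part!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        note = int(match.group(3)) if match.group(3) else None
        locators.append(Locator(first, last, note))
    return locators


if __name__ == "__main__":
    print(parse_locators("12, 34-36, 101n4"))
```

Each parsed locator would map to one link target: the anchor for `first_page`, or a note anchor when `note` is set.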

Seeing as there are thousands of pages here, and thousands of index entries per volume, it's definitely worth a try.

edit: oh, no, scratch that. There are a lot of footnotes clouding the issue in the PDF2xxx output, which I need to disregard without losing the numbered lists. Also, the page numbers are in the footers. And of course there are headers, which I should disregard as well. At this point, perhaps I am better off just working from the PDF in the first place, which is not so bad, all things considered.

Last edited by Ryn; 12-05-2020 at 04:07 AM. Reason: new sh*t has come to light; she kidnapped herself, man