#16 | |
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Hitch |
|
#17 |
Addict
Posts: 316
Karma: 3200000
Join Date: Oct 2015
Location: Madison, WI
Device: Kindle 5th Gen
|
I have not tried this (and frankly, even if it works, it still sounds pretty miserable), but I wonder if you could work page by page, unlocking whatever master page element includes the page number, and then use a plugin such as https://www.rorohiko.com/wordpress/i...ds/textstitch/ to auto-thread the page numbers into the document flow.
|
#18 | |
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Hitch |
|
#19 |
Sigil Developer
Posts: 8,764
Karma: 6000000
Join Date: Nov 2009
Device: many
|
If you have a PDF of the printed version of the book, you should be able to print it to a PostScript file and use Python on that file to extract the page number plus the first n and last n words on each page (where n is small, say 3), and save that info to a file. Then use sed or some other stream editor with that info to insert the markers you want in each HTML file.
Some custom programming in Python might be needed, but it should be reusable for future projects. |
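A minimal Python sketch of the marker-extraction step suggested above. It assumes the text of each printed page has already been pulled out of the PostScript file into a list of (page label, page text) pairs; that input format and the function name `page_markers` are illustrative assumptions, not part of any existing tool.

```python
def page_markers(pages, n=3):
    """Return (label, first_n_words, last_n_words) for every page.

    `pages` is a list of (label, text) pairs. How the text is obtained
    (PostScript, PDF export, ...) is outside the scope of this sketch.
    """
    markers = []
    for label, text in pages:
        words = text.split()
        if not words:
            # Blank page: keep the label, but there is nothing to match on.
            markers.append((label, [], []))
            continue
        markers.append((label, words[:n], words[-n:]))
    return markers


if __name__ == "__main__":
    # Made-up sample pages, including an empty one.
    sample = [
        ("1", "The first page starts here and ends over there"),
        ("2", ""),
        ("3", "Another page with a handful of words"),
    ]
    for label, first, last in page_markers(sample):
        print(label, first, last)
```

The first-words list is what you would search for in the HTML to place the page marker; the last-words list is a sanity check that the previous page ends where you expect.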
#20 | ||||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
Quote:
Quote:
(Typical Hitch, never reading anything I write!)
Anyway, it's not anything I've ever done in an actual EPUB, just theoretical musings. All the logic is sound though.
Luckily, I haven't had to do an Index in a very long time. |
||||
#21 | |||
Connoisseur
Posts: 55
Karma: 10
Join Date: Feb 2012
Device: none
|
Quote:
Quote:
I can see some potential problems with this: the page numbers are at the bottom of the pages, and some pages are empty, which may confuse the issue, but that might be something I could prompt for.
Quote:
-- Food for thought here folks, thanks a lot! |
|||
#22 |
Sigil Developer
Posts: 8,764
Karma: 6000000
Join Date: Nov 2009
Device: many
|
Yes, a PostScript (.ps) file is typically generated by a PostScript printer driver (it is what is spooled and sent to a PostScript-compatible printer) and contains, as text, the commands the printer uses to actually print. There are many open-source tools (Ghostscript, CUPS, pdf2ps on Linux) and commercial ones (Adobe Acrobat Pro) that can do this. A ps file is very similar to a PDF file that has been decoded and uncompressed.
Extracting the info you need from one should be relatively easy. As for missing page labels (or numbers), these can easily be filled in and interpolated correctly from known page labels or values using a simple spreadsheet or software, as it is a one-to-one mapping. If you can use pdf2ps on your machine, it should be easy enough to look at the PostScript file in a text editor, search for "showpage", and decide for yourself how hard it would be to extract just what you need.
Last edited by KevinH; 12-04-2020 at 07:12 PM. |
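The interpolation of missing page numbers mentioned above could be sketched in Python like this. It assumes one list entry per physical page (for instance, one per "showpage"), with `None` where no number was found, and that numbering goes up by exactly one per physical page; roman-numeral front matter or skipped numbers would need extra handling. The function name `fill_page_numbers` is made up for illustration.

```python
def fill_page_numbers(found):
    """Fill the None gaps in a list of page numbers.

    `found` has one entry per physical page. A missing value is derived
    from the nearest known page number, offset by the distance to it.
    """
    known = [(i, v) for i, v in enumerate(found) if v is not None]
    if not known:
        raise ValueError("no known page numbers to interpolate from")
    filled = list(found)
    for i, value in enumerate(filled):
        if value is None:
            # Nearest page (by position) that does have a number.
            j, ref = min(known, key=lambda pair: abs(pair[0] - i))
            filled[i] = ref + (i - j)
    return filled


if __name__ == "__main__":
    # Blank pages and a missing footer show up as None.
    print(fill_page_numbers([None, 2, None, 4, None, None, 7]))
```

If two neighbouring known numbers disagree about a gap, the numbering is not contiguous at that point and is worth checking by hand.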
#23 | ||
Bookmaker & Cat Slave
Posts: 11,503
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
Anyway, it's not anything I've ever done in an actual EPUB, just theoretical musings. All the logic is sound though.
And there it is. :-) That stuff is just bloody tedious. I think it would be fun to write programming or clips, etc., to do it... but HAVING to do it, commercially, is the dog's south end.
Quote:
Hitch |
||
#24 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Feb 2012
Device: none
|
Quote:
Acrobat also exports to txt, rtf, doc, docx, etc., so I could imagine writing a Python script that analyses such a file. That might give me the strings I could use to iterate through an EPUB HTML file and add anchor tags, which I could then link the index entries to. I'd need to account for whitespace, the existence of potential HTML tags, and probably other things, but this seems relatively straightforward. With some sophistication (the indices feature page ranges and note numbers, too) I might be able to automate the whole thing. Seeing as there are thousands of pages, and thousands of index entries per volume, it's definitely worth a try.
edit: oh, no, scratch that. There are footnotes, a lot of them, clouding the issue in the PDF2xxx output, which I need to disregard without losing the numbered lists. The page numbers are in the footers, too, and of course there are also headers, which I should also disregard. At this point, perhaps I am better off just working from the PDF in the first place, which is not so bad, all things considered.
Last edited by Ryn; 12-05-2020 at 04:07 AM. Reason: new sh*t has come to light; she kidnapped herself, man |
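The matching step described in this post (finding a few words in the EPUB HTML while tolerating whitespace and stray tags) might look roughly like the Python sketch below. The function names and the `page_N` id scheme are hypothetical, and the regex works on raw HTML, so it does not handle entities, hyphenation across pages, or words that happen to occur inside tag attributes.

```python
import re


def phrase_pattern(words):
    """Regex matching `words` in order, allowing any mix of whitespace
    and HTML tags between consecutive words."""
    gap = r"(?:\s|<[^>]+>)+"
    return re.compile(gap.join(re.escape(w) for w in words))


def insert_page_anchor(html, words, page_label):
    """Insert an empty anchor just before the first match of `words`.

    Returns the HTML unchanged if the phrase is not found.
    """
    match = phrase_pattern(words).search(html)
    if match is None:
        return html
    anchor = f'<a id="page_{page_label}"></a>'
    return html[:match.start()] + anchor + html[match.start():]


if __name__ == "__main__":
    html = "<p>the quick <i>brown</i>\n fox jumps</p>"
    print(insert_page_anchor(html, ["quick", "brown", "fox"], "12"))
```

An index entry for page 12 could then link to `#page_12` in whichever file received the anchor. Only the first match is used, so a short phrase that repeats earlier in the file would land in the wrong place; longer phrases reduce that risk.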
|
#25 |
Grand Sorcerer
Posts: 5,727
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
@Ryn: Do you know that Sigil has a built-in index tool? To use it, all you have to do is generate a plain text file with index entries, e.g. index.txt, and select the following:
|
#26 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Feb 2012
Device: none
|
Quote:
Thing is, these are custom-built indexes with thousands of entries per volume. I doubt I could do remotely as good a job as the original indexers, who likely spent upward of forty hours on each index. Needless to say, these books were not created to make any profit whatsoever, and that also goes for the digital edition we're currently putting together. The foundation which has enlisted my help has as a core value the dissemination of these texts, and keeping them safely available for future generations. I generally dissuade clients from including indexes, but in this case I am willing to make an exception. And I personally resonate with the subject, so my participation is not a chore at all. That being said, I dislike unnecessary monotonous labor as much as most people, if not more, so being smart about it and using tech to my advantage, I'm all for that! |
|
#27 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Feb 2012
Device: none
|
Quote:
Thanks Becky! Now I just need to write me some Python logic to rid myself of the task of manually linking the index to the pages. But that's the fun part.
Laura mentioned another script that might actually serve my purpose even better: LiveIndex, found here: https://www.id-extras.com/products/liveindex/
I'm mentioning it just in case anyone else ever comes across a similar use case.
Last edited by Ryn; 12-05-2020 at 10:04 AM. Reason: LiveIndex script mention |
|
#28 |
Sigil Developer
Posts: 8,764
Karma: 6000000
Join Date: Nov 2009
Device: many
|
I see you found your solution, but for the record, there are ps2txt and ps2ascii, which you can use to display these, as well as this useful article I have used in the past:
https://www.cs.waikato.ac.nz/~ihw/pa...tract-Text.pdf
It prepends a short and sweet extra PostScript function to the original PostScript, which redefines the show methods to give you text output that is easier to parse. FWIW, I find that working with a "ps printer device" can extract text electronically that is very hard to get at in other ways without a scanner. |
#29 |
Sigil Developer
Posts: 8,764
Karma: 6000000
Join Date: Nov 2009
Device: many
|
You also need to keep in mind that you are linking to the top of a "printed" page and not to the word itself, which may appear nowhere on the actual screen, as that "printed page" will generally be much longer than what the screen holds. So it will just get readers close at best.
In many ways, a good search function replaces the need for indexes almost completely. |
#30 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Feb 2012
Device: none
|
Quote:
But... searching is active: it presupposes you know exactly what you are searching for, whereas you might not always know what you don't know. An index, otoh, has done this work for you, and then some. A good index will have collated the different locations pertaining to the way "angular momentum" relates to "diesel engines," for example. (Not even sure that that is a thing, but allow me the liberty.) This passive searching allows for a deeper sense of discovery in books that are more encyclopedic in scope. Not relevant to the vast majority of books that reach our devices, I would be the first to agree, but in some cases very much a desirable addition. |
|