This is a key point. Physical Page Numbers are fully dependent on the book being in that exact format (that exact page size + page margins + fonts + font size + [...]). If you change one of those variables, all of the page numbers get thrown off.
This also brings up the problem of actual text cross-references used in physical books. Some might be in this form:
- "See Footnote 1 on page 252"
- An index might say "312n" (a footnote on page 312) or "312n5" (Footnote 5 on page 312)
This sort of text makes absolutely ZERO sense in an ebook.
More neutral text that makes sense in physical + ebook would be something like:
- "See Chapter 2, Footnote 20"
- "See Section 1.1"
- "See Footnote 2 in Section 1.1"
Or let us rant about one of my favorites... footnotes. In a physical book, footnotes may be numbered per page (so restarting from #1 each page). In a digital/other version, this numbering system becomes impossibly unwieldy. You might have links to 10 different "Footnote #1" in Chapter 2!
This requires the people typesetting/creating the physical book to be mindful of future/alternate formats. (Currently many publishers still just stick with "the physical page numbers of the hardcover, or the highway!") But hopefully more awareness of this problem at least shifts that mindset toward making the texts themselves more neutral/ebook friendly (such as numbering footnotes sequentially per chapter/book).
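To make the sequential numbering idea concrete, here is a minimal Python sketch. The function name and the sample chapter are made up for illustration, and it assumes you already have the footnotes of one chapter as (page, number-on-page) pairs in reading order:

```python
def renumber_per_chapter(footnotes):
    """Map (page, per-page number) -> sequential number within one chapter.

    `footnotes` is a list of (page, number_on_page) tuples in reading order.
    """
    mapping = {}
    for seq, key in enumerate(footnotes, start=1):
        mapping[key] = seq
    return mapping


# Hypothetical chapter: footnote numbering restarts at 1 on every page.
chapter_2 = [(250, 1), (250, 2), (251, 1), (252, 1), (252, 2)]
new_numbers = renumber_per_chapter(chapter_2)

for (page, old), new in new_numbers.items():
    print(f'"Footnote {old} on page {page}" -> "Chapter 2, Footnote {new}"')
```

The three different "Footnote 1" entries each get their own number, so a cross-reference like "See Chapter 2, Footnote 4" stays valid no matter how the text reflows.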
My gods... creating a proper index is EXPONENTIALLY more work than creating a simple dumb index (which already takes forever).
The "dumb index" (which points to right before the first word of that page) might land you a few ebook "page flips" away from the content. Depending on the density of the original physical pages, it could be ~400-800 words away.
As shalym mentioned, a more useful/thoroughly done "proper index" would point to the exact paragraph/sentence/word-level in which this reference occurred... but most people don't understand how... fracking... long... this... takes.
Creating an Index is so hard, and A HELL OF A LOT harder than it seems on the surface.
As an example: I am currently working on a "proper index" of a large non-fiction treatise (950 pages, ~400k words, Index: ~2.3k terms + ~5.1k links to page numbers). I already have the Index from the physical book (so "half the hard work is already done"). My current pace of converting this to a "proper index" is ~100 LINKS PER DAY. That means around 51 man-days of work (probably more).
Each and every link to a page number causes a cascade of extra work that you don't expect:
Easy Ones: "Apologists, 48". Ok, great, I reach page 48, and there is only 1 "apologists" on the entire page. Link the paragraph, problem solved!
These might take a few seconds to a minute.
Hard Ones: Hard ones are fracking HARD: "Ancestors, 3, 36, 145".
Great, I found the word "ancestors" on page 3, EASY. But wtf is this, I just read the entire page 36, and I don't see "ancestors" anywhere on it.
You (as the converter) must now read/skim the ~400-800 words that constitute "page 36" to find what the Indexer ACTUALLY meant.
You have to look for all the related words: "ancestry" + "ancestor" + "ancestral". Maybe it just has an important sentence/paragraph that talks about ancestors indirectly (maybe talking about older relatives, or ancient civilizations).
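The "related words" part can at least be semi-automated. A rough Python sketch, assuming you have the text of the physical page as a plain string (the sample text and the function name are invented for the example):

```python
import re


def find_related(page_text, stem):
    """Return every word on the page that starts with `stem` (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(page_text)]


# Invented stand-in for the text of "page 36".
page_36 = (
    "Their ancestral lands were lost long ago. "
    "Questions of ancestry mattered less than trade."
)
hits = find_related(page_36, "ancest")
print(hits)
```

This only catches the word forms. It does nothing for the indirect mentions (older relatives, ancient civilizations), which still need a human reading the page.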
Hard #2: "Keynes, John Maynard, 429, 464, 467, 468n, 546n, 737, 771, 785, 787, 846".
Keynes might be mentioned multiple times on a page. It just so happens, because of the way the physical book was laid out (page margins, font, [...]), that Keynes is mentioned in the first + last paragraphs on page 429, BUT the middle paragraphs don't talk about him at all.
Where do I link? Do I link to that first paragraph? Do I link to the last paragraph too?
Keynes may also be mentioned quite a few times throughout the book on other pages, but those are just unimportant/passing remarks. They don't belong in the Index. In my searching/jumping around page numbers though, I STILL come across "Keynes" a hundred times, and this takes time to sift through. (This is the problem with the Search/Concordance method + any sort of automated/semi-automated Indexing tools.)
Hard #3: As Hitch mentioned, the same topic might be under multiple Index entries. This requires you to look through the Index and make sure all of THOSE links are the same as well. You don't want "Irish Setters" + "Setters, Irish" + "Sporting Dogs -> Setters -> Irish Setters" to point to different locations. This means you have to thoroughly (and I mean FRACKING THOROUGHLY) look through the Index when you are trying to create these things.
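A crude Python sketch of a consistency check for this. Everything here is an assumption for illustration: the index is a dict of heading -> list of link targets, sub-entries are written with "->", inverted headings use a single comma, and the target IDs are invented:

```python
from collections import defaultdict


def normalize(entry):
    """Reduce an index heading to a comparable key (crude heuristic)."""
    leaf = entry.split("->")[-1].strip()  # keep the deepest sub-entry
    if "," in leaf:  # "Setters, Irish" -> "Irish Setters"
        main, _, qualifier = leaf.partition(",")
        leaf = f"{qualifier.strip()} {main.strip()}"
    return leaf.lower()


def find_conflicts(index):
    """Return normalized headings whose variants link to different targets."""
    targets = defaultdict(set)
    for entry, links in index.items():
        targets[normalize(entry)].add(tuple(links))
    return {key for key, variants in targets.items() if len(variants) > 1}


# Hypothetical index fragment with made-up link targets.
index = {
    "Irish Setters": ["ch4-p12"],
    "Setters, Irish": ["ch4-p12"],
    "Sporting Dogs -> Setters -> Irish Setters": ["ch4-p15"],
    "Apologists": ["ch1-p48"],
}
conflicts = find_conflicts(index)
print(conflicts)
```

It only flags entries that differ in word order or nesting. Synonyms and "see also" entries would still need the thorough read-through by hand.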
These hard ones take a minute+.
This book I am working on takes ~5 minutes on average per link (this takes into account double/triple-checking that the links are correct and a mistake was not made).
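For anyone who wants to check the math, here it is as a few lines of Python. The numbers are the ones quoted above; treating every link as taking the average time is a simplification:

```python
links = 5100          # ~5.1k page-number links in the Index
links_per_day = 100   # current pace
minutes_per_link = 5  # average, including double/triple-checking

days = links / links_per_day
hours = links * minutes_per_link / 60

print(f"{days:.0f} working days, {hours:.0f} hours total")
print(f"{hours / days:.1f} hours per day")
```

So ~100 links per day at ~5 minutes each is already a full working day of nothing but linking.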
And this "proper index" I am working on is already "simpler". I already HAVE an index with page numbers on it, and I know the subject matter deeply (economics). Doing this as a business (at an ebook conversion house) would be IMPOSSIBLY expensive.
This rant didn't even tackle the subject of digitizing page RANGES in ebooks such as: "bilateral exchange, 794–796." Which paragraph should this entry end on? Well, I have to read pages 794–796 to find out! So you just think "Hey, it is just two lousy links/pages, how long could that take?" MINUTES!!!
Long story short, Indexing is an art, and it is fracking HARD (and very specialized).
Don't get ahead of yourself! X characters of WHAT?
- Of HTML?
- What about whitespace? Should it change if I "prettify" the HTML file?
- Of displayed characters?
- What about hidden code/text that only shows in one edition but not the other?
- Alt Text, Fleurons, stuff that shows in MOBI but not in KF8.
- Math (SVG or images or MathML).
- How would you treat things like the NCX? (In EPUB, you don't NEED an HTML TOC)?
- In future formats, there may be other different/easier files that make it easy to remove certain material (copyright page, title pages, Indexes, etc. etc.)
- What if a future format has text generated on the fly?
- For example, maybe in the future you just feed it an ISBN/DOI and it will generate a citation in a given format for you.
- What about poems? (You know, a whole book of those twenty word poems.)
- All of a sudden people will complain about the "10 page book" (10 pages = ~5000 words = 250 poems * 20 words)!

- What about Front/Back Matter? (Typically Front Matter is in Roman Numeral numbering).
- What happens when you move the "Front Matter" to the back of the ebook (such as the TOC)?
- Should ebooks have similar alternate numbering?
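To show how slippery "X characters" is, here is a small Python sketch using only the standard library. The two HTML snippets are invented, and "visible text" here just means the text nodes, which is itself only one possible definition:

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes only; tags and attributes (alt text included) are skipped."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Same sentence, once compact and once "prettified".
compact = '<p>See <a href="#fn1">Footnote 1</a>.</p>'
pretty = '<p>\n  See <a href="#fn1">Footnote 1</a>.\n</p>'

counts = {
    "compact html": len(compact),
    "pretty html": len(pretty),
    "compact text": len(visible_text(compact)),
    "pretty text": len(visible_text(pretty)),
    "pretty text, whitespace collapsed": len(" ".join(visible_text(pretty).split())),
}
for label, n in counts.items():
    print(f"{label}: {n}")
```

The same sentence gives different counts depending on whether you count markup, raw text nodes, or whitespace-collapsed text, and that is before alt text, MathML, or generated content enter the picture.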
There are QUITE a few other citation styles:
https://en.wikipedia.org/wiki/Citation#Styles
While Harvard is one of the more popular ones, it depends mostly on which fields you are in. There is also ACS (typically used in Chemistry), AMS (Math), APA (Psychology), ASA (Sociology), Bluebook (Law), Chicago, IEEE (Engineering, Programming, Physics), MLA, Oxford, Turabian, Vancouver, [...].
Not necessarily... there IS a reason why they introduced rules on handling "websites" + "ebooks" + all the other digital resources besides "books". :P
If you point to a website, there is no fracking page number. I would say an ebook is much closer to all the other digital formats (websites) than to a physical book.
Side Note: Also, a much more intelligent solution for generating bibliographies is with a database of information which gets fed into a template (which outputs the specific Citation Style you are using). You feed the tool information such as (Author, Title, Year, Publisher, ISBN, [...]), you tell it what type it is (Book, Journal, Website, [...]), and the tool generates the proper format for you.
This is the purpose of things like BibLaTeX or using things such as Wikipedia Citation Templates:
https://en.wikipedia.org/wiki/Wikipe...tion_templates
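A toy Python sketch of the "database + template" idea. The record is made up, and the two templates are simplified stand-ins, NOT faithful renderings of any real citation style (the real tools handle far more fields and edge cases):

```python
# One made-up record; a real database would hold many of these.
RECORD = {
    "type": "book",
    "author": "Doe, Jane",
    "title": "An Example Treatise",
    "year": 2015,
    "publisher": "Example Press",
}

# Simplified stand-ins, NOT real style-guide rules.
TEMPLATES = {
    "author-date": "{author} ({year}). {title}. {publisher}.",
    "numeric": "[{n}] {author}, {title}. {publisher}, {year}.",
}


def cite(record, style, n=1):
    """Render one record in the chosen (hypothetical) style."""
    return TEMPLATES[style].format(n=n, **record)


print(cite(RECORD, "author-date"))
print(cite(RECORD, "numeric", n=3))
```

The data is entered once, and switching styles is just a matter of picking a different template.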
Even this I would say is changing.
Even in the academic world, more books are coming out in multiple forms:
- Print versions
- Hardcover + Paperback + Large Print
- Digital versions
- EPUB/MOBI
- HTML versions
- These potentially might have a lot more material than the print versions (video, audio, computer generated examples (think randomly generated math problems or graphs)).
- PDFs
- Most likely matches one of the Print Editions, but not necessarily (see HTML version above). Perhaps there might be more interaction in the PDF, or annotations, etc. etc.
Sure, you can have many who agree that "the hardcover print edition is where the page numbers come from". Sure, back in the stone ages, when you only had Print/Large Print or a Hardcover + Softcover, you could insist that the pages = the hardcover. But sticking to those physical page numbers makes absolutely NO SENSE when you have multiple vastly differing digital formats.
And that is just talking about books. You may have something like an article that is standalone (PDF), reprinted on a site (HTML), plus the same article reproduced in a journal (different page sizes, margins, fonts, double-column, etc. etc.). Which page numbers do you INSIST on shoving onto the HTML version, the standalone's page numbers? The journal's? Which journal (the most prestigious?)?
And then this doesn't touch the purely digital texts (never physically printed, such as many self-published books).