MobileRead Forums - View Single Post

Tex2002ans · 11-14-2019, 10:59 PM

Quote:

Originally Posted by jhowell

Kfxgen restricts page labels to be either integers or roman numerals. Anything else in the source EPUB will be rejected. Even a single incorrect entry will cause the entire pagemap/pagelist to be rejected resulting in no page numbers being shown at all.

Hmmm. Now this is something I've never even thought of before:

How would digitization of newspapers/magazines/journals occur?

Think old school "Continued on A5", or even more egregious, where articles were split in halves/thirds and located on completely incongruent page numbers:

Article 1:

- A1
- A6
- A9

Article 2:

- A1
- A6-A8

Also, is it even valid to have two of the same exact <a id="page"> in a pagemap/pagelist? (I doubt it.)

The only time I ever worked on something like this was a journal that ran over multiple decades... but that still had pages numbered 1-### (but did have articles split across incongruent pages).
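To make the "integers or roman numerals only" rule concrete, here is a rough Python sketch. The function names and the regex are my own assumptions for illustration, not Kfxgen's actual code. It only shows why a single "A5"-style label would sink the whole page list, and how repeated labels could be spotted.

```python
import re
from collections import Counter

# Assumption: "roman numeral" means a well-formed numeral up to 3999, either case.
ROMAN = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)

def is_allowed_label(label: str) -> bool:
    """True if the label is a plain integer or a roman numeral."""
    if label.isascii() and label.isdigit():
        return True
    # The regex also matches the empty string, so reject that explicitly.
    return bool(label) and ROMAN.match(label) is not None

def check_page_list(labels):
    """Return (accepted, bad_labels, duplicate_labels) for a list of page labels."""
    bad = [lab for lab in labels if not is_allowed_label(lab)]
    dupes = [lab for lab, n in Counter(labels).items() if n > 1]
    # One bad label is enough to reject the whole list.
    return (not bad, bad, dupes)

print(check_page_list(["i", "ii", "1", "2", "3"]))
print(check_page_list(["A1", "A6", "A9", "A1", "A6"]))
```

The newspaper-style list fails on every label, and the duplicate check flags A1 and A6 as well.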

Quote:

Originally Posted by MGlitch

Fractional page numbering, as describe in the OP, ties a physical page of a book to a "page" of an ebook, with the exception apparently of KFX8.

Hmmm... although I'm still not seeing the compelling argument for these in-between fractions.

It's just adding yet another arbitrary point + it's still going to have all the same issues as non-fractional pages (if font size gets big enough and/or device small enough, even these "fractional pages" won't change).

I can see the argument for non-integer and non-roman-numeral pages... but to add decimals "just because"? It's not selling me...

Quote:

Originally Posted by leebase

I disagree. It's the lifetime of the numbers in paper books that my mind understands when I hear a book has "600 pages".

My ebook can have 600 pages or 1800 pages depending on what I set the font size too....and change again if I switch from my iPhone to my iPad.

Density per page in physical books varies WILDLY. It depends completely on font/page size, margins, etc. (A typical book can run anywhere from 150-600+ words per page. If you push columns, density can even go 800+.)

I agree with JSWolf on one thing: that's a huge advantage of Byte-Methods (especially ADE-type algorithms that compress) + raw Word Counts.

Side Note on Terminology: I would recommend being more specific with your wording:

Pages = physical book pages
Screens = "pages" on a specific device's monitor

I discussed this in-depth in the "Real Page Numbers" thread linked above. People interchangeably use "pages" to mean 3+ different things at the same time. (It's why I think Amazon's "Locations" is a better term, and ADE's "Pages" was a poor choice.)

Side Note On Physical Density:

Here's a 1000 page book I've been retypesetting. This very minor tweak in density can change it to 940 pages:

[Image: Page.Density[Orig].png]

[Image: Page.Density[MoreDense].png]

Even when working with the same exact text... a typesetter can aim towards X amount of pages, then tweak all the variables to force that goal in mind:

Bigger gap between page number + bottom of text
Larger gap between header + top of text
expand margins
More hyphenation
Microtypography (Stretching/Shrinking characters)
[...]

Minor Margins Side Note: Back in 2014 I showed how a teeny change in footnote margins from 1.2em to 1.1em could cause two words—"is of"—to be pulled to the previous line and cause a cascading effect throughout a chapter:

https://www.mobileread.com/forums/sh...94#post2976294

Quote:

Originally Posted by leebase

FYI - the "switching to word counts" was my solution to MY use case: how big is this book?

For many cases, I would say "word count" accuracy >> pages...

It's why most of the (English) publishing/editing industry settled on standard manuscript pages = "250/300 words per page".
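As a quick worked example of that arithmetic, here is a minimal Python sketch. The 90,000-word book is a made-up figure, and the function name is hypothetical. The densities are the 250/300 manuscript standard plus the 150 and 600 words-per-page extremes mentioned earlier.

```python
import math

def pages_for(word_count: int, words_per_page: int) -> int:
    """Round up: a partly filled last page still counts as a page."""
    return math.ceil(word_count / words_per_page)

words = 90_000  # hypothetical book length, not a real title

for wpp in (150, 250, 300, 600):
    print(f"{wpp} words/page -> {pages_for(words, wpp)} pages")
```

The same text lands anywhere between 150 and 600 pages depending on the density you assume, which is why the word count is the steadier number.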

But again, know that each method has its own serious flaws.

"Word Count" might be a-okay for (English) Fiction, but in Non-Fiction, so many more edge cases creep in...

URLs

Becoming more and more and more relevant since the internet:

Code:

http://www.example.com/123.web/article12345.html

1 word? 8 words?

Code:

<a href="http://www.example.com/123.web/article12345.html">Article Title</a>

2 words? 10 words?

In Print books, URLs need to be completely typed out, but in ebooks, these can be hidden in a clickable URL.

(I tested URLs in Word, as long as there's not a space, it considers the entire "http://[...].html" 1 word.)
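Here's a small Python sketch of where those competing numbers can come from. The two counting rules and the class name are my own assumptions, not what Word or Sigil actually do internally: one rule splits on whitespace only, the other treats every run of letters/digits as a word.

```python
import re
from html.parser import HTMLParser

URL = "http://www.example.com/123.web/article12345.html"
LINK = '<a href="' + URL + '">Article Title</a>'

def whitespace_words(text):
    """Count space-separated chunks (a URL without spaces is one chunk)."""
    return len(text.split())

def alnum_words(text):
    """Count every run of letters/digits as its own word."""
    return len(re.findall(r"[A-Za-z0-9]+", text))

class LinkCounter(HTMLParser):
    """Collects the visible text and any href values separately."""
    def __init__(self):
        super().__init__()
        self.visible = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        self.hrefs += [value for name, value in attrs if name == "href" and value]

    def handle_data(self, data):
        self.visible.append(data)

parser = LinkCounter()
parser.feed(LINK)
parser.close()

visible = whitespace_words(" ".join(parser.visible))
hidden = sum(alnum_words(h) for h in parser.hrefs)

print(whitespace_words(URL), alnum_words(URL))  # 1 8
print(visible, visible + hidden)                # 2 10
```

Neither rule is "right"; the point is only that the same link can honestly be reported as 1, 2, 8, or 10 words.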

One of the latest Non-Fiction books I worked on had 1000+ footnotes with 1.5k+ URLs. How you answer this can vastly change the "Word Count" of the book:

Here's that book's "Word Count":

123,275 (Sigil)
125,959 (Word 2016)

If you were counting Physical Pages? Well, let's just say URLs take up a huge amount of space.

(In that book, removing the URLs made more than 50 pages disappear in Word.)

Slashes (Related to URLs)

Code:

The backwards/forward slash.

3 words? 4?

(Word considers it 3.)

I would strongly lean towards it being 4 different words.
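The 3-vs-4 difference comes down to whether "/" counts as a separator. A minimal Python sketch, assuming two simple splitting rules of my own (not Word's actual tokenizer):

```python
import re

SENTENCE = "The backwards/forward slash."

def split_on_space(text):
    """Whitespace only: 'backwards/forward' stays together."""
    return text.split()

def split_on_space_or_slash(text):
    """Whitespace or '/': the slash separates two words."""
    return [w for w in re.split(r"[\s/]+", text) if w]

print(len(split_on_space(SENTENCE)), split_on_space(SENTENCE))
print(len(split_on_space_or_slash(SENTENCE)), split_on_space_or_slash(SENTENCE))
```

The catch is that the slash rule also chops the example URL from earlier into 4 pieces, so fixing one edge case shifts the count on another.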

Images (Alt Text)

Code:

<img alt="Photograph of George Washington" src="../George.jpg" />

0 words? 4 words?

Alt Text is read/displayed with Text-to-Speech. There can be a ton of hidden information here, but again, which format you're in leads to different "word counts".

In a physical book, it looks just like an image.
In an ebook, it looks like an image, but the text is hidden.
In an audiobook (or Text-to-Speech), these are actual words
- (although the audiobook may SKIP over images + captions!).
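A short Python sketch of the two ways to count that snippet. The class name is hypothetical, and it only handles the alt attribute on img tags; it's an illustration, not how any particular editor counts.

```python
from html.parser import HTMLParser

SNIPPET = '<img alt="Photograph of George Washington" src="../George.jpg" />'

class AltTextCounter(HTMLParser):
    """Counts visible words and words hidden in img alt text separately."""
    def __init__(self):
        super().__init__()
        self.text_words = 0
        self.alt_words = 0

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            for name, value in attrs:
                if name == "alt" and value:
                    self.alt_words += len(value.split())

    def handle_data(self, data):
        self.text_words += len(data.split())

counter = AltTextCounter()
counter.feed(SNIPPET)
counter.close()

print(counter.text_words, counter.alt_words)  # 0 4
```

A "visual" count reports 0 and a Text-to-Speech-style count reports 4, for the exact same markup.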

Emojis

(I was actually thinking of this one a few days ago. I know Hitch has brought this up a few times over the years: these characters are becoming EVEN MORE and more prevalent in actual novels...)

🧛‍♂️

Is this 1 or 2 words?

Vampire?
Dracula?
Man Vampire?
- In its encoding, it's a VAMPIRE (U+1F9DB) + MALE SIGN (U+2642), glued together with an invisible ZERO WIDTH JOINER (U+200D). Depending on your program, it might display as 1 or 2 separate characters.

Or maybe it's even 0 words?

👫

Is this 1 or 5 words?

Man and Woman Holding Hands?

(This isn't even bringing all the skin color additions...)

(I tested in Word, it only counts an emoji with spaces around it as a word, so ⚽⚾🏈🏀 = "1 word".)
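If you want to see what's actually inside these emojis, here is a small Python sketch. As far as I know, the full "man vampire" sequence is four code points (VAMPIRE, a zero width joiner, MALE SIGN, and a variation selector); the constant names are my own, and the character names printed depend on the Unicode data shipped with your Python.

```python
import unicodedata

# Written as escapes so nothing gets mangled by copy/paste.
MAN_VAMPIRE = "\U0001F9DB\u200D\u2642\uFE0F"  # one glyph on screen, when supported
COUPLE = "\U0001F46B"
BALLS = "\u26BD\u26BE\U0001F3C8\U0001F3C0"    # four sports balls, no spaces

# One visible emoji, several code points underneath.
for ch in MAN_VAMPIRE:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(len(MAN_VAMPIRE))          # number of code points, not "characters" on screen
print(unicodedata.name(COUPLE))  # the official name is itself several words
print(len(BALLS.split()))        # whitespace splitting sees a single chunk
```

So depending on whether you count glyphs, code points, official names, or whitespace chunks, the same handful of emojis gives you very different "word counts".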

Superscripts

Code:

This is an example.<sup>1</sup>
The molecule for water is H<sub>2</sub>O.
Answer is x<sup>power</sup><sub>subscript</sub>

Is what's in the super/subscripts a separate word?
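A minimal Python sketch of how much the answer depends on one tiny decision: whether a stripped tag leaves nothing behind or leaves a space. The regex and function name are assumptions for illustration only (a regex is not a real HTML parser).

```python
import re

TAG = re.compile(r"<[^>]+>")

def count_words(markup: str, tag_replacement: str) -> int:
    """Strip tags, putting tag_replacement where each tag was, then count chunks."""
    return len(TAG.sub(tag_replacement, markup).split())

LINES = [
    "This is an example.<sup>1</sup>",
    "The molecule for water is H<sub>2</sub>O.",
    "Answer is x<sup>power</sup><sub>subscript</sub>",
]

for line in LINES:
    glued = count_words(line, "")    # "H2O." stays one word
    split = count_words(line, " ")   # "H 2 O." becomes three
    print(glued, split, line)
```

With these three lines the two rules give 4 vs 5, 6 vs 8, and 3 vs 5 words.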

Bibliographies/Indexes

Might add a lot of "words", but be extremely dense. (Very few physical pages to get A TON of information across.)

Complete Side Note: A lot of this reminds me of discussion surrounding "How many words are there in the English language?"

See Merriam Webster's article "How many words are there in English?":

Quote:

There is no exact count of the number of words in English, and one reason is certainly because languages are ever expanding; [...] Consider such words as "cannoli" and "teriyaki," which come from other tongues but are established through use, context, and frequency as English. There are many other thorny considerations that complicate the task of counting individual words and tallying up the language in that way. For example, are all of the inflected forms of a word–for instance, "drive," "drives," "drove," etc.–one word or several separate words?

Similarly, there are twelve different words with the spelling "post" entered in Webster's Third New International Dictionary, Unabridged; they all have different parts of speech or derivations. Should these twelve be considered one word for the purposes of our reckoning? Some scholars would insist the distinct forms of "post" only be counted once, but others consider each one a separate word that should be counted individually.

Another puzzle: should "port of call," another Webster's Third entry, count as a word, even though each of its components is entered separately?

[...]

You would think it's easy (just count how many words in the dictionary!), but it's an impossibly hard problem, and depends completely on methodology. Different assumptions will lead to huge fluctuations in outcome.

Quote:

Originally Posted by JSWolf

Again so very very very wrong. The idea is not to have the same page numbers between eBooks and pBooks. The idea is to have consistent page numbers for eBooks and ADE page numbers is that consistent page number system.

We've discussed this extensively back in 2017:

Citing Websites

(It seems like this "Page Numbers" talk bubbles up about once every year or two and explodes into an enormous volcano! :P)

I just want to say... those Byte-based algorithms can easily be thrown off by substituting HTML Entities or Pretty Print:

The amount of bytes for:

— =/= &mdash; =/= &#8212;
“ ” =/= &ldquo; &rdquo; =/= &#8220; &#8221;

I just grabbed Alice in Wonderland off Project Gutenberg:

ADE Pages:

66 (Original EPUB)
67 (Pretty Print)
68 (Pretty Print + HTML Entities)

You can imagine how a larger book would fare. It could be dozens of "ADE Pages" off.
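To show the mechanism, here is a rough Python sketch. It is NOT ADE's real algorithm: the 1024 bytes per "page" is an assumed figure, the sample paragraph is made up, and it ignores compression entirely. It only demonstrates that the same visible text can weigh a different number of bytes.

```python
import math

# The same dash as a literal character, a named entity, and a numeric entity.
FORMS = {
    "literal character": "\u2014",
    "named entity": "&mdash;",
    "numeric entity": "&#8212;",
}
for label, text in FORMS.items():
    print(label, len(text.encode("utf-8")), "bytes")

def pages_from_bytes(byte_count: int, bytes_per_page: int = 1024) -> int:
    """Hypothetical byte-based page count; bytes_per_page is an assumption."""
    return math.ceil(byte_count / bytes_per_page)

# Same visible text: compact markup vs. pretty-printed markup with an entity.
compact = "<p>Alice was beginning to get very tired\u2014</p>"
pretty = "<p>\n  Alice was beginning to get very tired&mdash;\n</p>\n"

copies = 1000  # pretend the book is 1000 such paragraphs
compact_pages = pages_from_bytes(len((compact * copies).encode("utf-8")))
pretty_pages = pages_from_bytes(len((pretty * copies).encode("utf-8")))
print(compact_pages, pretty_pages)
```

Not one visible character changed, yet the byte-based "page count" drifts, and the drift grows with the length of the book.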

(I attached the EPUBs below.)

Pre-empting JSWolf's Answer: Yes, I know... according to you, "everything should be Pretty Printed and using actual UTF-8 characters".

But seriously... you have got to stop this ADE+JSWolf's way is the one, true way!

Quote:

Originally Posted by j.p.s

It is very common for ebooks to be published with anchors and references based on page numbers in a print edition and is at the very core of this thread's subject matter.

And I think it's important to stress again that there are multiple different overlaps happening at once.

You have the things that are:

born-print and need to be digitized
purely born-digital
- Websites + new documents
and anything in-between

None of those counting solutions will work perfectly for all cases... but one thing's for certain, page numbers have been around for hundreds of years, and they're probably not going anywhere (no matter how much JSWolf and I want everything to be purely digital). :P