PDF is not an eBook format - Page 32

pdurrant · 09-02-2009, 09:35 AM

There is an easy solution to one of the hyphenation problems. When the ebook is being created, optional hyphens should be added by the creator to all the valid hyphenation positions in all the words in the book.

The creator would use software to automate the insertion of the hyphens, and said software would ask for help on words it hadn't already been told about, or which might have different hyphenations depending on context. This hyphenation position marking should be a quick and simple process, as most words will already be in the software's hyphenation dictionary. It can even suggest hyphenations to the operator for unknown words using language-specific algorithms. Most of the time the operator will just have to agree.

The problem of valid hyphenation positions is now solved - the rendering software needs no intelligence, it has only to hyphenate at the optional hyphen positions.
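To make the proposal above concrete, here is a minimal sketch of what such a creation-time tool might look like. It is only an illustration: the tiny dictionary, the list of context-dependent words, and the `MIN_LENGTH` cut-off are all made-up assumptions, and the optional hyphen is taken to be the Unicode soft hyphen (U+00AD).

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN: an optional, normally invisible break point
MIN_LENGTH = 5          # assumption: unknown words shorter than this are not flagged

# Hypothetical hyphenation dictionary (word -> pieces); a real one would be far larger.
HYPHENATION = {
    "hyphenation": ["hy", "phen", "ation"],
    "creator": ["cre", "ator"],
    "software": ["soft", "ware"],
}

# Hypothetical list of words whose break points depend on meaning.
AMBIGUOUS = {"present", "record"}

WORD_RE = re.compile(r"[A-Za-z]+")


def mark_hyphens(text):
    """Return (text with soft hyphens added, sorted list of words needing a human)."""
    review = set()

    def replace(match):
        word = match.group(0)
        key = word.lower()
        if key in AMBIGUOUS:
            # Context-dependent: leave untouched and ask the operator.
            review.add(key)
            return word
        pieces = HYPHENATION.get(key)
        if pieces is None:
            # Unknown word: flag it unless it is too short to matter.
            if len(key) >= MIN_LENGTH:
                review.add(key)
            return word
        # Slice the original word so its capitalisation is preserved.
        out, pos = [], 0
        for piece in pieces:
            out.append(word[pos:pos + len(piece)])
            pos += len(piece)
        return SOFT_HYPHEN.join(out)

    return WORD_RE.sub(replace, text), sorted(review)


marked, review = mark_hyphens("The creator used software to present hyphenation.")
print(review)  # ['present']
```

The renderer then only has to break at U+00AD. How long the review list gets for a real book, and how long it takes to work through, is exactly what the sketch does not answer.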

Quote:

Originally Posted by ahi

Keep, reading, I guess. (Note that "not machine-solved yet" != "machine-solvable".)

- Ahi

Ps.: Or why don't I help you out...

WillAdams · 09-02-2009, 09:47 AM

jbjb,

The appropriate cite for ``typography is not a machine solvable problem'' would be the Knuth-Plass paper ``Breaking Paragraphs into Lines'', D.E. Knuth and M.F. Plass, chapter 3 of _Digital Typography_, CSLI Lecture Notes #78.

Please note that there is no H&J algorithm which can successfully detect and prevent ``stacks'' or rivers --- it seems to be (to use the formal computing term) ``NP Complete'' --- I'd be very interested in any research or algorithm which makes this a solvable problem.

There're even fewer efforts to solve typographic problems at a level larger than a page --- and I've frequently had to relay an entire chapter because of how the last page fell out --- Here's a list of the current research on this:

http://groups.google.com/group/comp....9?dmode=source

So, unless someone has an example of an implementation which will automatically paginate a text and _not_ allow stacks, orphans or other bad breaks, I believe that the above references should stand as the requested citation to demonstrate that, ``typography is not a machine solvable problem''.

William

WillAdams · 09-02-2009, 09:49 AM

Pdurrant --- if you think inserting all possible appropriate hyphens is easy, please provide a rate quote for doing this to text on a per file basis per thousand characters of text. Please note that ``present'' has different hyphenation points depending on its pronunciation (whether it's a gift or the act of presenting something or the current time) and that any method for doing this would need to take into account any such words and only insert the appropriate and correct hyphenation.

William

jbjb · 09-02-2009, 09:55 AM

Quote:

Originally Posted by ahi

Keep, reading, I guess. (Note that "not machine-solved yet" != "machine-solvable".)

- Ahi

Ps.: Or why don't I help you out...

You are missing the point (in fact, several points). At the most pedantic, not machine-solvable (as opposed to not yet machine-solved) is a very specific assertion, and needs some concrete logical proof of non-computability.

Furthermore, the problem itself is sufficiently ill-defined that the assertion is meaningless. As has been pointed out to you several times, different people have different opinions of the level of typography required for the problem to be classed as "solved".

I assume you'd be happy to concede that there is no single perfect typographical layout that would be universally recognised, and that even "experts" would disagree about which was superior out of a selection of hand-made layouts? Given that, if the criterion for success is the perfect layout, then you could trivially say that the problem is not machine-solvable, but it's also unsolvable period.

The only meaningful (it seems to me) yardstick for "solved" is when the typography is sufficiently good that the reader is completely happy with it. Different people will have different thresholds for this, and many people's can be met with an automated solution. Furthermore, and this is key, just because you claim expertise in this field doesn't make your opinion of what is acceptable to any given reader any more valid than their own. Different people want different things.

/JB

Abecedary · 09-02-2009, 10:00 AM

Quote:

Originally Posted by WillAdams

Pdurrant --- if you think inserting all possible appropriate hyphens is easy, please provide a rate quote for doing this to text on a per file basis per thousand characters of text.

William

And that's not even mentioning that you're not having a machine solve the problem, which is what this is supposedly taking care of.

I find the thought of inserting at least 5 extra characters into every single multisyllabic word to be a ridiculously poor way of handling the problem. For one, it would make the HTML nearly unreadable due to the number of extra markups involved (not exactly a huge problem, but one that could certainly be a major nuisance to some). And let's not even think about what it would do to the filesizes.
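For a rough feel for the file-size question, here is a back-of-the-envelope sketch. The figures for book size and number of break points are placeholder assumptions, not measurements. The "5 extra characters" is read here as the HTML entity `&shy;`, and the comparison is with a literal U+00AD character, which takes 2 bytes in UTF-8.

```python
def overhead(text_bytes, break_points, marker):
    """Fractional size increase from writing `marker` at every break point."""
    added = break_points * len(marker.encode("utf-8"))
    return added / text_bytes


# Placeholder assumptions for one novel-length book (not measured data).
TEXT_BYTES = 500_000    # size of the unmarked text
BREAK_POINTS = 60_000   # total optional hyphens inserted

entity = overhead(TEXT_BYTES, BREAK_POINTS, "&shy;")    # 5 bytes per break point
literal = overhead(TEXT_BYTES, BREAK_POINTS, "\u00ad")  # 2 bytes per break point

print(f"&shy; entity: +{entity:.0%}, literal U+00AD: +{literal:.0%}")
# &shy; entity: +60%, literal U+00AD: +24%
```

Under these assumed numbers the entity form is the expensive one. The literal character also keeps the markup readable in an editor that hides it, though it is then easy to overlook. Both figures are before compression, which a zipped container format would apply.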

ahi · 09-02-2009, 10:11 AM

Quote:

Originally Posted by Abecedary

And that's not even mentioning that you're not having a machine solve the problem, which is what this is supposedly taking care of.

I find the thought of inserting at least 5 extra characters ('*') into every single multisyllabic word to be a ridiculously poor way of handling the problem. For one, it would make the HTML nearly unreadable due to the number of extra markups involved (not exactly a huge problem, but one that could certainly be a major nuisance to some). And let's not even think about what it would do to the filesizes.

Well... at least it shifts the weight of the document's typographic preparation where it belongs: the bookmaker's shoulders.

Even if genuinely solving the hyphenation problem was possible and practical, would all eBook readers include all hyphenation patterns for all languages?

All popular languages.
All less popular languages.
All languages that have no states attached to them? (Or at least the ones with more speakers than some of the world's smaller countries.)

... and, of course, some of these languages will be such that they will have words whose meaning, and therefore correct hyphenation, depends entirely on the semantic context. Which takes us to the eBook reading software having to try to at least figure out the grammar for X number of languages (where X might be very large indeed).

The truth is, it would be easier (and quite possibly achievable with remarkably high degree of accuracy) to have eBook readers replace dumb quotes with smart quotes in ePubs on the run... But I don't see that ever happening. And until it does, or something very like it is successfully implemented, I find it difficult to take seriously any suggestion that eBook readers will ever even try to hyphenate in a way that has any chance of producing correct hyphenation for a genuinely large percentage of all eBooks (regardless of language).
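As an illustration of how far simple rules get with the smart-quote case, here is a naive sketch. The rule used (a quote at the start of the text, or after whitespace or an opening bracket, is an opening quote; anything else is closing) is an assumption for demonstration, not how any particular reader does it.

```python
import re

OPEN_DOUBLE, CLOSE_DOUBLE = "\u201c", "\u201d"
OPEN_SINGLE, CLOSE_SINGLE = "\u2018", "\u2019"


def smarten(text):
    """Replace straight quotes with curly ones using a position-based guess."""
    # A double quote at the start, or after whitespace or an opening bracket, opens.
    text = re.sub(r'(^|[\s(\[{])"', lambda m: m.group(1) + OPEN_DOUBLE, text)
    # Every remaining double quote is treated as closing.
    text = text.replace('"', CLOSE_DOUBLE)
    # Same guess for single quotes; also allow one right after an opening double quote.
    single_open = r"(^|[\s(\[{" + OPEN_DOUBLE + r"])'"
    text = re.sub(single_open, lambda m: m.group(1) + OPEN_SINGLE, text)
    # Everything left is an apostrophe or a closing single quote.
    text = text.replace("'", CLOSE_SINGLE)
    return text


print(smarten('"Don\'t," she said.'))  # “Don’t,” she said.
print(smarten("'tis"))                 # ‘tis  (wrong: should be an apostrophe)
```

The second call shows the sort of case a purely positional rule gets wrong: a leading apostrophe looks exactly like an opening quote.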

But, of course, I do not believe it is possible for them to actually succeed, even if they do try.

- Ahi

frabjous · 09-02-2009, 10:17 AM

Wow, such overreaction.

LaTeX already knows the hyphenation of most words, as has already been not only stated but demonstrated. The fact that a million words exist, the vast majority of which are almost never used, is completely irrelevant. Running a script that finds exceptions to what the system knows and identifies words known to be ambiguous (though grammar-check-like software could handle most such cases), then giving the results to the book designer at book creation to mark just these words, likely wouldn't take more than 10 minutes per book.

If the trade-off is between no hyphenation anywhere, or a non-reflowable format, vs the occasionally wrong pre-sent vs. pres-ent, I certainly would put up with the latter.

There's no reason in principle why a computer can't identify stacks and rivers, and try to do something about them. LaTeX already can be programmed to completely avoid widows and orphans, though on a sufficiently small page, doing so would be a bad idea.

Maybe a perfect algorithm isn't possible, and a computer can't do these things perfectly, though I'm not sure they're humanly perfectible either, but let me just say this...

Using the failure of perfectibility as an excuse for pooh-poohing the push for software that does these things much better than what we currently have is, well... incredibly silly.

ahi · 09-02-2009, 10:41 AM

Quote:

Originally Posted by jbjb

You are missing the point (in fact, several points). At the most pedantic, not machine-solvable (as opposed to not yet machine-solved) is a very specific assertion, and needs some concrete logical proof of non-computability.

Furthermore, the problem itself is sufficiently ill-defined that the assertion is meaningless. As has been pointed out to you several times, different people have different opinions of the level of typography required for the problem to be classed as "solved".

I have on most occasions, whether or not I have done so in the piece I quoted, stated that I believed typography to not be machine-solvable without human-level artificial intelligence. While strictly speaking the two statements are not equivalent, in practical terms they certainly are... Unless you anticipate "Rights for Robots" campaigns to start up within the next few decades--and I don't.

The different opinions mostly come from individuals whose work and profession have nothing whatsoever to do with bookmaking. To be frank, I am not going to pretend they ought to be considered on equal footing with definitions of typography not debased by an ardent desire to only read HTML books in the future.

But let's not even worry about typography's solvability being defined, unless you have some cogent argument to make about how hyphenation-at-display-time can be solved in a way that works for practically (but, to be fair, not literally) all of humanity, not just the anglosphere or the western world. (Comprehensive [as opposed to superficial] hyphenation patterns for Gikuyu, anyone? [Presumably with autodetection of English and Swahili words included, to which their respective hyphenation ought to be applied.] So)

Quote:

Originally Posted by jbjb

I assume you'd be happy to concede that there is no single perfect typographical layout that would be universally recognised, and that even "experts" would disagree about which was superior out of a selection of hand-made layouts? Given that, if the criterion for success is the perfect layout, then you could trivially say that the problem is not machine-solvable, but it's also unsolvable period.

The only people talking about perfect typography are the ones that prefer HTML books that can barely demonstrate any typography.

The fact that even experts cannot agree on perfect typography is irrelevant.

Experts and even reasonably intelligent and knowledgeable amateurs will be able to recognize good, high quality typography when they see it. But even this is irrelevant at this point.

The problem with display-time typography is that without perfect hyphenation (and sometimes even with perfect hyphenation) there may well be no straightforward way to render without committing blatant, obvious, and egregious typographic errors. And surely you will concede that whether or not experts can agree on what is perfect hyphenation, there are objective standards in most written languages as to what is incorrect hyphenation.

Quote:

Originally Posted by jbjb

The only meaningful (it seems to me) yardstick for "solved" is when the typography is sufficiently good that the reader is completely happy with it.

In the same way that the only yardstick for the value of a diamond ring is how well the bride-to-be receives it.

Quote:

Originally Posted by jbjb

Different people will have different thresholds for this, and many people's can be met with an automated solution. Furthermore, and this is key, just because you claim expertise in this field doesn't make your opinion of what is acceptable to any given reader any more valid than their own.

No. My actual level of expertise, which is moderate at best, however does. Much as my car-mechanic's musings of my car's road-worthiness are far more valid than my own, regardless of whether or not I am satisfied with its operation.

Quote:

Originally Posted by jbjb

Different people want different things.

Clearly. I don't think this has as thoroughly broad implications as you suggest though.

---

Having said all this... let me somewhat give you what you want me to say:

Yes, I do believe it is possible to get hyphenation and typography right enough for a lot of people to be satisfied. Primarily because it already happened, since a lot of people are fine with no hyphenation and utterly broken typography.

Any improvements will doubtless be welcomed and celebrated by those people as much as anyone. What I do not believe though, is that either hyphenation or typography in general can be gotten to a state where it is objectively of a professional quality (note, I did not say perfect).

- Ahi

ahi · 09-02-2009, 10:53 AM

Quote:

Originally Posted by frabjous

Wow, such overreaction.

Me?

Quote:

Originally Posted by frabjous

LaTeX already knows the hyphenation of most words, as has already been not only stated but demonstrated.

If I understand correctly, LaTeX hyphenation patterns are not wordlists, but pattern lists. Which means there is no automatic way of identifying words for which correct hyphenation patterns are not known. (i.e.: meaning both words for which no hyphenation is possible with the given pattern-set, and words for which LaTeX's known hyphenation is actually incorrect)
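The point about pattern lists versus word lists can be sketched with a stripped-down version of the pattern-matching idea used by TeX (Liang's algorithm): digits inside patterns score the gaps between letters, the highest score per gap wins, and odd scores permit a break. The handful of patterns below are toy examples in the style of TeX's pattern files and should be treated as illustrative only; the function names and the `left_min`/`right_min` defaults are assumptions of this sketch.

```python
# Toy patterns in the style of TeX pattern files (illustrative, not a real pattern set).
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse(pattern):
    """Split 'hy3ph' into ('hyph', [0, 0, 3, 0, 0]): one score per inter-letter gap."""
    letters, values = "", [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters += ch
            values.append(0)
    return letters, values


def hyphenate(word, patterns, left_min=2, right_min=3):
    """Insert '-' wherever the winning gap score is odd."""
    table = dict(parse(p) for p in patterns)
    padded = "." + word.lower() + "."      # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[k] = gap just before padded[k]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = table.get(padded[start:end])
            if values:
                for offset, value in enumerate(values):
                    scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    for j in range(left_min, len(word) - right_min + 1):
        if scores[j + 1] % 2 == 1:         # +1 skips the leading boundary dot
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("hyphenation", TOY_PATTERNS))  # hy-phen-ation
print(hyphenate("hyphio", TOY_PATTERNS))       # hy-phio
print(hyphenate("gikuyu", TOY_PATTERNS))       # gikuyu
```

Note what the output does not contain: any signal of whether a word was "known". The nonsense string "hyphio" gets a confident-looking break, and "gikuyu" comes back unbroken simply because no pattern happened to match. Finding the words the patterns get wrong needs a separate reference list to compare against.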

In Hungarian, hyphenation of certain words (not an enumerably small list) depends on semantic context... literally no way to know the correct hyphenation without understanding the word/sentence.

In addition, the Hungarian double digraphs "ssz" (a long "sz"), "ccs" (a long "cs"), "zzs", "ggy", "nny" are treated unorthodoxly. "massza" is hyphenated as "masz-sza"... however, "ssz" could also be "s+sz", as in "vasszarv", which is correctly hyphenated as "vas-szarv". The LaTeX solution is to manually mark double digraphs... so that if hyphenation needs to occur there, it is not mistakenly separated the wrong way. Oh... and, of course, this is also an issue with single digraphs. Is a "cs" sequence a digraph, or merely "c+s"? Is an "sz" or a "zs" sequence a digraph, or "s+z" or "z+s"?
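The "ssz" ambiguity above can be shown mechanically. This sketch only enumerates the two readings for each double-digraph spelling; it makes no attempt to choose between them, which is the part that needs knowledge of the word. The mapping covers just the five spellings listed in the post, and the function name is made up for illustration.

```python
# Double (long) digraph spellings from the post, mapped to their base digraph.
DOUBLE_DIGRAPHS = {"ssz": "sz", "ccs": "cs", "zzs": "zs", "ggy": "gy", "nny": "ny"}


def candidate_breaks(word):
    """List both possible hyphenations at each double-digraph spelling in `word`."""
    candidates = []
    for spelling, base in DOUBLE_DIGRAPHS.items():
        start = word.find(spelling)
        while start != -1:
            end = start + len(spelling)
            # Reading 1: a long digraph, written out as two full digraphs when split.
            candidates.append(word[:start] + base + "-" + base + word[end:])
            # Reading 2: a compound boundary between a single letter and a digraph.
            candidates.append(word[:start + 1] + "-" + word[start + 1:])
            start = word.find(spelling, start + 1)
    return candidates


print(candidate_breaks("massza"))    # ['masz-sza', 'mas-sza']
print(candidate_breaks("vasszarv"))  # ['vasz-szarv', 'vas-szarv']
```

For "massza" the first candidate is the right one; for "vasszarv" it is the second. Nothing in the letter sequence itself says which, so a renderer would need either a word list or a mark left by whoever prepared the text.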

Tolerable hyphenation that is right most of the time will not forever be impossible to do at display-time. Professional hyphenation correct to the standards of books published by reputable publishers, however, I believe will remain so perpetually because of the myriad complications (most of which you and I do not even know, on account of being language-specific issues) on top of the already formidable challenges.

- Ahi

DawnFalcon · 09-02-2009, 10:55 AM

Quote:

Originally Posted by ahi

... and, of course, some of these languages will be such that they will have words whose meaning, and therefore correct hyphenation, depends entirely on the semantic context.

Which language is that?

Also, sorry, I also don't believe your statement about typography without at least an informal proof. (Make a statement on a math problem, provide a proof or a demonstration. Thanks!)

Frabjous - A LaTeX install is also, at a minimum, hundreds of megabytes in size. This is one of the things I'm on about - it's not suitable as a typographical processor in a low-resource environment. Typography is demonstrably mostly solvable by brute force, but that solution is not applicable to low-power devices.

ahi · 09-02-2009, 10:56 AM

Quote:

Originally Posted by frabjous

Using the failure of perfectibility as an excuse for pooh-poohing the push for software that does these things much better than what we currently have is, well... incredibly silly.

To me, it feels a bit like a calculator pre-calculating (as much as possible) the sums of all additions and subtractions in advance, by taking the first value and counting upwards or downwards the number of times in the second value.

It's a worthy task done the wrong way and at the wrong stage of processing.

... and the other issues I mention one post above.

- Ahi

WillAdams · 09-02-2009, 11:09 AM

ahi wrote:
>some of these languages will be such that they will have words whose meaning, and therefore correct hyphenation, depends entirely on the semantic context.

and Dawnfalcon asked
>Which language is that?

English for one, see my example for ``present'' in the post just above.

The Knuth & Plass paper which I cited has a formal proof and discussion of the impossibility of finding the perfect set of breaks for a paragraph.

jbjb --- you keep saying that something is possible (machine-done, perfect page composition) and asking people to prove that it's not possible --- yet you can't prove that it is possible by showing us a single implementation --- yet a large number of people, some of whom work in this field are stating that it isn't possible, and have pointed you to research papers on the difficulties of this task.

I can't even find a grammar checker which can reliably disambiguate between the two different forms of ``present'', let alone every other such word in the English language --- and that's only a small part of the problem.

William

acidzebra · 09-02-2009, 11:14 AM

Quote:

Originally Posted by WillAdams

The Knuth & Plass paper which I cited has a formal proof and discussion of the impossibility of finding the perfect set of breaks for a paragraph.

Perfect? Sure, I'll buy that it is not possible. Good enough - who knows? I doubt humans produce "perfect" results.

edit: and of course, good enough for which set of people

jbjb · 09-02-2009, 11:14 AM

Quote:

Originally Posted by WillAdams

The appropriate cite for ``typography is not a machine solvable problem'' would be the Knuth-Plass paper ``Breaking Paragraphs into Lines'', D.E. Knuth and M.F. Plass, chapter 3 of _Digital Typography_, CSLI Lecture Notes #78.

I've not been able to find that paper for free on the net (do you have a link?), but excerpts that I've read from similar papers seem to indicate that there are some ways of defining the problem which are NP-complete, but others that aren't. As I said in another post, it all comes down to what is defined as an acceptable output.

Quote:

Please note that there is no H&J algorithm which can successfully detect and prevent ``stacks'' or rivers --- it seems to be (to use the formal computing term) ``NP Complete'' --- I'd be very interested in any research or algorithm which makes this a solvable problem.

Are you sure about that claim? Or is it really that there is no H&J algorithm which can successfully detect and prevent ``stacks'' or rivers that meets somebody's arbitrary criteria for "looks nice"?

Seems to me that simply e.g. detecting and preventing stacks should be fairly straight-forward if you're allowed to arbitrarily add space and break lines wherever you want.
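As a rough sketch of the detection half only: this takes a "stack" to mean the same word ending two or more consecutive lines, which is a working definition assumed here rather than one given in the thread. The function names and the punctuation stripping are likewise illustrative.

```python
def last_word(line):
    """Final word of a line, lower-cased, with trailing punctuation removed."""
    words = line.split()
    return words[-1].strip(".,;:!?\"'").lower() if words else ""


def find_stacks(lines, min_run=2):
    """Return (first_line_index, run_length, word) for each run of matching line ends."""
    stacks = []
    run_start = 0
    for i in range(1, len(lines) + 1):
        continues = (
            i < len(lines)
            and last_word(lines[i]) != ""
            and last_word(lines[i]) == last_word(lines[run_start])
        )
        if continues:
            continue
        run_length = i - run_start
        if run_length >= min_run:
            stacks.append((run_start, run_length, last_word(lines[run_start])))
        run_start = i
    return stacks


paragraph = [
    "the quick brown fox saw the",
    "dog and then chased the",
    "cat around the",
    "garden all day",
]
print(find_stacks(paragraph))  # [(0, 3, 'the')]
```

Detection is the cheap part. Removing a stack means re-breaking the paragraph, and each fix can create a new stack, a river, or a bad page break elsewhere, which is where the two sides of this argument actually disagree.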

I know that's a pedantic point, but non-computability is a formal thing and needs to be treated formally (i.e. with a better definition of what constitutes a solution before claiming one can't be found).

/JB

ahi · 09-02-2009, 11:20 AM

Quote:

Originally Posted by WillAdams

ahi wrote:
>some of these languages will be such that they will have words whose meaning, and therefore correct hyphenation, depends entirely on the semantic context.

and Dawnfalcon asked
>Which language is that?

English for one, see my example for ``present'' in the post just above.

The Knuth & Plass paper which I cited has a formal proof and discussion of the impossibility of finding the perfect set of breaks for a paragraph.

jbjb --- you keep saying that something is possible (machine-done, perfect page composition) and asking people to prove that it's not possible --- yet you can't prove that it is possible by showing us a single implementation --- yet a large number of people, some of whom work in this field are stating that it isn't possible, and have pointed you to research papers on the difficulties of this task.

I can't even find a grammar checker which can reliably disambiguate between the two different forms of ``present'', let alone every other such word in the English language --- and that's only a small part of the problem.

William

Thanks, William.

I think one fundamental problem is that people assume that some of these problems are beyond them, but surely not beyond some people out there smart enough or computers out there fast enough. When in fact... in a way, it truly is. The problem has been solved the only way it can be: with intelligent and educated people doing a good bit of work in advance. Any other solution will be recognizably poorer in quality for the foreseeable future... even if not forever.

Shall we say: The fact that I don't know how to do something, does not mean that somebody else out there does, or even that it is practically doable at all.

- Ahi


