Quote:
Originally Posted by Simons Mith
[...] the user preferences that I was trying to find out about. I eventually got some answers from actual users on a 'Blind' forum on Reddit. [...]
I'm not sure how true /this/ aspect is any more, but back then there were two schools of preference for audio books. Some people (more than half, but not overwhelmingly more) liked them read by a real person. Some people (a non-negligible minority) actually liked a robot voice. And they liked the robot voice to be as flavourless as possible, and they would dial it up to 200% speed (or faster, with practice) so that they could listen to an audiobook at super-speed.
|
On Audio Speed
Yes. Over the past few years, I've slowly ramped up my audio speed.
And the more I'm used to the voice, the faster I can go.
When I first started listening to podcasts (and TTS), I bumped myself up to 1.2x speed.
My thinking was: "I can get 20% more productivity out of this." (Listening to the same stuff in 80% of the time OR listening to 20% more material.)
Once I got used to that, I settled on 1.3x for the longest time.
After about a year, I quickly ramped up to 1.6x, then 2x and beyond. (Now, I listen to most audio+video at 2.5x–3x.)
Another enhancement I've made is "cutting the silence".
~33% of all speaking is completely dead air (breathing, thinking, etc.). If you remove that from podcasts/lectures/videos, you've also shaved off ~33% of the time.
Take a 1 hour lecture as an example:
Code:
Speed   Time (mins)   Time (Silence Removed)
1.0     60            40
1.2     50            33.3
1.5     40            26.7
2.0     30            20
You'd take a full hour to listen to the lecture; I can finish it in 20 minutes.
Or another way of looking at it:
I can listen to 3 full lectures in the same time it would take for you to complete 1! (20+20+20 vs. 60)
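The arithmetic above can be sketched in a few lines (illustrative numbers only; the ~33% silence figure is a rough average, and real silence-trimming tools vary in how aggressively they cut):

```python
# Effective listening time for a 60-minute lecture at various speeds,
# assuming ~1/3 of the audio is dead air that silence removal trims away.
# These values match the table above.

def effective_minutes(minutes, speed, silence_fraction=0.0):
    """Listening time after speed-up and optional silence removal."""
    return minutes * (1 - silence_fraction) / speed

for speed in (1.0, 1.2, 1.5, 2.0):
    plain = effective_minutes(60, speed)
    trimmed = effective_minutes(60, speed, silence_fraction=1 / 3)
    print(f"{speed}x  {plain:5.1f} min  {trimmed:5.1f} min (silence removed)")
```

The two savings multiply: silence removal alone cuts a third, and each speed-up divides what's left.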
* * *
On Overriding User-Defined Settings
The past few days, I was reading through lots of "CSS Speech" material (and watching those Interspeech talks).
I ran across this article:
which discussed how horrible the support for CSS Speech still is + "Just because you can, doesn’t mean you should".
It also referenced this fantastic article/chapter:
(Clark is the creator of CSS "Aural Stylesheets", which have since been deprecated in favor of "CSS Speech".)
I bolded the relevant section:
Quote:
Aural application
“Who’s gonna use this?” you ask. The answer is: Effectively no one.
Media stylesheets in general are poorly supported. Even a simple print stylesheet – for printed pages as opposed to screen display – will be ignored by certain browser versions (and some media-stylesheet combinations will crash our old friend, that carcinoma of the Web, Netscape 4).
We also face the issue of appropriateness of device. Remember the summary attribute of HTML tables? The W3C specification tells us unequivocally: “This attribute provides a summary of the table’s purpose and structure for user agents rendering to non-visual media such as speech and Braille.” It is not even a subject of debate whether or not a graphical browser should support summary. It must not do so, except inasmuch as such a browser has a speech or “non-visual” mode. (iCab on Macintosh can read Web pages aloud, and when it does so it reads the summary aloud, too.)
Why should graphical browsers support aural stylesheets?
Shouldn’t that support be hived off onto screen readers?
But those programs already offer a vast range of controls for vocal characteristics. To make a visual analogy, a low-vision person may find the graphical defaults chosen by Web authors mildly annoying and may set up browser defaults or a user stylesheet to override them. But if Web designers set up aural stylesheets that override a screen-reader user’s very-carefully-thought-out speech choices, honed over weeks and months of use, in favour of something you slapped together because you liked the idea of using Elmer Fudd’s voice to enunciate link text, the blind visitor may well end up far more than mildly annoyed.
It is a greater sin to mess with an individual blind visitor’s speech settings via ACSS than any sin you could imagine that affects low-vision or colourblind people. Annoying sounds are far more annoying than annoying images. Rejigging a user’s volume settings alone is more than enough to make you an enemy for life. Among other things, sound settings are harder to avoid: If you think a blackboard is ugly, you can look away, but you cannot look away from the sound of fingernails scratching a blackboard. If you dislike the appearance of a Website, you have a remarkable armamentarium at your disposal to reformulate that site’s visual rendering to your liking via user CSS. But if you’re stuck with somebody else’s voice and sound choices, you truly are stuck.
|
So much of this holds just as true today as it did in 2002.
* * *
Another thing I've written about over the years is:
"How do blind people (or Screen Readers) read actual HTML/code?"
Many screen-reader users set manual overrides to play their own custom sounds, like dings or bells, for things like italics/emphasis/lists (<i> + <em> + <li>).
And as the large quote above says, overriding those user customizations should be a cardinal sin! (Similar to those rotten websites that try to override/disable keyboard shortcuts!)
I'd also recommend checking out the recent:
And remember, it's not just "blind people" using audio; there are lots of low-vision (or fully sighted) cases where a reader may be consuming the text in completely different ways.
As an ebook designer... you want to mark your ebooks up with proper HTML (correct lang, <i> vs. <em>, Headings as <h1-6>, [...]), but not get in the way of the user themselves.
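A minimal illustration of the markup distinctions mentioned above (the book title and phrases are made-up examples):

```html
<html lang="en">
<body>
  <h1>Chapter One</h1>
  <!-- <em> = actual stress emphasis; a TTS voice may change inflection -->
  <p>I <em>never</em> said that.</p>
  <!-- <i> = conventionally italic but NOT emphasized: titles, terms,
       foreign phrases (note the lang override for correct pronunciation) -->
  <p>She was reading <i>Moby-Dick</i> on the train,
     muttering <i lang="fr">c'est la vie</i>.</p>
</body>
</html>
```

Semantically correct markup like this gives the TTS engine (or screen reader) the information it needs, while leaving the actual rendering decisions to the user's own settings.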
Quote:
Originally Posted by Simons Mith
They would have found the audio flourishes that I was wondering about adding to be extremely annoying. [Basically because they reached a state of flow where they stopped noticing the robo-voice altogether and could concentrate entirely on the text. Meddling with the voices in any way (e.g. raising the pitch or speed for a child's voice, using different voices for different characters in dialogue, any of that kind of stuff) would have broken the flow for them.]
|
Yes. And because I've gotten acclimated to certain voices, I can listen to those faster than normal.
On Audio Voices
If I'm listening to a podcast, and they're interviewing someone with a very thick accent (or someone I'm not used to), I must slow down the audio (typically to 1.5x or 2x). Same with many female speakers (their voices tend to be higher pitched, so speeding up too far makes them very hard to understand).
If a book were flip-flopping away from my preferred voice, overriding my settings, etc., I too would probably get angry.
There may be a case for using CSS Speech to hint broad categories, like "Male vs. Female" OR "Male 1 vs. Male 2". Kind of like what I wrote about in a 2017 sidenote while discussing JAWS + proper language markup...
But the reality of an ebook designer marking this stuff up in that detail at the sentence level (and doing it properly)... the chances are very slim to none. (Also see the large "knowledge gap" quote below.)
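For concreteness, that kind of broad hinting might look something like this, using properties from the CSS Speech Module (support in real readers/browsers is effectively nonexistent, and the class names here are hypothetical):

```css
/* Broad voice-category hints only; the reader's TTS picks the actual
   voice, so the user's chosen voice/speed settings stay in control. */
.narrator        { voice-family: neutral; }
.character-one   { voice-family: female 1; }
.character-two   { voice-family: male 1; }
blockquote       { voice-rate: slow; }
```

Even this coarse level would require the designer to tag every line of dialogue, which is exactly the sentence-level effort that realistically never happens.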
Quote:
Originally Posted by Simons Mith
In principle the CSS was already defined to do those things, even back then, but no-one had implemented it, and the reason no one was rushing to implement it was because there was no discernible demand for it to /be/ implemented. I'd bet this is still true today.
|
Yes. And in that chapter I referenced, there was also this section:
Quote:
The knowledge gap
As everywhere in media access per se (think of captioning, audio description, subtitling, and dubbing), even if we enjoyed a flawlessly reliable technical infrastructure for aural stylesheets, how many working Web designers and developers would know how to write them?
You’re pretty handy in Photoshop, and you can even write all-CSS layouts. You’ve written entire back ends in SQL. Audio? You can handle audio, kind of. You’ve certainly ripped MP3s to compact disc. Now, though, your boss (or the World Wide Web Consortium, whichever is worse) wants you to craft computer voices, position them in three-dimensional space, and specify background music and tones for special components.
You simply don’t have that training. Nor should anyone expect you to have it. Nor is there anywhere you can get that training.
At the authorial level, aural stylesheets are a character in search of an author. Literally.
|
And I agree. I still think CSS is the completely wrong level to handle this.
You have the alternate level above, the "TTS engines", which handle parsing + adding all that SSML for you automatically. Those engines/networks can (and do) keep getting better all the time.
Yes, perhaps in the future there could be some reader with an easy-to-read/-manipulate (separate) file you can feed with a list of Proper Nouns + special pronunciations... but to clog up the HTML+CSS with all of that? No.
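A sketch of what such a separate pronunciation file could look like. The W3C Pronunciation Lexicon Specification (PLS) already exists for exactly this purpose; the entries below are made-up examples:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <!-- Proper nouns the TTS engine would otherwise mangle -->
  <lexeme>
    <grapheme>Hermione</grapheme>
    <phoneme>hɜːˈmaɪəni</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Nguyen</grapheme>
    <phoneme>ŋwiən</phoneme>
  </lexeme>
</lexicon>
```

A sidecar file like this keeps the pronunciation data out of the HTML+CSS entirely, and the reading system (or TTS engine) can apply or ignore it without ever touching the user's voice settings.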