Telling a text-to-speech reader how to pronounce things?

Simons Mith · 11-05-2021, 09:49 PM

I tried reading my self-written ebook using Windows text-to-speech and was quite impressed with how well it worked, even on made-up proper nouns. But it stumbled here and there. It correctly read 1,000km as "one thousand kilometres", for example, but 12m was read as one-two-m rather than "12 metres". Is there a way to embed the correct pronunciation for words like this, that the Calibre reader can use?

I know I could rewrite that example as 12 metres, but there are other cases where that's not an option. Trickiest one I've noticed was a place where it used the wrong word stress for 'record', pronouncing it as the noun rather than the verb. Is there a way to tell the text-to-speech reader how to pronounce tricky stuff correctly? I'll put it in if it's easy to do.

I know many TTS readers can be manually coded by the user with rules on how to pronounce unfamiliar words - can I embed that information in the epub so that the users don't have to?

Jellby · 11-06-2021, 03:46 AM

Some nitpicking: The correct typesetting is "1,000 km" and "12 m", with a non-breaking, non-stretching space if you want, but with a space. I don't know if that will help you though.

Simons Mith · 11-08-2021, 06:17 AM

I have found some references to VTML online [https://static.carahsoft.com/concret..._Language.pdf] but it smells rather proprietary to me.

While that would let me do something like

Code:

<vtml_partofsp part="verb">record</vtml_partofsp>

to get 'record' pronounced as a verb, I have no idea how well it will work in general.

Fixing these mispronunciations certainly counts as nice-to-have rather than vital, and anyway they're commendably rare considering it's a sci-fi book, but is there a better way than VTML tags, which I only found out about yesterday?

I'm thinking again about the typography tweaks for the units. I find conventional spaces to be too wide for my tastes, but IME the various narrow spaces are less well supported. I don't want to get a little bit fancy and then have those annoying boxes appear because some lame reader doesn't know what a figure space is. OTOH my experience on rendering of custom spaces might be out of date now. Maybe they're reliably supported for the most part?

Doitsu · 11-08-2021, 08:50 AM

There's actually a W3C draft:

EPUB 3 Text-to-Speech Enhancements 1.0

IVONA 2 Text-To-Speech has SSML support:

SSML Support in Ivona Text-To-Speech

Some Microsoft SAPI voices also have limited SSML support.

Improve synthesis with Speech Synthesis Markup Language (SSML)

However, AFAIK, there aren't any epub3 apps with SSML support.
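
For reference, here is a rough sketch of what SSML hints for the two problem cases in this thread might look like. It is built with Python's standard library so the markup stays well-formed. The `<sub>` and `<phoneme>` elements come from the SSML spec; the sample sentence and the IPA string are illustrative assumptions, and whether a given voice honours them will vary by engine.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def build_ssml() -> str:
    """Return a small SSML document with two manual pronunciation hints."""
    speak = ET.Element(
        "speak", {"xmlns": SSML_NS, "version": "1.1", "xml:lang": "en-GB"}
    )
    speak.text = "The probe was "
    # <sub> swaps the written form for a spoken alias.
    sub = ET.SubElement(speak, "sub", {"alias": "12 metres"})
    sub.text = "12m"
    sub.tail = " away when we began to "
    # <phoneme> spells out the verb's stress pattern (second syllable).
    # The IPA transcription is an assumed British English reading.
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": "rɪˈkɔːd"})
    phoneme.text = "record"
    phoneme.tail = " the signal."
    return ET.tostring(speak, encoding="unicode")

ssml = build_ssml()
print(ssml)
```

The resulting string is what you would hand to an engine that accepts SSML; as noted above, that currently means a TTS backend rather than an epub reading app.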

Simons Mith · 11-08-2021, 12:04 PM

Ah, thank you. Not so much a sleeping dog to be let lie, as a puppy whose eyes haven't opened yet. I'll not worry about it for now, but maybe revisit in a couple of years.

Tex2002ans · 11-12-2021, 10:58 PM

Quote:

Originally Posted by Simons Mith

Is there a way to tell the text-to-speech reader how to pronounce tricky stuff correctly?

[...] can I embed that information in the epub so that the users don't have to?

Yes* and no.

Quote:

Originally Posted by Doitsu

There's actually a W3C draft:

EPUB 3 Text-to-Speech Enhancements 1.0

[...]

However, AFAIK, there aren't any epub3 apps with SSML support.

The EPUB3 specs are based on "CSS3 Speech"... and like you said, no e-reader actually supports this.

Back in 2018, I emailed an ex-MR user (who now works for one of the largest Text-to-Speech companies) about this very topic.

About CSS3 Speech, he told me:

"On paper, nice, but everyone in the industry uses SSML 1.0 or 1.1. In statistical terms, very few will care about adding Speech CSS to their HTML5 documents."

If you're interested in Text-to-Speech (TTS), he also recommended the Interspeech conference. That's where a lot of the bleeding-edge information about parsing, processing, and generating the highest-quality speech comes from.

* * *

From what I recall, what would typically happen is:

1. You have a preprocessor which inputs the text/HTML, then converts it into SSML.

2 (Optional). You can add manual hints to the SSML, like adding:

specific pronunciations
- Like odd First/Last Names.
emphasis
- In HTML it's <em>, in SSML it's <emphasis>.
which language the text is in
- "tacos" in English ≠ "tacos" in Spanish
Male/Female (or Child/Adult) voice

+ parse the grammar,* so you can tell differences between:

The Polish army invaded the country over stolen shoe polish.
- "Polish" (the country) vs. "polish" (like 'polishing a shoe')
100 m
- "100 meters" vs. "100 'EM'"
2x × y / (3z + 4)
- "2 'EX' 'TIMES' 'WHY' 'OVER' 3 'ZEE' 'PLUS' 4"
I was smacked over the head with a 2×4.
- "I was smacked over the head with a 2 'BY' 4."

3. You feed that SSML into a TTS engine/backend, like:

Amazon Polly
Nuance Text-to-Speech
Microsoft Azure Text-to-Speech
Mycroft.ai
[...]

and they send you an audio file back.

4. You play the audio file.
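
Steps 1 and 2 above could be sketched roughly like this. It is a toy preprocessor, not how any real engine does it: the function name, the tiny unit table, and the regex are all assumptions made up for illustration, and it does not handle singular forms like "1 m".

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical, deliberately tiny unit table; a real front end would be far larger.
UNITS = {"km": "kilometres", "m": "metres", "kg": "kilograms"}

# A number (optionally with thousands separators or a decimal part),
# an optional space, then one of the known unit symbols as a whole word.
UNIT_RE = re.compile(r"\b(\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?)\s?(km|kg|m)\b")

def text_to_ssml(text: str) -> str:
    """Wrap number+unit pairs in <sub alias="..."> and return an SSML string."""
    parts = []
    pos = 0
    for match in UNIT_RE.finditer(text):
        number, unit = match.groups()
        # Copy the untouched text before this match, XML-escaped.
        parts.append(escape(text[pos:match.start()]))
        alias = f"{number} {UNITS[unit]}"
        parts.append(f"<sub alias={quoteattr(alias)}>{escape(match.group(0))}</sub>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "<speak>" + "".join(parts) + "</speak>"

print(text_to_ssml("It was 12m wide & 1,000 km long."))
```

Step 3 would then send that string to whichever backend you use; that part is left out because each service has its own API.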

* * *

Anyway, I haven't really done much research into it since 2018. Looks like I have quite a few years of conferences to catch up on.

For that step 2, a ton of the research goes into computing this stuff automatically.

So you can just feed it text, and "the cloud"/AI will figure out correct pronunciations + how to make it sound as realistic as possible:

"breathing"
"mood"
- Happy/Sad/Excited
pauses after punctuation
- and not pausing after period in a middle initial: "Monkey D. Luffy" + "P.T. Barnum"
emphasis
[...]
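
To give a feel for the "pauses after punctuation" item, here is a naive hand-written rule in Python. Modern engines learn this statistically rather than from a regex, so treat it purely as an illustration; the function name and both patterns are my own assumptions.

```python
import re

# Candidate boundary: sentence punctuation, whitespace, then an uppercase letter or quote.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")
# Text ending in one or more initials such as "D." or "P.T."
INITIAL_RE = re.compile(r"(?:\b[A-Z]\.)+$")

def split_sentences(text: str) -> list[str]:
    """Split text into sentences, skipping periods that follow initials."""
    sentences = []
    start = 0
    for match in BOUNDARY_RE.finditer(text):
        candidate = text[start:match.start()]
        if INITIAL_RE.search(candidate):
            continue  # "Monkey D." is not the end of a sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Monkey D. Luffy met P.T. Barnum. They talked."))
```

The rule breaks as soon as a sentence really does end in a single capital letter ("I got a B. Then I left."), which is exactly why this is done with trained models instead.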

Side Note: To see some of the amazing advances within the past few years... there was a mod for one of the best-selling games of 2015, "The Witcher 3".

Earlier this year, a user took the dozens of hours of audio from the game, fed it into a neural network, then used that to generate completely new dialogue via TTS:

Youtube: "[Witcher 3] New Quest MOD - A Night to Remember (trailer)"

Games can now use similar techniques to generate all sorts of languages + automatically lip-sync.

And movies and TV shows can also redub, having the "original actors" speaking fluently in the dubbed language. (I believe this has already been used in the Marvel superhero movies.)

Quote:

Originally Posted by Jellby

Some nitpicking: The correct typesetting is "1,000 km" and "12 m", with a non-breaking, non-stretching space if you want, but with a space.

The "no space before units" is an extremely common error. For more info, see Wikipedia: "International System of Units > Lexicographic conventions > General Rules".

Though if you feed this into TTS engines, they typically get units correct (space or no space).

Quoth · 11-13-2021, 08:15 AM

I was using a text to speech markup on DOS in 1991 for UI of an Ice Cream making machine running on a PC. To save money the audio DAC wasn't a sound card but eight resistors on the parallel port driving a cheap IC amp on one of the custom controller PCBs.

The issue with a book is that a human has to proof read the entire book adding the markup and the CSS route is horribly flawed.

There is such a disconnect between written & pronounced English, and so many exceptions to the rules, that really natural text to speech needs a separate file. Actually, spoken English isn't quite the same language as written English. Compare a play, TV, or film script (not Shakespeare) with a novelization. Or a radio soap with a narrated book. Obviously, if you just want narrated text, then some system of escaping words with the spoken version works. But now, is an actual audiobook better than that? And is simply doing NOTHING to the source text and leaving it up to a best-effort speech engine better than CSS speech extensions or SSML rules?
And people who are not visually impaired now use audiobooks, which was not the case from 1899 to 1979.

Tex2002ans · 11-13-2021, 12:52 PM

Quote:

Originally Posted by Quoth

There is such a disconnect between written & pronounced English, and so many exceptions to the rules, that really natural text to speech needs a separate file.

Just like grammar checking, you need a completely different level of parsing to break down words.

Language also changes over time, and new spellings/usages/accents/pronunciations constantly come into play.

Take this example:

The bow bowed back, then I shot across the bow. In awe, the servants bowed before me.

1 = bow, as in bow and arrow
2 = bowed, as in bending
3 = bow, as in a warning shot
4 = bowed, as in kneeling + lowering head

The first 2 are said with 'b' + "OH" sound.

The next 2 are said with 'b' + "OW" sound.

Another good example is:

The colonels popped kernels of popcorn in the microwave.

Both words are spoken exactly the same (in current-day English), but that's not how it always was.

For more information on this, I recommend the fantastic podcast, "Lexicon Valley" by John McWhorter.

Here's a few episodes covering:

Side Note: Just a few months ago, McWhorter handed the podcast off to two other people (so now the original podcast has confusingly been name-changed to "Spectacular Vernacular").

But you can find him at the new "Lexicon Valley":

https://www.booksmartstudios.org/s/lexicon-valley

Here's the first episode from the new version:

"English Has a Bee in Its Bonnet"

where he explains where the heck "bee" in "spelling bee" comes from. (And other fascinating stuff.)

Quote:

Originally Posted by Quoth

But now, is an actual audiobook better than that? And is simply doing NOTHING to the source text and leaving it up to a best-effort speech engine better than CSS speech extensions or SSML rules?

And people who are not visually impaired now use audiobooks, which was not the case from 1899 to 1979.

The better TTS engines/networks get, the better these things can do with plaintext input. (Toss some samples into Google's Cloud Text-to-Speech and see how it sounds.)

The fantastic thing about Text-to-Speech is you don't need a human middleman to read the stuff.

Without it, 99%+ of written text wouldn't be accessible to the blind: think bills/letters/flyers/boxes/cans + dynamically generated content (phone numbers, addresses, dates, names, $ amounts, auto-translated text).

And many times, there's very personal information inside—think texts between spouses or emails between friends. (Are blind people supposed to have zero privacy?)

One of the best talks I ever saw on this topic was from 2013:

Ron McCallum: "How Technology Allowed Me to Read" (TEDxSydney)

Definitely give it a listen.

Side Note: Personally, a lot of the journals/books I read are so obscure that there would never be a market for human-read audiobook versions. But with Text-to-Speech, I can listen to anything/everything while I work.

A "90% good" TTS version of the ebook is 100% better than 0% human-read.

And if you compare the quality of Android/Google's TTS vs. the robotic crap on Windows, it's pretty close to a human reading to me (besides wrongly pronouncing odd names, obscure words, and "bow" vs. "bow").

That high-quality, bleeding-edge TTS will trickle its way down into the OSes themselves, and if we stop back in another 10 years, you'll see all that breathing+mood+other enhancements make their way down to the free version sitting right inside your pocket.

And those that create ebooks can do their best to take reasonable measures with markup... like marking the proper language so "tacos" (English) + "tacos" (Spanish) can be pronounced correctly (at some near-future date!). That would be infinitely more helpful than manually trying to insert CSS Speech + you can actually benefit from language markup now.
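
As a small illustration of the language markup point, here is a Python sketch that lists which language each marked-up run of text declares. The sample XHTML and the function name are assumptions made up for the example; the only real convention it leans on is that XHTML content can carry both `lang` and `xml:lang` on an element.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def declared_languages(xhtml: str) -> list[tuple[str, str]]:
    """Return (language, text) pairs for every element that declares xml:lang."""
    root = ET.fromstring(xhtml)
    found = []
    for element in root.iter():
        lang = element.get(XML_LANG)
        if lang:
            found.append((lang, "".join(element.itertext()).strip()))
    return found

# Made-up sample: an English document with one Spanish phrase marked up.
SAMPLE = (
    f'<html xmlns="{XHTML_NS}" xml:lang="en" lang="en"><body>'
    '<p>We ordered <span xml:lang="es" lang="es">tacos al pastor</span> for lunch.</p>'
    "</body></html>"
)

for lang, text in declared_languages(SAMPLE):
    print(lang, "->", text)
```

A screen reader or TTS front end that respects this markup can switch voices for the Spanish span; one that ignores it loses nothing.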

Quoth · 11-13-2021, 01:52 PM

I'm not against Text to Speech, but arguing that for actual novels it's never quite good enough. Certainly Ray Kurzweil's late 1970s scan + OCR + speech synth can be on a smartphone now but those are ghastly for the blind or partially sighted. So scan + OCR + synth is sort of generally available.

Quote:

That high-quality, bleeding-edge TTS will trickle its way down into the OSes themselves, and if we stop back in another 10 years, you'll see all that breathing+mood+other enhancements make their way down to the free version sitting right inside your pocket.

But that's exactly what people were telling me in late 1980s to late 1990s. I don't hear much evidence that it's much better than state of the art then. The Kindle DX and speech pack for the Kindle PW3 are strangely poorer than 2002 Windows XP (built-in free option). I need to figure out Linux and Android Text to Speech for my friend who is now almost blind with Macular Degeneration. Except Covid has meant his ex Nurse wife has pulled up the drawbridge!

Recognition has been slower and gone backwards, needing always on Internet.

The single chip synthesisers for speech were far behind state-of-the-art and rubbish with ordinary text. You needed a crafted file.
Something using someone else's server isn't a solution. Google's AI is also dumb pattern matching using misappropriated information. They have no AI.

Quote:

And those that create ebooks can do their best to take reasonable measures with markup... like marking the proper language so "tacos" (English) + "tacos" (Spanish) can be pronounced correctly (at some near-future date!). That would be infinitely more helpful than manually trying to insert CSS Speech + you can actually benefit from language markup now.

Absolutely!

But Audio books have a problem too.

English isn't spoken the same everywhere in Ireland or UK, nor across the USA. Or SA, Canada, Australia.

Best practice is use the speech patterns and accent the author intends OR to use the local dialect? Which?
Narration is hard work and needs skill. A sample in my sig.

Tex2002ans · 11-13-2021, 02:55 PM

Quote:

Originally Posted by Quoth

But that's exactly what people were telling me in late 1980s to late 1990s. I don't hear much evidence that it's much better than state of the art then.

Again, this is absurd. (And I think we had this conversation years ago.)

Listen to that Witcher 3 video above. It sounds near-exact to the actual actor. That isn't him speaking in the video, it's the TTS-trained-on-his-voice.

Compare to the actual voice actor in the game.

Side Note: Another cool thing, using this narrowly refined/trained TTS, is obscure words/terms/pronunciations are automatically correct as well.

Like Geralt's name is actually pronounced with the "G" sound + the accent is on the second syllable:

- geh-RALT
--- "ralt" like "salt"

Not like Gerald:

- JEH-ruld

Ciri (one of the character's names) will be spoken like:

- See-ree

not like:

- Ky-ree

"Kaer Morhen" (made-up place within the books). Well, it'll be spoken just like the game.

All the modder had to do was feed it the text, and the neural network took it from there.

Quote:

Originally Posted by Quoth

I need to figure out Linux and Android Text to Speech for my friend who is now almost blind with Macular Degeneration.

PocketBook Reader is what I use on Android to read EPUBs. You can just press the TTS button, and it'll speak using the built-in Android TTS.

On Android OS itself, you'd enable TalkBack... but that takes over the full functionality of the phone. If you want to see some of that, see the recent Techmoan video: "An app that sees for those who can’t", especially at 19:04 where he covers TalkBack (and its iOS equivalent).

Quote:

Originally Posted by Quoth

Recognition has been slower and gone backwards, needing always on Internet.

No. Google Text-to-Speech is all on-device. No internet needed.

What happens is you may need internet + "the cloud"... if you want much more accurate speech (like you've been bringing up). But that's only because the computing power needed is enormous (sucking up a cellphone's battery for example) + the amount of data needed is staggering.

For more technical information on that, see Computerphile's fantastic video: "GPT3: An Even Bigger Language Model".

For example, GPT3 was trained on roughly 570 GBs of text:

"How Large Language Models Will Transform Science, Society, and AI" (Stanford University)

Can you fit that on your cellphone? Will you spend enough CPU power on your cellphone, and wait around minutes/hours, trying to generate that audio? (It sure as hell won't happen in real-time or at a speed you'd like.)

Or, you can use the Google Text-to-Speech built into Android, around 250 MBs, and get yourself 95% of the way there in real-time.

Quote:

Originally Posted by Quoth

Something using someone else's server isn't a solution. Google's AI is also dumb pattern matching using misappropriated information. They have no AI.

Please, just stop. You're beginning to embarrass yourself.

Check out some of those videos if you're interested. Perhaps take a look at the past 30 years of advancements in the field.

I even just showed you enormous strides taken within the past 5!

Quote:

Originally Posted by Quoth

Best practice is use the speech patterns and accent the author intends OR to use the local dialect? Which?
Narration is hard work and needs skill. A sample in my sig.

But the TTS is getting you "good enough".

It also won't make professionally-produced audiobooks go away, but it's tackling completely different use-cases (or many things that would never be economically viable to produce in the first place... like the rare journal articles).

TTS is a much larger category—and books are just a small subset. (Completely dwarfed by the sheer amount of non-book content like forum posts, emails, documents, etc.)

Quoth · 11-14-2021, 09:06 AM

Quote:

PocketBook Reader is what I use on Android to read EPUBs. You can just press the TTS button, and it'll speak using the built-in Android TTS.

He doesn't read ebooks at all.

Quote:

Please, just stop. You're beginning to embarrass yourself.

Check out some of those videos if you're interested. Perhaps take a look at the past 30 years of advancements in the field.

I even just showed you enormous strides taken within the past 5!

Smoke and mirrors. Don't look at the man behind the curtain. All the terms in marketing and description of so-called AI are totally misleading.

Simons Mith · 11-19-2021, 06:24 PM

[extra info] Interesting. Thanks all.

I can add a somewhat related datum from some other work I did some time back.

This was another project entirely, about 5-7 years ago, and even the RNIB couldn't give a straight answer on the user preferences that I was trying to find out about. I eventually got some answers from actual users on a 'Blind' forum on Reddit. Afraid I don't remember exactly where I asked now, sorry.

I'm not sure how true /this/ aspect is any more, but back then there were two schools of preference for audio books. Some people (more than half, but not overwhelmingly more) liked them read by a real person. Some people (a non-negligible minority) actually liked a robot voice. And they liked the robot voice to be as flavourless as possible, and they would dial it up to 200% speed (or faster, with practice) so that they could listen to an audiobook at super-speed.

An interesting piece of feedback I got back then was that there were never enough different sound fonts (and probably never would be) and that the minority of people who did like robo-voices would tend to find one voice they liked and stick with it. They would have found the audio flourishes that I was wondering about adding to be extremely annoying. [Basically because they reached a state of flow where they stopped noticing the robo-voice altogether and could concentrate entirely on the text. Meddling with the voices in any way (e.g. raising the pitch or speed for a child's voice, using different voices for different characters in dialogue, any of that kind of stuff) would have broken the flow for them.] In principle the CSS was already defined to do those things, even back then, but no-one had implemented it, and the reason no one was rushing to implement it was because there was no discernible demand for it to /be/ implemented. I'd bet this is still true today.

So while getting realistic voices that can emote and act may be a laudable goal, there's also a buncha people who like their robo-voice to be as bland as possible so that it doesn't intrude between them and the text.

To be clear, adding character to robo-voices is a different objective from just getting them to pick the right pronunciation of dove, dove, bow, bow, either, either, potato, potato and so on, but the technology is related. We probably won't get either until the technology has advanced to the point where we can get both together.

Tex2002ans · 11-19-2021, 08:30 PM

Quote:

Originally Posted by Simons Mith

[...] the user preferences that I was trying to find out about. I eventually got some answers from actual users on a 'Blind' forum on Reddit. [...]

I'm not sure how true /this/ aspect is any more, but back then there were two schools of preference for audio books. Some people (more than half, but not overwhelmingly more) liked them read by a real person. Some people (a non-negligible minority) actually liked a robot voice. And they liked the robot voice to be as flavourless as possible, and they would dial it up to 200% speed (or faster, with practice) so that they could listen to an audiobook at super-speed.

On Audio Speed

Yes. Over the past few years, I've slowly ramped up my audio speed.

And the more I'm used to the voice, the faster I can go.

When I first started listening to podcasts (and TTS), I bumped myself up to 1.2x speed.

My thinking was: "I can get 20% more productivity out of this." (Listening to the same stuff in ~83% of the time OR listening to 20% more material.)

Once I got used to that, I settled on 1.3x for the longest time.

After about a year, I quickly ramped up to 1.6x->2x and beyond. (Now, I listen to most audio+video at 2.5x–3x.)

Another enhancement I've done is "cut the silence".

~33% of all speaking is completely dead air (breathing, thinking, etc.). If you remove that from podcasts/lectures/videos, you've also shaved off 33% of the time.

Take a 1 hour lecture as an example:

Code:

Speed   Time (mins)   Time (Remove Silence)
1       60            40
1.2     50            33.3
1.5     40            26.7
2       30            20

You'd take 1 hour to listen to the lecture, and I can finish it in 20 minutes.

Or another way of looking at it:

I can listen to 3 full lectures in the same time it would take for you to complete 1! (20+20+20 vs. 60)
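
The arithmetic behind that table is just length × (1 − silence) ÷ speed. A quick Python sketch, assuming the ~33% dead-air figure is exactly one third (the function name is made up for the example):

```python
def listening_minutes(length_min: float, speed: float, silence_fraction: float = 0.0) -> float:
    """Minutes needed to hear `length_min` of audio at `speed`, after cutting silence."""
    return length_min * (1.0 - silence_fraction) / speed

# Reproduce the table for a 60-minute lecture.
print("Speed  Time   Silence removed")
for speed in (1.0, 1.2, 1.5, 2.0):
    plain = listening_minutes(60, speed)
    trimmed = listening_minutes(60, speed, silence_fraction=1 / 3)
    print(f"{speed:<6} {plain:<6.1f} {trimmed:.1f}")
```

The real saving depends on how much silence a given recording actually contains, so the one-third figure is only a rule of thumb.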

* * *

On Overriding User-Defined Settings

The past few days, I was reading through lots of "CSS Speech" material (and watching those Interspeech talks).

I ran across this article:

2017: "Let's Talk About Speech" by Eric Bailey (CSS-Tricks.com)

which discussed how horrible the support for CSS Speech still is + "Just because you can, doesn’t mean you should".

It also referenced this fantastic article/chapter:

Chapter 11 from "Building Accessible Websites" (2002) by Joe Clark.

(Clark is the creator of CSS "Aural Stylesheets", which have since been deprecated in favor of "CSS Speech".)

I bolded the relevant section:

Quote:

Aural application

“Who’s gonna use this?” you ask. The answer is: Effectively no one.

Media stylesheets in general are poorly supported. Even a simple print stylesheet – for printed pages as opposed to screen display – will be ignored by certain browser versions (and some media-stylesheet combinations will crash our old friend, that carcinoma of the Web, Netscape 4).

We also face the issue of appropriateness of device. Remember the summary attribute of HTML tables? The W3C specification tells us unequivocally: “This attribute provides a summary of the table’s purpose and structure for user agents rendering to non-visual media such as speech and Braille.” It is not even a subject of debate whether or not a graphical browser should support summary. It must not do so, except inasmuch as such a browser has a speech or “non-visual” mode. (iCab on Macintosh can read Web pages aloud, and when it does so it reads the summary aloud, too.)

Why should graphical browsers support aural stylesheets?

Shouldn’t that support be hived off onto screen readers?

But those programs already offer a vast range of controls for vocal characteristics. To make a visual analogy, a low-vision person may find the graphical defaults chosen by Web authors mildly annoying and may set up browser defaults or a user stylesheet to override them. But if Web designers set up aural stylesheets that override a screen-reader user’s very-carefully-thought-out speech choices, honed over weeks and months of use, in favour of something you slapped together because you liked the idea of using Elmer Fudd’s voice to enunciate link text, the blind visitor may well end up far more than mildly annoyed.

It is a greater sin to mess with an individual blind visitor’s speech settings via ACSS than any sin you could imagine that affects low-vision or colourblind people. Annoying sounds are far more annoying than annoying images. Rejigging a user’s volume settings alone is more than enough to make you an enemy for life. Among other things, sound settings are harder to avoid: If you think a blackboard is ugly, you can look away, but you cannot look away from the sound of fingernails scratching a blackboard. If you dislike the appearance of a Website, you have a remarkable armamentarium at your disposal to reformulate that site’s visual rendering to your liking via user CSS. But if you’re stuck with somebody else’s voice and sound choices, you truly are stuck.

So much of this still holds perfectly true today as it did in 2002.

* * *

Another thing I've written about over the years, is:

"How do blind people (or Screen Readers) read actual HTML/code?"

Many Screen Readers then set manual overrides to make their own custom noises, like dings or bells, for things like italics/emphasis/lists (<i> + <em> + <li>).

And like the large quote above, overriding user customizations should be a cardinal sin! (Similar to those rotten websites that try to override/disable keyboard shortcuts!)

I'd also recommend checking out the recent:

DAISY Consortium: "Ways People with Print Disabilities Read" (September 2021)

And remember, it's not just "blind people" using audio, there are lots of low-vision (or normal) cases where a reader may be reading in completely alternate ways.

As an ebook designer... you want to mark your ebooks up with proper HTML (correct lang, <i> vs. <em>, Headings as <h1>–<h6>, [...]), but not get in the way of the user themselves.

Quote:

Originally Posted by Simons Mith

They would have found the audio flourishes that I was wondering about adding to be extremely annoying. [Basically because they reached a state of flow where they stopped noticing the robo-voice altogether and could concentrate entirely on the text. Meddling with the voices in any way (e.g. raising the pitch or speed for a child's voice, using different voices for different characters in dialogue, any of that kind of stuff) would have broken the flow for them.]

Yes. And because I've gotten acclimated to certain voices, I can listen to those faster than normal.

On Audio Voices

If I'm listening to a podcast, and they're interviewing someone with a very thick accent (or someone I'm not used to), I must slow down the audio (typically to 1.5x or 2x). Same if it's a female (they tend to speak higher pitched, so speeding up too fast gets very hard to understand).

If a book was flipflopping between my preferred voice, overriding my settings, etc., I too would probably get angry.

There may be a case for using CSS Speech to hint broad categories, like "Male vs. Female" OR "Male 1 vs. Male 2". Kind of like I wrote about in a 2017 sidenote while discussing JAWS + proper language markup...

But then the reality of an ebook designer marking this stuff up in that detail at the sentence-level (and doing it properly)... very slim to none. (Also see large "knowledge gap" quote below.)

Quote:

Originally Posted by Simons Mith

In principle the CSS was already defined to do those things, even back then, but no-one had implemented it, and the reason no one was rushing to implement it was because there was no discernible demand for it to /be/ implemented. I'd bet this is still true today.

Yes. And in that chapter I referenced, there was also this section:

Quote:

The knowledge gap

As everywhere in media access per se (think of captioning, audio description, subtitling, and dubbing), even if we enjoyed a flawlessly reliable technical infrastructure for aural stylesheets, how many working Web designers and developers would know how to write them?

You’re pretty handy in Photoshop, and you can even write all-CSS layouts. You’ve written entire back ends in SQL. Audio? You can handle audio, kind of. You’ve certainly ripped MP3s to compact disc. Now, though, your boss (or the World Wide Web Consortium, whichever is worse) wants you to craft computer voices, position them in three-dimensional space, and specify background music and tones for special components.

You simply don’t have that training. Nor should anyone expect you to have it. Nor is there anywhere you can get that training.

At the authorial level, aural stylesheets are a character in search of an author. Literally.

And I agree. I still think CSS is the completely wrong level to handle this.

You have the alternate level above, the "TTS engines", which will handle parsing + adding all that SSML automatically for you, etc. Those engines/networks can keep (and have been) getting better all the time.

Yes, perhaps in the future, there can be some reader with an easy-to-read/-manipulate (separate) file you can feed with a list of Proper Nouns + special pronunciations... but to clog up the HTML+CSS with all of that? No.
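
To make the "separate file" idea concrete, here is a rough Python sketch. The JSON layout is a made-up format, not an existing standard (the W3C does have a Pronunciation Lexicon Specification, which this does not attempt to follow), and the respellings are just the informal ones used earlier in this thread.

```python
import json
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon file contents: written form -> spoken respelling.
LEXICON_JSON = '{"Geralt": "geh-RALT", "Ciri": "See-ree"}'

def apply_lexicon(text: str, lexicon: dict[str, str]) -> str:
    """Wrap each lexicon word found in `text` in an SSML <sub alias="..."> element."""
    if not lexicon:
        return escape(text)
    # Longest entries first so that longer names win over their prefixes.
    words = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    parts, pos = [], 0
    for match in pattern.finditer(text):
        word = match.group(1)
        parts.append(escape(text[pos:match.start()]))
        parts.append(f"<sub alias={quoteattr(lexicon[word])}>{escape(word)}</sub>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)

lexicon = json.loads(LEXICON_JSON)
print(apply_lexicon("Geralt & Ciri rode out.", lexicon))
```

The HTML+CSS of the book stays untouched; only the reader (or a preprocessor in front of the TTS engine) would need to know about the lexicon.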

Quoth · 11-20-2021, 07:46 AM

^^^^
Great points.

fabien.benoit.19 · 12-07-2021, 09:42 AM

@Tex2002ans what tts software do you use actually?
@Simons Mith Did you manage to find a good choice for your task?

11-05-2021, 09:49 PM	#1
Simons Mith Member Posts: 20 Karma: 10 Join Date: Oct 2020 Device: none	Telling a text-to-speech reader how to pronounce things? I tried reading my self-written ebook using Windows text-to-speech and was quite impressed with how well it worked, even on made-up proper nouns. But it stumbled here and there. It correctly read 1,000km as "one thousand kilometres", for example, but 12m was read as one-two-m rather than "12 metres". Is there a way to embed the correct pronunciation for words like this, that the Calibre reader can use? I know I could rewrite that example as 12 metres, but there are other cases where that's not an option. Trickiest one I've noticed was a place where it used the wrong word stress for 'record', pronouncing it as the noun rather than the verb. Is there a way to tell the text-to-speech reader how to pronounce tricky stuff correctly? I'll put it in if it's easy to do. I know many TTS readers can be manually coded by the user with rules on how to pronounce unfamiliar words - can I embed that information in the epub so that the users don't have to?

11-06-2021, 03:46 AM	#2
Jellby frumious Bandersnatch Posts: 7,515 Karma: 18512745 Join Date: Jan 2008 Location: Spaniard in Sweden Device: Cybook Orizon, Kobo Aura	Some nitpicking: The correct typesetting is "1,000 km" and "12 m", with a non-breaking, non-stretching space if you want, but with a space. I don't know if that will help you though.

11-08-2021, 06:17 AM	#3
Simons Mith Member Posts: 20 Karma: 10 Join Date: Oct 2020 Device: none	I have found some references to VTML online [https://static.carahsoft.com/concret..._Language.pdf] but it smells rather proprietary to me. While that would let me do something like
Code:
<vtml_partofsp part="verb">record</vtml_partofsp>
to get 'record' pronounced as a verb, I have no idea how well it will work in general.
Fixing these mispronunciations certainly counts as nice-to-have rather than vital, and anyway they're commendably rare considering it's a sci-fi book, but is there a better way than vtml tags, which I only found out about yesterday?
I'm thinking again about the typography tweaks for the units. I find conventional spaces to be too wide for my tastes, but IME the various narrow spaces are less well supported. I don't want to get a little bit fancy and then have those annoying boxes appear because some lame reader doesn't know what a figure space is. OTOH my experience on rendering of custom spaces might be out of date now. Maybe they're reliably supported for the most part?

11-08-2021, 08:50 AM	#4
Doitsu Grand Sorcerer Posts: 5,583 Karma: 22735033 Join Date: Dec 2010 Device: Kindle PW2	There's actually a W3C draft: EPUB 3 Text-to-Speech Enhancements 1.0.
IVONA 2 Text-To-Speech has SSML support: "SSML Support in Ivona Text-To-Speech".
Some Microsoft SAPI voices also have limited SSML support: "Improve synthesis with Speech Synthesis Markup Language (SSML)".
However, AFAIK, there aren't any epub3 apps with SSML support.
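
For anyone wondering what SSML markup for the two problem cases in this thread might look like, here is a rough sketch. It uses the standard SSML `<sub alias="...">` element for "12m" and `<phoneme alphabet="ipa" ph="...">` for the verb "record". The sample sentences, the IPA transcription (a British-style rendering) and the helper `spoken_text` are illustrative assumptions, not anything from a real book or engine. Whether a particular voice honours these elements varies, so treat this as something to experiment with rather than a guaranteed fix. The Python part only checks that the markup is well-formed and approximates what an SSML-aware voice would read.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Illustrative sample: the sentences and the IPA string are assumptions.
SSML_SAMPLE = f"""<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-GB">
  The gap was <sub alias="twelve metres">12m</sub> wide.
  Please <phoneme alphabet="ipa" ph="rɪˈkɔːd">record</phoneme> the reading.
</speak>"""


def spoken_text(ssml: str) -> str:
    """Approximate the words an SSML-aware voice would read.

    <sub> contributes its alias; every other element contributes its text.
    Only handles the flat structure used in SSML_SAMPLE (hypothetical helper).
    """
    root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    parts = [root.text or ""]
    for child in root:
        if child.tag == f"{{{SSML_NS}}}sub":
            parts.append(child.get("alias", child.text or ""))
        else:
            parts.append(child.text or "")
        parts.append(child.tail or "")
    # Collapse the line breaks and indentation into single spaces.
    return " ".join("".join(parts).split())


print(spoken_text(SSML_SAMPLE))
```

Note that `<phoneme>` changes how "record" is pronounced, not which word is read, so it shows up unchanged in the text above.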

11-08-2021, 12:04 PM	#5
Simons Mith Member Posts: 20 Karma: 10 Join Date: Oct 2020 Device: none	Ah, thank you. Not so much a sleeping dog to be let lie, as a puppy whose eyes haven't opened yet. I'll not worry about it for now, but maybe revisit in a couple of years.

11-13-2021, 08:15 AM	#7
Quoth the rook, bossing Never. Posts: 11,096 Karma: 85874891 Join Date: Jun 2017 Location: Ireland Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11	I was using a text-to-speech markup on DOS in 1991 for the UI of an ice cream making machine running on a PC. To save money, the audio DAC wasn't a sound card but eight resistors on the parallel port driving a cheap IC amp on one of the custom controller PCBs.
The issue with a book is that a human has to proofread the entire book to add the markup, and the CSS route is horribly flawed. There is such a disconnect between written and pronounced English, and so many exceptions to the rules, that really natural text-to-speech needs a separate file. Actually, spoken English isn't quite the same language as written English. Compare a play, TV or film script (not Shakespeare) with its novelization, or a radio soap with a narrated book.
Obviously, if you just want narrated text, then some system of escaping words with the spoken version works. But is an actual audiobook now better than that? And is simply doing NOTHING to the source text and leaving it up to a best-effort speech engine better than CSS speech extensions or SSML rules? Also, people who are not visually impaired now use audiobooks, which was not the case from 1899 to 1979.
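
To make the "escape words with the spoken version" idea concrete, here is a minimal sketch of a preprocessing pass that rewrites number-plus-unit abbreviations before the text reaches a speech engine. The `UNIT_NAMES` table and the `expand_units` function are hypothetical names made up for this example; in practice the table could live in a separate per-book file. It only covers the "12m" kind of problem. A plain substitution like this cannot decide between the noun and verb readings of "record", because that depends on context.

```python
import re

# Hypothetical per-book lexicon: written unit -> spoken unit.
UNIT_NAMES = {"m": "metres", "km": "kilometres"}

# Longest abbreviations first so "km" is tried before "m".
_units = "|".join(sorted(map(re.escape, UNIT_NAMES), key=len, reverse=True))

# A number (digits, optionally with thousands separators), an optional
# space, then a known unit that ends at a word boundary.
_UNIT_RE = re.compile(rf"\b(\d[\d,]*)\s?({_units})\b")


def expand_units(text: str) -> str:
    """Replace forms like '12m' or '1,000 km' with their spoken versions."""
    return _UNIT_RE.sub(
        lambda m: f"{m.group(1)} {UNIT_NAMES[m.group(2)]}", text
    )


print(expand_units("It fell 12m, then travelled 1,000km."))
# prints: It fell 12 metres, then travelled 1,000 kilometres.
```

The word-boundary check means something like "12mm" is left alone rather than being half-expanded, since "mm" is not in the table.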

11-19-2021, 06:24 PM	#12
Simons Mith Member Posts: 20 Karma: 10 Join Date: Oct 2020 Device: none	[extra info] Interesting. Thanks all. I can add a somewhat related datum from some other work I did some time back. This was another project entirely, about 5-7 years ago, and even the RNIB couldn't give a straight answer on the user preferences that I was trying to find out about. I eventually got some answers from actual users on a 'Blind' forum on Reddit. Afraid I don't remember exactly where I asked now, sorry.
I'm not sure how true /this/ aspect is any more, but back then there were two schools of preference for audio books. Some people (more than half, but not overwhelmingly more) liked them read by a real person. Some people (a non-negligible minority) actually liked a robot voice. And they liked the robot voice to be as flavourless as possible, and they would dial it up to 200% speed (or faster, with practice) so that they could listen to an audiobook at super-speed.
An interesting piece of feedback I got back then was that there were never enough different sound fonts (and probably never would be) and that the minority of people who did like robo-voices would tend to find one voice they liked and stick with it. They would have found the audio flourishes that I was wondering about adding to be extremely annoying. [Basically because they reached a state of flow where they stopped noticing the robo-voice altogether and could concentrate entirely on the text. Meddling with the voices in any way (e.g. raising the pitch or speed for a child's voice, using different voices for different characters in dialogue, any of that kind of stuff) would have broken the flow for them.]
In principle the CSS was already defined to do those things, even back then, but no-one had implemented it, and the reason no one was rushing to implement it was because there was no discernible demand for it to /be/ implemented. I'd bet this is still true today.
So while getting realistic voices that can emote and act may be a laudable goal, there's also a buncha people who like their robo-voice to be as bland as possible so that it doesn't intrude between them and the text. To be clear, adding character to robo-voices is a different objective from just getting them to pick the right pronunciation of dove, dove, bow, bow, either, either, potato, potato and so on, but the technology is related. We probably won't get either until the technology has advanced to the point where we can get both together.

11-20-2021, 07:46 AM	#14
Quoth the rook, bossing Never. Posts: 11,096 Karma: 85874891 Join Date: Jun 2017 Location: Ireland Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11	^^^^ Great points.

12-07-2021, 09:42 AM	#15
fabien.benoit.19 Junior Member Posts: 7 Karma: 10 Join Date: Dec 2021 Location: Minsk, Belarus Device: none	@Tex2002ans What TTS software do you actually use?
@Simons Mith Did you manage to find a good option for your task?
