Quote:
Originally Posted by Simons Mith
Is there a way to tell the text-to-speech reader how to pronounce tricky stuff correctly?
[...] can I embed that information in the epub so that the users don't have to?
|
Yes* and no.
Quote:
Originally Posted by Doitsu
|
The EPUB3 specs are based on "CSS3 Speech"... and like you said, no e-reader actually supports this.
Back in 2018, I emailed an ex-MR user (who now works for one of the largest Text-to-Speech companies) about this very topic.
About CSS3 Speech, he told me:
"On paper, nice, but everyone in the industry uses SSML 1.0 or 1.1. In statistical terms, very few will care about adding Speech CSS to their HTML5 documents."
If you're interested in Text-to-Speech (TTS), he also recommended the
Interspeech conference. That's where a lot of the bleeding-edge information about parsing, processing, and generating the highest-quality speech gets presented.
* * *
From what I recall, what typically happens is:
1. You have a preprocessor which inputs the text/HTML, then converts it into SSML.
2. (Optional) You can add manual hints to the SSML, like adding:
   - specific pronunciations
     - like odd First/Last Names
   - emphasis
     - in HTML it's <em>, in SSML it's <emphasis>
   - which language the text is in
     - "tacos" in English ≠ "tacos" in Spanish
   - Male/Female (or Child/Adult) voice
   + parse the grammar,* so you can tell the difference between:
     - The Polish army invaded the country over stolen shoe polish.
       - "Polish" (the nationality) vs. "polish" (like 'polishing a shoe')
     - 100 m
       - "100 meters" vs. "100 'EM'"
     - 2x × y / (3z + 4)
       - "2 'EX' 'TIMES' 'WHY' 'OVER' 3 'ZEE' 'PLUS' 4"
     - I was smacked over the head with a 2×4.
       - "I was smacked over the head with a 2 'BY' 4."
   (For a concrete sketch of these hints, see the Python snippet right after this list.)
3. You feed that SSML into a TTS engine/backend, like:
- Amazon Polly
- Nuance Text-to-Speech
- Microsoft Azure Text-to-Speech
- Mycroft.ai
- [...]
and they send you an audio file back.
4. You play the audio file.
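To make steps 2-4 concrete, here's a minimal sketch in Python, assuming Amazon Polly via boto3 (any of the other backends would look broadly similar, each with its own SSML subset). The specific hints (<phoneme>, <emphasis>, <lang>, <sub>), the voice "Joanna", and the IPA strings are just illustrative picks, not the only options:

Code:
# Minimal sketch of steps 2-4, assuming Amazon Polly via boto3.
# Assumes AWS credentials + region are already configured.
import boto3

# Step 2: SSML with a few manual hints (pronunciation, emphasis, language, unit).
ssml = """
<speak>
  My name is <phoneme alphabet="ipa" ph="ˈʃɔːn">Sean</phoneme>.
  I <emphasis level="strong">really</emphasis> like
  <lang xml:lang="es-MX">tacos</lang>.
  The race was <sub alias="one hundred meters">100 m</sub> long.
</speak>
"""

polly = boto3.client("polly")              # step 3: the TTS backend
response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",                       # tell the engine this is SSML, not plain text
    VoiceId="Joanna",                      # one of Polly's built-in voices
    OutputFormat="mp3",
)

with open("sample.mp3", "wb") as f:        # step 4: save the audio, then play it
    f.write(response["AudioStream"].read())

The other engines accept roughly the same SSML (each supports a slightly different subset of the tags), so mostly it's only the client call that changes.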
* * *
Anyway, I haven't really done much research into it since 2018. Looks like I have quite a few years of conferences to catch up on.
As for step 2, a ton of the research goes into computing those hints automatically.
So you can just feed it plain text, and "the cloud"/AI will figure out the correct pronunciations + how to make it sound as realistic as possible:
- "breathing"
- "mood"
- pauses after punctuation
  - and not pausing after the period in a middle initial: "Monkey D. Luffy" + "P.T. Barnum"
- emphasis
- [...]
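In other words, with a modern backend the "hello world" version of the pipeline doesn't even need SSML. A hedged sketch, again assuming Polly/boto3: you hand over plain text and leave the pauses, initials, and units to the engine.

Code:
import boto3

# Plain text, no SSML hints; the engine is left to work out on its own
# that the period in "Monkey D. Luffy" is a middle initial, not a sentence end.
text = "Monkey D. Luffy and P.T. Barnum walked the last 100 m together."

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text=text,
    TextType="text",      # the default; shown here for contrast with the SSML version
    VoiceId="Joanna",
    OutputFormat="mp3",
)

with open("plain.mp3", "wb") as f:
    f.write(response["AudioStream"].read())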
Side Note: To see some of the amazing advances within the past few years... there was a mod for one of the best-selling games of 2015, "The Witcher 3".
Earlier this year, a user took the dozens of hours of audio from the game, fed it into a neural network, then used that to generate completely new dialogue via TTS:
Youtube: "[Witcher 3] New Quest MOD - A Night to Remember (trailer)"
Games can now use similar techniques to generate dialogue in all sorts of languages + automatically lip-sync.
And movies and TV shows can be redubbed with the "original actors" speaking fluently in the dubbed language. (I believe this has already been used in the Marvel superhero movies.)
Quote:
Originally Posted by Jellby
Some nitpicking: The correct typesetting is "1,000 km" and "12 m", with a non-breaking, non-stretching space if you want, but with a space.
|

The "no space before units" is an extremely common error. For more info, see
Wikipedia: "International System of Units > Lexicographic conventions > General Rules".
Though if you feed this into TTS engines, they typically get units correct (space or no space).
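And if you want to enforce that spacing in the EPUB source itself, here's a quick-and-dirty sketch (the helper name and the unit list are just examples I made up, not a standard tool):

Code:
import re

NBSP = "\u00A0"   # non-breaking space, so "1,000 km" never wraps between number and unit
UNITS = r"(?:km|cm|mm|m|kg|g|s|Hz)"

# Hypothetical helper: put a single non-breaking space between a number and a unit,
# whether the source had a normal space or no space at all.
def fix_unit_spacing(text: str) -> str:
    return re.sub(r"(\d)\s*(" + UNITS + r")\b", r"\1" + NBSP + r"\2", text)

print(fix_unit_spacing("The summit is 1,000km away and 12 m higher."))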