MobileRead Forums - View Single Post - Telling a text-to-speech reader how to pronounce things?

Tex2002ans · 11-13-2021, 12:52 PM

Quote:

Originally Posted by Quoth

There is such a disconnect written & pronounced and so many exceptions to rules that really natural text to speech needs a separate file.

Just like grammar checking, you need a completely different level of parsing to break down words.

Language also changes over time, and new spellings/usages/accents/pronunciations constantly come into play.

Take this example:

The bow bowed back, then I shot across the bow. In awe, the servants bowed before me.

1 = bow, as in bow and arrow
2 = bowed, as in bending
3 = bow, as in a warning shot
4 = bowed, as in kneeling + lowering head

The first 2 are said with 'b' + "OH" sound.

The next 2 are said with 'b' + "OW" sound.
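Side Note: If you did want to spell this out for a speech engine by hand, SSML has a phoneme element for exactly this. Here is a minimal Python sketch, assuming you (or some smarter parser) have already decided which sense each "bow" is. The IPA table, the "OH"/"OW" sense labels, and the to_ssml helper are all made up for illustration, and how well a given engine honors phoneme is something you'd have to test:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lookup: (spelling, sense label) -> IPA transcription.
# The "OH"/"OW" labels just mirror the two sounds described above.
IPA = {
    ("bow", "OH"): "boʊ",
    ("bowed", "OH"): "boʊd",
    ("bow", "OW"): "baʊ",
    ("bowed", "OW"): "baʊd",
}

def to_ssml(tokens):
    """Build an SSML string from plain strings and (word, sense) tuples."""
    parts = []
    for tok in tokens:
        if isinstance(tok, tuple):
            word, sense = tok
            ph = IPA[(word.lower(), sense)]
            # Wrap the ambiguous word in an SSML <phoneme> element.
            parts.append(
                f'<phoneme alphabet="ipa" ph={quoteattr(ph)}>{escape(word)}</phoneme>'
            )
        else:
            parts.append(escape(tok))
    return "<speak>" + "".join(parts) + "</speak>"

# The example sentence, with each homograph tagged by hand.
tokens = [
    "The ", ("bow", "OH"), " ", ("bowed", "OH"),
    " back, then I shot across the ", ("bow", "OW"),
    ". In awe, the servants ", ("bowed", "OW"), " before me.",
]
ssml = to_ssml(tokens)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

The hard part is not generating the tags, it's the hand-tagging step: something has to know which "bow" is which before any markup can help.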

Another good example is:

The colonels popped kernels of popcorn in the microwave.

Both words are spoken exactly the same (in current-day English), but that's not how it always was.

For more information on this, I recommend the fantastic podcast, "Lexicon Valley" by John McWhorter.

Here's a few episodes covering:

Side Note: Just a few months ago, McWhorter handed the podcast off to two other people (so now the original podcast has confusingly been renamed "Spectacular Vernacular").

But you can find him at the new "Lexicon Valley":

https://www.booksmartstudios.org/s/lexicon-valley

Here's the first episode from the new version:

"English Has a Bee in Its Bonnet"

where he explains where the heck "bee" in "spelling bee" comes from. (And other fascinating stuff.)

Quote:

Originally Posted by Quoth

But now is an actual audio book better than that and simply doing NOTHING to the source text and leaving it up to a best effort speech engine better than CSS speech extensions or SSML rules?

And people not visually impaired now use audio books which was not the case 1899 to 1979.

The better TTS engines/networks get, the better these things can do with plaintext input. (Toss some samples into Google's Cloud Text-to-Speech and see how it sounds.)

The fantastic thing about Text-to-Speech is you don't need a human middleman to read the stuff.

Without it, 99%+ of written text wouldn't be accessible to the blind: think bills/letters/flyers/boxes/cans + dynamically generated content (phone numbers, addresses, dates, names, $ amounts, auto-translated text).

And many times, there's very personal information inside—think texts between spouses or emails between friends. (Are blind people supposed to have zero privacy?)

One of the best talks I ever saw on this topic was from 2013:

Ron McCallum: "How Technology Allowed Me to Read" (TEDxSydney)

Definitely give it a listen.

Side Note: Personally, a lot of the journals/books I read are so obscure that there would never be a market for human-read audiobook versions. But with Text-to-Speech, I can listen to anything/everything while I work.

A "90% good" TTS version of the ebook is 100% better than 0% human-read.

And if you compare the quality of Android/Google's TTS vs. the robotic crap on Windows, it's pretty close to a human reading to me (besides wrongly pronouncing odd names, obscure words, and "bow" vs. "bow").

That high-quality, bleeding-edge TTS will trickle its way down into the OSes themselves, and if we check back in another 10 years, you'll see all those breathing + mood + other enhancements make their way down to the free version sitting right inside your pocket.

And those who create ebooks can do their best to take reasonable measures with markup... like marking the proper language so "tacos" (English) + "tacos" (Spanish) can be pronounced correctly (at some near-future date!). That would be infinitely more helpful than manually trying to insert CSS Speech, and you can actually benefit from language markup now.
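To make the language-markup point concrete, here is a rough Python sketch of what a reading system could do with lang/xml:lang attributes: walk the XHTML and work out which language each run of text belongs to. The snippet and the text_runs helper are hypothetical, not taken from any real reader, and an actual TTS engine would do far more than this:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical ebook fragment: English paragraph, one Spanish word marked up.
SNIPPET = (
    '<p lang="en" xml:lang="en">We ordered tacos, but she called them '
    '<span lang="es" xml:lang="es">tacos</span>.</p>'
)

def text_runs(elem, inherited=None):
    """Yield (language, text) pairs; children inherit the parent's language."""
    lang = elem.get(XML_LANG) or elem.get("lang") or inherited
    if elem.text:
        yield lang, elem.text
    for child in elem:
        yield from text_runs(child, lang)
        # Text after a child element belongs to the parent's language again.
        if child.tail:
            yield lang, child.tail

runs = list(text_runs(ET.fromstring(SNIPPET)))
for lang, text in runs:
    print(f"[{lang}] {text!r}")
```

The spelling "tacos" is identical in both runs; the only thing that distinguishes them is the markup, which is why tagging the language is worth the small effort.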