MobileRead Forums - View Single Post - Telling a text-to-speech reader how to pronounce things?

Simons Mith · 11-19-2021, 06:24 PM

[extra info] Interesting. Thanks all.

I can add a somewhat related datum from some other work I did some time back.

This was another project entirely, about 5-7 years ago, and even the RNIB couldn't give a straight answer on the user preferences that I was trying to find out about. I eventually got some answers from actual users on a 'Blind' forum on Reddit. Afraid I don't remember exactly where I asked now, sorry.

I'm not sure how true /this/ aspect is any more, but back then there were two schools of preference for audiobooks. Some people (more than half, but not overwhelmingly more) liked them read by a real person. Some people (a non-negligible minority) actually liked a robot voice. And they liked the robot voice to be as flavourless as possible, and they would dial it up to 200% speed (or faster, with practice) so that they could listen to an audiobook at super-speed.

An interesting piece of feedback I got back then was that there were never enough different sound fonts (and probably never would be) and that the minority of people who did like robo-voices would tend to find one voice they liked and stick with it. They would have found the audio flourishes that I was wondering about adding to be extremely annoying. [Basically because they reached a state of flow where they stopped noticing the robo-voice altogether and could concentrate entirely on the text. Meddling with the voices in any way (e.g. raising the pitch or speed for a child's voice, using different voices for different characters in dialogue, any of that kind of stuff) would have broken the flow for them.] In principle the CSS was already defined to do those things, even back then, but no one had implemented it, and no one was rushing to implement it because there was no discernible demand for it to /be/ implemented. I'd bet this is still true today.

So while getting realistic voices that can emote and act may be a laudable goal, there's also a buncha people who like their robo-voice to be as bland as possible so that it doesn't intrude between them and the text.

To be clear, adding character to robo-voices is a different objective from just getting them to pick the right pronunciation of dove, dove, bow, bow, either, either, potato, potato and so on, but the technology is related. We probably won't get either until the technology has advanced to the point where we can get both together.
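
For anyone wondering what "picking the right pronunciation" looks like when you do it by hand, here is a rough sketch in Python. It wraps ambiguous words in the SSML `<phoneme>` element, which is a real W3C element. Everything else is made up for illustration: the tiny lexicon, the sense labels and the function names are all hypothetical. Whether any particular reading system or TTS engine honours the markup is also an assumption you would have to check. Note that the sketch does not guess the sense; a human (or some smarter tool) still has to say which "dove" is meant.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical hand-made lexicon: (word, sense) -> IPA transcription.
# The sense labels are invented for this example.
LEXICON = {
    ("dove", "bird"): "dʌv",
    ("dove", "dived"): "doʊv",
    ("bow", "weapon"): "boʊ",
    ("bow", "bend"): "baʊ",
}

def phoneme(word, sense):
    """Wrap one word in an SSML <phoneme> element for the chosen sense."""
    ipa = LEXICON[(word.lower(), sense)]
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'

def to_ssml(tokens):
    """Build an SSML string from plain strings and (word, sense) tuples."""
    parts = [phoneme(*t) if isinstance(t, tuple) else escape(t) for t in tokens]
    return "<speak>" + " ".join(parts) + "</speak>"

ssml = to_ssml(["The", ("dove", "bird"), ("dove", "dived"), "into the bushes."])
print(ssml)
```

The point of keeping the sense as an explicit input is that the markup only carries the answer; deciding which reading is correct is the hard part the post is talking about.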
