@Markismus: Looked into Polly; there's no way to run it on your own. Since I have an itch to produce a decent TTS for Kindle/PB, I'll just abuse this topic for some brainstorming.
First, some intro to the problem. All TTS systems have a backend and a frontend part:
In the case of Polly, the backend (on the left and center) runs on Amazon servers, and what you get locally is an SSML/spectra annotation for the vocoder (on the right), which runs on the cheap, directly in real time, on the Echo. The data received is almost as compact as the original text, just annotated with vocaloid-style stress marks on how to pronounce it.
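As a rough illustration of that annotated-text idea (entirely hypothetical; Polly's internal wire format isn't public), SSML-style markup carries phoneme/stress hints while staying in the same size ballpark as the source text:

```python
import xml.etree.ElementTree as ET

# Hypothetical illustration: the real Polly wire format is not public,
# but SSML-style markup shows the idea -- plain text plus pronunciation
# hints, staying close to the size of the source text.
text = "I read the book."
ssml = ('<speak>I <phoneme alphabet="ipa" ph="rɛd">read</phoneme> '
        '<prosody rate="95%">the book.</prosody></speak>')

root = ET.fromstring(ssml)        # parses: the annotation is well-formed markup
overhead = len(ssml) / len(text)  # rough blow-up factor of annotation vs text
print(root.tag, round(overhead, 1))
```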
Tacotron as done by Google Assistant also has a Polly-like architecture, but what goes on between backend and frontend is vastly different: it's more akin to nerve impulses for vocal cords. The Tacotron predictor runs on Google servers and generates mel spectra for a WaveNet vocoder running on the phone. The spectrum is basically already audible voice, just at very "poor recording" quality; it sounds like a really badly compressed MP3 of an otherwise nice human voice. The frontend for this weird thing is the WaveNet vocoder, pretty much an NN image-enhancing algorithm applied to sound. You may notice how Google Assistant turns uber-robotic, like Kindle, PocketBook or Siri TTS; that's when your phone goes offline all of a sudden and it falls back to a simple and fast rule-based phoneme-to-speech engine such as
picotts, once it can no longer receive Tacotron backend data from the server. However, poking at the guts of Assistant is still encouraging, namely:
1) The WaveNet frontend vocoder runs locally and is blazing fast (supposedly 20x realtime).
2) Unoptimized Tacotron 2 itself, in a limited domain (one model with a single voice and style), can run in near real time on a CPU (on a PC, at least), judging from a few implementations on GitHub. The limiting factor of the open-source stuff seems to be the frontend WaveNet vocoder, but a really fast proprietary implementation of it exists, so it's just an engineering issue.
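To put numbers on why the mel spectrum sounds like a "poor recording" and why the vocoder is the heavy part, here's the data-rate arithmetic, assuming the published Tacotron 2 analysis settings (80 mel bins, 12.5 ms frame hop) and 16 kHz mono PCM for comparison:

```python
# Back-of-envelope data rates, assuming Tacotron 2's published analysis
# settings (80 mel bins, 12.5 ms frame hop) vs 16 kHz mono PCM output.
mel_bins = 80
frame_hop_s = 0.0125
mel_vals_per_s = mel_bins / frame_hop_s          # 6400 coarse values per second

pcm_samples_per_s = 16000
upsampling = pcm_samples_per_s / mel_vals_per_s  # 2.5x more samples than inputs

# On top of the upsampling, the mel spectrum drops all phase information,
# so the vocoder has to invent that detail -- hence the "image enhancing
# applied to sound" analogy.
print(mel_vals_per_s, upsampling)
```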
If it turns out readers are not powerful enough to run the backend in the end (and since nobody does it in the personal-assistant space, it seems like it might be a problem), there's still the route of writing a calibre plugin that converts a 500 kB ebook into ~5 MB of vocoder data for 10 hours of audio. Definitely much more viable than 300 MB of MP3s.
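Those sizes work out roughly like this (the 10x annotation blow-up and the 64 kbps MP3 bitrate are my assumptions):

```python
# Back-of-envelope for a 10-hour audiobook, matching the figures above.
# The 10x annotation factor and 64 kbps MP3 bitrate are assumptions.
ebook_kb = 500
annotation_factor = 10
vocoder_mb = ebook_kb * annotation_factor / 1000  # 5.0 MB of vocoder data

hours = 10
mp3_kbps = 64
mp3_mb = mp3_kbps / 8 * 3600 * hours / 1000       # 288.0 MB of MP3

print(vocoder_mb, mp3_mb)
```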
EDIT: Looked into the NVIDIA stuff - nv-wavenet and WaveGlow are still WaveNet, but with a drastically different training procedure, so that the model can be evaluated in batches on a GPU without the need for serial autoregression. This is also the case for Parallel WaveNet used in Assistant (which avoids autoregression at the cost of making the model hard to keep stable). All these algorithms seem able to run 50-100 simultaneous channels on a single GPU. The question is whether, when scaled down to ARM NEON, it will achieve at least 1x realtime.
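The serial-vs-batched distinction in a toy numpy sketch: classic WaveNet generates sample t from sample t-1 in an unavoidable loop, while the WaveGlow/Parallel WaveNet trick is to evaluate one batched transform of a whole noise vector. The recurrence below is made up; only the dependency structure is the point:

```python
import numpy as np

# Toy contrast between serial and batched generation. The "model" here is a
# made-up linear recurrence; only the dependency structure matters.
rng = np.random.default_rng(0)
T = 1000
z = rng.standard_normal(T)   # noise / conditioning input
a, b = 0.9, 0.1              # toy weights

# WaveNet-style autoregression: sample t needs sample t-1, forcing a serial loop.
x_serial = np.zeros(T)
prev = 0.0
for t in range(T):
    x_serial[t] = a * prev + b * z[t]
    prev = x_serial[t]

# WaveGlow / Parallel WaveNet idea: rewrite the same mapping as one batched
# transform of the whole noise vector (here a truncated impulse response).
K = 100
kernel = b * a ** np.arange(K)
x_parallel = np.convolve(z, kernel)[:T]

# Both yield (up to truncation error) the same waveform, but the second is a
# single vectorized op -- the kind a GPU can run on many channels at once.
print(float(np.max(np.abs(x_serial - x_parallel))))
```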