Quote:
Originally Posted by arjaybe
Isn't text-to-speech these days using AI to read?
As somebody who has implemented a TTS system in the calibre reader: no, it isn't. What happens is that the text is converted to phonemes based on the language the text is in. The training data is thus a sequence of phonemes along with the waveform they generate. When you send text to the model to convert to speech, that text also gets converted to phonemes before being fed to the model.
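To make the text-to-phonemes step concrete, here is a minimal sketch in Python. It is not calibre's code: the tiny lexicon, the ARPAbet-style symbols, the "|" word-boundary marker and the function names are all assumptions made up for illustration. Real systems use a full pronunciation dictionary plus rules or a learned model for words they don't know.

```python
# Hypothetical toy lexicon mapping words to ARPAbet-style phonemes.
# A real front end would use a full dictionary and handle unknown words.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

WORD_BOUNDARY = "|"   # assumed marker between words
UNKNOWN = "<unk>"     # assumed placeholder for words not in the lexicon


def text_to_phonemes(text, lexicon=LEXICON):
    """Convert text to a flat list of phoneme symbols."""
    phonemes = []
    for raw_word in text.lower().split():
        word = raw_word.strip(".,!?;:")
        if not word:
            continue
        if phonemes:
            phonemes.append(WORD_BOUNDARY)
        phonemes.extend(lexicon.get(word, [UNKNOWN]))
    return phonemes


def phonemes_to_ids(phonemes, vocab):
    """Map phoneme symbols to the integer IDs a model would consume."""
    return [vocab[p] for p in phonemes]


# Build a small vocabulary from every symbol the sketch can produce.
symbols = sorted({p for ps in LEXICON.values() for p in ps} | {WORD_BOUNDARY, UNKNOWN})
VOCAB = {symbol: index for index, symbol in enumerate(symbols)}

if __name__ == "__main__":
    ph = text_to_phonemes("Hello, world!")
    print(ph)
    print(phonemes_to_ids(ph, VOCAB))
```

The same conversion would run both when preparing training pairs (phonemes plus waveform) and at synthesis time, so the model only ever sees phoneme IDs, never the raw text.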
As an aside, these TTS models run on exactly the same architecture as LLMs. Indeed, LLMs don't care whether they are being fed text or phonemes or pixel data or whatever; it's all just treated as sequences of bytes.
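A rough illustration of that point, again in Python and again only a sketch: the helper names and the tiny phoneme vocabulary are hypothetical, and production models usually use learned tokenizers rather than raw bytes. Whatever the input is, it ends up as a flat list of integers before the model sees it.

```python
def text_as_ids(text):
    """Text becomes its UTF-8 byte values."""
    return list(text.encode("utf-8"))


def phonemes_as_ids(phonemes, vocab):
    """Phoneme symbols become indices into a (hypothetical) vocabulary."""
    return [vocab[p] for p in phonemes]


def pixels_as_ids(rows):
    """A grayscale image (rows of 0-255 values) is flattened row by row."""
    return [value for row in rows for value in row]


if __name__ == "__main__":
    phoneme_vocab = {"HH": 0, "AY": 1}   # made-up two-symbol vocabulary
    print(text_as_ids("hi"))
    print(phonemes_as_ids(["HH", "AY"], phoneme_vocab))
    print(pixels_as_ids([[0, 255], [128, 64]]))
```

All three calls return a plain list of integers, which is the only thing a sequence model needs in order to be trained on them.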