

Old 11-30-2019, 08:57 AM   #1
Markismus
Guru
 
 
Posts: 895
Karma: 149877
Join Date: Jul 2013
Location: Netherlands
Device: Cracked HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
Understandability Text-to-speech

With the new PocketBook InkPad 3 Pro, I finally tried TTS again. The Bluetooth connects directly to my hearing aids, so I have a clear and loud sound. I tried Kendra (US) and Amy (UK).

However, I have trouble understanding the text-to-speech unless I concentrate fully on it, and I mean well beyond the concentration needed for writing this post or reading a book. I can't do simple jobs without losing large parts of the sentences. Talking it over, I heard that my mother had similar issues with TTS.
Did anyone else experience this?

Part of it is, of course, learning and exposure, and it will get easier with time. However, I already found Amy easier to follow than Kendra. Can anyone suggest voices that are easier to understand?
Old 11-30-2019, 03:59 PM   #2
ezdiy
Zealot
 
Posts: 121
Karma: 156515
Join Date: Oct 2019
Device: KT, KPW4, PB740-2
Your complaint is EXTREMELY COMMON with PRIMITIVE synthesis like IVONA. Yes, it's the SAME GIRL who reads Alexa. She sounds SO PRETTY, but there is BAD NEWS. Turns out Ivona is HOPELESS, as she never understood PROSODY. Probably because it was never MARKED EXPLICITLY in books like I'm doing it RIGHT NOW.



There are far more clever TTS systems; the recent ones are Polly and Tacotron. Tacotron has open-source implementations, including (reasonably useful) pretrained models; I think Polly is cloud-only or something. TTS engines on PocketBook are pluggable: each installed voice provides its own libttsengine.so exposing the ABI declared in https://github.com/blchinezu/pocketb...de/ttsengine.h

The reader then calls into that library when reading with the installed voice. If you were to implement a state-of-the-art TTS like Tacotron, you'd need to write this wrapper library to glue it together. Currently, the WaveNet synthesizer is research grade (you need to run the whole of TensorFlow to evaluate the model), so I'm not sure the PB would have enough horsepower to run it.
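To make the "pluggable voice" idea concrete, here is a minimal sketch of the pattern in Python. It is not the PocketBook ABI: the real entry points live in ttsengine.h and are not reproduced here, so every name below (VoiceEngine, SilenceEngine, Reader, synthesize) is hypothetical. It only shows how a reader can stay unchanged while the engine behind it is swapped.

```python
from typing import List, Protocol


class VoiceEngine(Protocol):
    """Hypothetical interface a voice plugin would expose (NOT the real ttsengine.h ABI)."""

    sample_rate: int

    def synthesize(self, text: str) -> bytes:
        """Return 16-bit mono PCM audio for the given text."""
        ...


class SilenceEngine:
    """Stand-in engine: emits a fixed amount of silence per character.

    A real wrapper would call into a rule-based or neural synthesizer here.
    """

    sample_rate = 16000
    samples_per_char = 960  # assumed pacing: 60 ms per character at 16 kHz

    def synthesize(self, text: str) -> bytes:
        n_samples = self.samples_per_char * len(text)
        return b"\x00\x00" * n_samples  # 2 bytes per 16-bit sample


class Reader:
    """The reading application only knows the interface, not the engine behind it."""

    def __init__(self, engine: VoiceEngine) -> None:
        self.engine = engine

    def read_aloud(self, sentences: List[str]) -> List[bytes]:
        return [self.engine.synthesize(s) for s in sentences]


if __name__ == "__main__":
    reader = Reader(SilenceEngine())
    for chunk in reader.read_aloud(["Hello.", "Second sentence."]):
        print(len(chunk), "bytes of PCM")
```

Swapping in a different voice then means passing another object with the same two members to Reader, which is the role the per-voice shared library plays on the device.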
Old 11-30-2019, 08:30 PM   #3
Tarana
Wizard
 
 
Posts: 3,968
Karma: 38840460
Join Date: Sep 2012
Location: Minneapolis
Device: PWSE, Voyage, K3, HDX, KBasic 7 & 8, Nook Glo3, Echos, Nanos
I use the text-to-speech both with my Kindle Keyboards and Alexa. Takes about 3 chapters to get into the cadence, but probably 2-3 books before it took no more effort to listen than with a live speaker. The text-to-speech on the Echo is better than what is on the Fire (which is a marked improvement over the Kindle Keyboard). It may also depend on what you listen to. Fantasy doesn't work so well due to all the weird names. Murder mysteries and cozies work pretty well.
Old 11-30-2019, 10:19 PM   #4
ezdiy
Zealot
 
Posts: 121
Karma: 156515
Join Date: Oct 2019
Device: KT, KPW4, PB740-2
Quote:
Originally Posted by Tarana
I use the text-to-speech both with my Kindle Keyboards and Alexa. Takes about 3 chapters to get into the cadence, but probably 2-3 books before it took no more effort to listen than with a live speaker. The text-to-speech on the Echo is better than what is on the Fire (which is a marked improvement over the Kindle Keyboard). It may also depend on what you listen to. Fantasy doesn't work so well due to all the weird names. Murder mysteries and cozies work pretty well.
If I remember correctly, Echo now uses Polly. It's not available on Kindles because part of the synthesis runs on Amazon's servers, and a Kindle is offline most of the time for battery life's sake. Note that Polly is not particularly suitable for books, because it is mainly designed for "robot newscaster" use, where the text itself is machine-produced, including SSML tags for prosody, breaths, etc.
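As a rough illustration of what "text with SSML tags" looks like, the sketch below turns plain text into SSML by treating ALL-CAPS words as emphasis and adding a pause after each sentence. The tags used (speak, emphasis, break) come from the W3C SSML specification; the capitalization heuristic, the 300 ms pause, and the function name to_ssml are assumptions made for the example, and whether a particular engine or voice honors each tag is not guaranteed.

```python
import re
from xml.sax.saxutils import escape


def to_ssml(text: str) -> str:
    """Wrap ALL-CAPS words in <emphasis> and add a short <break> after each sentence."""
    parts = []
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        words = []
        for word in sentence.split():
            core = word.strip(".,!?;:")
            if len(core) > 1 and core.isupper():
                # Lower-case the word so the engine does not spell it out letter by letter.
                words.append('<emphasis level="strong">' + escape(word.lower()) + "</emphasis>")
            else:
                words.append(escape(word))
        parts.append(" ".join(words) + '<break time="300ms"/>')
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    print(to_ssml("She sounds SO PRETTY. But wait."))
```

Books almost never carry this kind of markup, so a system that depends on it has to guess the prosody, which is the gap described above.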

Tacotron, on the other hand, is a "black box" algorithm. For it to read a certain genre well, you feed it audiobooks as source material, and it can learn prosody on its own. Even if it is fed a neutral, generalist corpus devoid of the "personality" typical of audiobook performers, the results are extremely lifelike - https://google.github.io/tacotron/pu...ion/index.html

Last edited by ezdiy; 11-30-2019 at 10:26 PM.
Old 12-01-2019, 10:23 AM   #5
Markismus
Guru
 
 
Posts: 895
Karma: 149877
Join Date: Jul 2013
Location: Netherlands
Device: Cracked HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
@ezdiy Listening to the audio samples and especially the failures of Tacotron2, I do realize that there is rather a lot of room for improvement!

It seems NVIDIA published a Tacotron 2 version without WaveNet. Would it be possible to couple it to a less computationally intensive synthesizer? They apparently have tensor cores dedicated to their WaveGlow synthesizer, so it seems infeasible to try to implement that on the PocketBook. Another possibility could be Mamah's implementation.

What about Polly? It seems Amazon asks for a subscription fee to use that. Are there ways around that? Or alternatives? How about using your own NAS as a server for the sound processing?

@Tarana Good to hear that there is a reasonably small learning curve. Too bad fantasy is harder. I already have problems with understanding names in real life (no context, just an unintelligible sound), so I'll probably never understand the TTS system.

Last edited by Markismus; 12-01-2019 at 03:23 PM.
Old 12-02-2019, 05:03 AM   #6
ezdiy
Zealot
 
Posts: 121
Karma: 156515
Join Date: Oct 2019
Device: KT, KPW4, PB740-2
@Markismus: I looked into Polly; there's no way to run it on your own. Since I have an itch to produce a decent TTS for Kindle/PB, I'll just abuse this topic for some brainstorming.

Some intro to the problem. All TTS systems have a backend and a frontend part:



In the case of Polly, the backend on the left and center runs on Amazon's servers, and what you get locally is an SSML/spectral annotation for the vocoder on the right, which runs cheaply and in real time directly on the Echo. The data received is almost as compact as the original text, just annotated with vocaloid-style stress marks describing how to pronounce it.

Tacotron as used by Google Assistant also has a Polly-like architecture, but what passes between backend and frontend is vastly different; it is more akin to nerve impulses for vocal cords. The Tacotron predictor runs on Google's servers and generates mel spectra for a WaveNet vocoder running on the phone. The spectrum is basically already an audible voice, just of very poor "recording" quality: it sounds like a really badly compressed MP3 of an otherwise nice human voice. The frontend for this weird thing is the WaveNet vocoder, pretty much a NN image-enhancing algorithm applied to sound. You may notice how Google Assistant sometimes turns uber-robotic like Kindle, PocketBook or Siri TTS. That's when your phone suddenly goes offline and, no longer able to receive Tacotron backend data from the server, it falls back to simple and fast rule-based phoneme-to-speech such as picotts. However, from poking at the guts of Assistant, it still seems encouraging, namely:

1) The WaveNet frontend vocoder runs locally and is blazing fast (supposedly 20x real time).
2) Unoptimized Tacotron 2 in a limited domain (one model with a single voice and style) can run in near real time on a PC CPU, at least judging from a few implementations on GitHub. The limiting factor of the open-source stuff seems to be the frontend WaveNet vocoder, but a really fast proprietary implementation of it exists, so it's just an engineering issue.
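For readers unfamiliar with the mel spectra that pass from predictor to vocoder, the sketch below shows the mel scale itself, using the common formula mel = 2595 * log10(1 + f / 700). The choice of 80 bands between 0 and 8000 Hz is an assumption picked as a typical-looking configuration, not a value taken from any system discussed here.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    centers = mel_band_centers(80, 0.0, 8000.0)  # assumed configuration
    print("lowest bands :", [round(c) for c in centers[:3]])
    print("highest bands:", [round(c) for c in centers[-3:]])
```

The bands are narrow at low frequencies and wide at high ones, so a frame of a few dozen numbers captures roughly what the ear distinguishes. That is why the intermediate data is compact, and also why a vocoder is needed afterwards to restore the fine detail.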

If it turns out in the end that readers are not powerful enough to run the backend (and since nobody does that in the personal assistant space, it might well be a problem), there's still the route of writing a calibre plugin that converts a 500 kB ebook into roughly 5 MB of vocoder data for 10 hours of speech. Definitely much more viable than 300 MB of MP3s.
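The size comparison is easy to check with a little arithmetic. The sketch below converts the figures from the paragraph above into average bitrates, assuming 1 MB = 10^6 bytes and that both files cover the same 10 hours.

```python
def average_bitrate_kbps(size_bytes: float, hours: float) -> float:
    """Average bitrate in kbit/s for a file of size_bytes spanning the given duration."""
    return size_bytes * 8 / (hours * 3600) / 1000


vocoder_kbps = average_bitrate_kbps(5_000_000, 10)  # 5 MB of vocoder data
mp3_kbps = average_bitrate_kbps(300_000_000, 10)    # 300 MB of MP3

if __name__ == "__main__":
    print(f"vocoder data: {vocoder_kbps:.2f} kbit/s")
    print(f"mp3 audio   : {mp3_kbps:.2f} kbit/s")
    print(f"ratio       : {mp3_kbps / vocoder_kbps:.0f}x")
```

Under these assumptions the vocoder stream averages a little over 1 kbit/s against roughly 67 kbit/s for the MP3s, a factor of 60.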

EDIT: I looked into the NVIDIA stuff. nv-wavenet and WaveGlow are still WaveNet, but with a drastically different training procedure so that the model can be evaluated in batches on a GPU without the need for serial regression. This is also the case for the Parallel WaveNet used in Assistant (which avoids regressions at the cost of making it difficult to keep the model stable). All these algorithms seem to be able to run 50-100 simultaneous channels on a single GPU. The question is whether, when scaled down to ARM NEON, they will achieve at least 1 channel in real time.
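Whether a device "achieves 1 in RT" can be framed as a simple budget. The sketch below assumes the predictor and the vocoder run one after the other on the same core, so their costs per second of audio add up; that serial model and the function names are assumptions for illustration, and the speed figures are just the rough ones quoted in this post.

```python
from typing import Iterable


def realtime_speedup(audio_seconds: float, compute_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return audio_seconds / compute_seconds


def keeps_up(stage_speedups: Iterable[float]) -> bool:
    """True if stages run back to back still produce audio at least as fast as it plays.

    Each stage costs 1/speedup seconds of compute per second of audio.
    """
    cost_per_audio_second = sum(1.0 / s for s in stage_speedups)
    return cost_per_audio_second <= 1.0


if __name__ == "__main__":
    # Vocoder at 20x real time plus a predictor at exactly 1x: slightly too slow.
    print(keeps_up([20.0, 1.0]))
    # Same vocoder with a predictor at 2x real time: comfortable margin.
    print(keeps_up([20.0, 2.0]))
```

The point of the toy model is that a predictor running at merely "near real time" leaves no headroom for the vocoder on a single core, so an e-reader would need either a faster predictor or output precomputed on a PC.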

Last edited by ezdiy; 12-02-2019 at 05:30 AM.
Old 12-02-2019, 05:25 AM   #7
kacir
Wizard
 
 
Posts: 3,447
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by Markismus
And I mean concentrate well beyond the concentration needed for writing this post or reading a book. I can't do simple jobs without losing large parts of the sentences.
I had similar problems with an audiobook.
With a written text, you set your own pace; with TTS or an audiobook, you have to pay attention constantly. If you stop paying attention for a moment, the place and context are lost.

Try an audiobook to see whether it is caused by a TTS voice or the audio format in general.
All times are GMT -4.

