Presumably this would not require the publishers to have synced page location markers in the text and audio.
It's entirely automatic, using speech recognition and a comparison against the text (a loose one, I'd suspect, since some narrators' accents are going to make Google's voice recognition produce lots of errors, at least in my experience of their voice-to-text voicemail system).
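To illustrate what I mean by "loose" matching, here is a minimal sketch in Python. It is purely my guess at the general idea, not how the actual feature works: it lines up a noisy transcript against the book text with the standard library's `difflib.SequenceMatcher` and simply skips over words the recognizer got wrong. The function name `align_words` and the sample sentences are made up for the example.

```python
from difflib import SequenceMatcher


def align_words(book_words, heard_words):
    """Return (book_index, heard_index) pairs for words that match in order.

    Hypothetical helper: misrecognized words are left unmatched rather
    than forcing an exact word-for-word comparison.
    """
    def norm(word):
        # Ignore case and surrounding punctuation when comparing.
        return word.lower().strip(".,;:!?\"'")

    a = [norm(w) for w in book_words]
    b = [norm(w) for w in heard_words]
    matcher = SequenceMatcher(None, a, b, autojunk=False)

    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs


# Book text versus an imagined transcript with one misheard word.
book = "It was the best of times, it was the worst of times".split()
heard = "it was the best of dimes it was the worst of times".split()
pairs = align_words(book, heard)
```

The misheard word ("dimes" for "times,") just drops out of the alignment, and the matched words on either side still pin down where the audio sits in the text. A real system would presumably also use the recognizer's timestamps, which this sketch leaves out.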