MobileRead Forums - View Single Post - Custom TextToSpeech "engine" as a way to create custom real audiobooks with texts

ixtab · 03-30-2012, 05:39 PM

Hmmm...

If I correctly understood what you mean, this is indeed an interesting thought.

I'll try to use an analogy, even if it doesn't fit entirely: So what you have in mind is something similar to what "subtitles" are for movies, or lyrics for songs, but "the other way around", right? Like, you are reading a chapter, and the Kindle would speak along as you read.

In what follows, I'm only addressing the technical issues that come to my mind.

First, if you're going in the "text->sound" direction, it's (at least initially) pretty hard to match what you're seeing (a page of text) to what you're hearing (a stream of words). You can probably reproduce that by simply turning on TTS somewhere in a book. I always find myself scanning the page for up to 10 seconds to even find which passage is currently being read. Even assuming the spoken output were perfect (audiobook quality), this problem would persist, because you have no visual cue of what is currently being read. This is less of a problem once you have found the current "audio" position and are then just following along reading.

Still, it suggests that you most probably want to synchronize the text output with the audio output, and not the other way around. Example:
"This is a sample text" (2 secs audio), with the word highlighted at each timestamp shown in brackets:

0:00.000 -> [This] is a sample text
0:00.500 -> This [is] a sample text
0:00.800 -> This is [a] sample text
0:01.000 -> This is a [sample] text
0:01.600 -> This is a sample [text]

These values are arbitrary and for demonstration only, but I hope it's clear what I mean.
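
To make the per-word idea a bit more concrete, here is a minimal Python sketch of the lookup it implies: given the current audio time, find the word to highlight. The names (`WORDS`, `STARTS`, `word_at`, `highlight`) are made up for illustration, the times are the arbitrary demo values from above, and nothing here is Kindle-specific.

```python
from bisect import bisect_right

# Hypothetical per-word sync table: one start time (in seconds) per word.
# The times are the arbitrary demo values from the example, not real data.
WORDS = "This is a sample text".split()
STARTS = [0.0, 0.5, 0.8, 1.0, 1.6]

def word_at(t: float) -> int:
    """Return the index of the word being spoken at audio time t (seconds)."""
    # bisect_right gives the position of the first start time greater than t;
    # the word currently being spoken is the one just before that position.
    i = bisect_right(STARTS, t) - 1
    return max(i, 0)  # clamp times before the first mark to the first word

def highlight(t: float) -> str:
    """Render the sentence with the current word wrapped in brackets."""
    cur = word_at(t)
    return " ".join(f"[{w}]" if i == cur else w for i, w in enumerate(WORDS))
```

Calling `highlight(0.9)` would mark the third word, since 0.9 s falls between the 0.8 s and 1.0 s entries. The binary search keeps the lookup cheap even with one entry per word; the expensive part is still creating the table.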

This would provide the advantage of always knowing what is currently being spoken, but has two major disadvantages: it would require an enormous amount of metadata (1 entry per word), which has to be created manually (quite simply impossible, unless you have loads of $$$ to throw out of the window), and it would actually be stressful for the reader. So a more reasonable version might be to associate sound chunks with paragraphs of text (pretty useful IMO) or even pages.

Another method might be to insert "synchronization marks" every x seconds. I'm not an audiobook aficionado (in fact I've only ever listened to "audiobooks" by coincidence on the radio, while driving on the highway), but from what I quickly googled, the complete LOTR audiobook is about 55 hours (3300 mins) and ~1200 (physical) pages, so (very) roughly 3 mins/page.

So,

1 mark / 10 secs = 19800 marks, 16.5 mpp (marks per page)
1 mark / 30 secs = 6600 marks, 5.5 mpp
1 mark / min = 3300 marks, 2.75 mpp
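
For anyone who wants to play with other intervals, the arithmetic is easy to reproduce. This is just a sketch using the rough figures quoted above (55 hours, 1200 pages); `AUDIO_HOURS`, `PAGES` and `marks_for` are names I made up. Note that the 10-second case works out to 16.5 marks per page.

```python
AUDIO_HOURS = 55   # approximate audiobook length quoted above
PAGES = 1200       # approximate physical page count quoted above

def marks_for(interval_secs: int) -> tuple[int, float]:
    """Return (total marks, marks per page) for one mark every interval_secs."""
    total_secs = AUDIO_HOURS * 3600
    marks = total_secs // interval_secs
    return marks, marks / PAGES

for interval in (10, 30, 60):
    marks, mpp = marks_for(interval)
    print(f"1 mark / {interval:>2} s = {marks:>5} marks, {mpp:.2f} mpp")
```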

I would personally think the "middle ground" is acceptable here, both in terms of "finding your way around", i.e., synchronizing what you're reading and hearing, and in terms of "picking up where you left off". It would still mean that to prepare such a book/audiobook combination, someone would have to listen AND read for 55 hours, clicking on the "I am here" word every 30 seconds. I've done similar (not identical) tasks before, and I can assure you it's extremely tedious: you are doing an extremely dumb job, yet you must stay fully concentrated.

That put aside, a method of combining the abovementioned media formats must be found. Is it an MP3 file, a MOBI file, and a (for instance) SYNC file? What if any one of these files is not in sync with the other two? Is a single container file (say, .EWA -- ebook with audio --) the better way to go? Does one have to invent such a format from the ground up, or could other existing formats (like subtitles, lyrics,...) be reused or at least be used for inspiration?
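
As one possible starting point for the "SYNC file" idea, here is a rough Python sketch of a sidecar format, loosely modelled on the `[mm:ss.xx]` timestamps used in LRC lyric files. Everything else is an assumption on my part: the character-offset field, the function names, and the sample values are invented for illustration, and this is not an existing Kindle or MOBI feature.

```python
import re
from bisect import bisect_right

# Hypothetical sidecar format: each line is
#   "[mm:ss.xx] <character offset into the book text>"
LINE_RE = re.compile(r"\[(\d+):(\d{2}(?:\.\d+)?)\]\s*(\d+)$")

def parse_sync(text: str) -> list[tuple[float, int]]:
    """Parse sync marks into (audio time in seconds, text offset) pairs."""
    marks = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or unrecognised lines
        minutes, seconds, offset = m.groups()
        marks.append((int(minutes) * 60 + float(seconds), int(offset)))
    marks.sort()
    return marks

def offset_at(marks: list[tuple[float, int]], t: float) -> int:
    """Text offset of the latest mark at or before audio time t."""
    # Assumes at least one mark exists.
    times = [ts for ts, _ in marks]
    i = bisect_right(times, t) - 1
    return marks[max(i, 0)][1]

def time_for(marks: list[tuple[float, int]], offset: int) -> float:
    """Audio time of the latest mark at or before a text offset (for resuming)."""
    # Assumes text offsets increase together with audio time.
    offsets = [off for _, off in marks]
    i = bisect_right(offsets, offset) - 1
    return marks[max(i, 0)][0]

# Made-up sample: one mark every 30 seconds.
SAMPLE = """\
[00:00.00] 0
[00:30.00] 412
[01:00.00] 857
"""
```

With `marks = parse_sync(SAMPLE)`, `offset_at(marks, 45)` gives the text position to show while the audio plays, and `time_for(marks, 500)` gives the audio position to resume from after reading silently for a while. A plain-text sidecar like this is easy to create and inspect by hand, but it does nothing to guarantee that it matches the audio and ebook files it ships with.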

And finally: how could this be implemented, and integrated, on the Kindle at all? Would it be possible to write this once, then use everywhere (K2, K3, K5)* ? Or would one need to adapt it more or less heavily to every single model?

OK, I realize I wrote quite a bit of text. The purpose was not to intimidate you, or to slay your question. On the contrary -- as said, I do find this a very interesting topic. Otherwise, I wouldn't have spent more than an hour writing this and researching some of the background. I'm only trying to realistically answer your question about the feasibility of such a project, but I don't know how experienced you are in developing software.

So to wrap it up in one line, and from my perspective: It's indeed a very ambitious project, which will need a lot of time, a lot of smart ideas, and even more dedication. Conclusion? Go for it! Come up with ideas, proofs of concept, alpha versions, etc. There are a lot of very smart people around in this area of mobileread, so I bet that you will find some talented folks who are interested, and willing to join in and contribute.

(*) K4 not considered because it doesn't have speakers (AFAIK). I may have gotten other models wrong as well.
