Custom TextToSpeech "engine" as a way to create custom real audiobooks with texts

noisy · 03-30-2012, 01:22 PM

This is only an idea, but it would be great if someone could explain whether such a thing is possible...

As far as I know, there is no way to create real audiobooks yourself, since .aa and .aax are not open formats and are protected by DRM.

The regular TTS voice is not so great.

It is possible to change the standard "voice" of the TTS mechanism. I guess that this is not really a "voice" but a special kind of program which knows how to pronounce each letter in a specific language.

Is it possible to write a custom TTS program which would play the proper mp3 file from storage, according to information delivered in a customizable new file format, like .openaa (if a paragraph starts with "Once upon a time", play Cinderella/1.mp3, etc.)?
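To make the idea a bit more concrete, here is a minimal Python sketch of the lookup such a program would need. Everything in it is an assumption for illustration: the manifest layout, the second entry, and the name `find_clip` are made up, and nothing here is an existing Kindle or TTS API.

```python
from typing import List, Optional, Tuple

# Hypothetical ".openaa"-style manifest: each entry pairs the opening words
# of a paragraph with the audio clip that narrates that paragraph.
MANIFEST: List[Tuple[str, str]] = [
    ("Once upon a time", "Cinderella/1.mp3"),
    ("The next morning", "Cinderella/2.mp3"),  # invented second entry
]


def find_clip(paragraph: str,
              manifest: List[Tuple[str, str]] = MANIFEST) -> Optional[str]:
    """Return the clip whose prefix matches the start of the paragraph."""
    text = " ".join(paragraph.split())  # collapse runs of whitespace
    for prefix, clip in manifest:
        if text.startswith(prefix):
            return clip
    return None  # no recording for this paragraph


if __name__ == "__main__":
    print(find_clip("Once upon a time there lived a girl."))
```

Matching on the first words of a paragraph is fragile (two paragraphs can start the same way), so a real format would probably need paragraph numbers or text offsets instead.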

The idea is to make it possible to create audiobooks associated with the real text of the whole book.

I can buy an audiobook, I can buy an ebook, but (on the Kindle) I can't conveniently listen to the audiobook while reading the book (which would be great for language learning).

ixtab · 03-30-2012, 04:39 PM

Hmmm...

if I correctly understood what you mean, this is indeed an interesting thought.

I'll try to use an analogy, even if it doesn't fit entirely: So what you have in mind is something similar to what "subtitles" are for movies, or lyrics for songs, but "the other way around", right? Like, you are reading a chapter, and the Kindle would speak along as you read.

In what follows, I'm only addressing the technical issues that come to my mind.

First, if you're going in the "text->sound" direction, it's (at least initially) pretty hard to match what you're seeing (a page of text) to what you're hearing (a stream of words). You can probably reproduce that by simply turning on TTS somewhere in a book. I always find myself scanning the page for up to 10 seconds to even find which passage is currently being read. Even assuming the audio output was perfect (audiobook quality), this problem would persist, because you have no visual cue of what is currently being read. This is less of a problem once you've found the current "audio" position and are then just following along reading.

Still, it suggests that you most probably want to synchronize the text output with the audio output, and not the other way around. Example (the word being spoken at each timestamp is in brackets):
"This is a sample text" (2 secs audio).

0:00.000 -> [This] is a sample text
0:00.500 -> This [is] a sample text
0:00.800 -> This is [a] sample text
0:01.000 -> This is a [sample] text
0:01.600 -> This is a sample [text]

These values are arbitrary and for demonstration only, but I hope it's clear what I mean.
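For what it's worth, the lookup itself is cheap once such a table exists. Here is a small Python sketch using the same arbitrary timestamps, one per word; the names `CUES` and `word_at` are made up for illustration, and how a reader application would actually highlight the word is left open.

```python
from bisect import bisect_right
from typing import Optional

# (start time in seconds, word) pairs, using the arbitrary demo values above.
CUES = [(0.0, "This"), (0.5, "is"), (0.8, "a"), (1.0, "sample"), (1.6, "text")]
STARTS = [start for start, _ in CUES]


def word_at(seconds: float) -> Optional[str]:
    """Return the word being spoken at the given playback position."""
    # Index of the last cue that starts at or before `seconds`.
    i = bisect_right(STARTS, seconds) - 1
    return CUES[i][1] if i >= 0 else None


if __name__ == "__main__":
    for t in (0.0, 0.9, 1.7):
        print(t, word_at(t))
```

The binary search keeps the lookup fast even with one entry per word for a whole book; the expensive part is creating the table, not reading it.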

This would provide the advantage of always knowing what is currently being spoken, but has two major disadvantages: It would require an enormous amount of metadata (1 entry per word), which has to be manually created (quite simply impossible, unless you have loads of $$$ to throw out of the window), and it would actually be stressing the reader. So a more reasonable version might be to associate sound chunks with paragraphs of text (pretty useful IMO) or even pages.

Another method might be to insert "synchronization marks" every x seconds. I'm not an audiobook aficionado (in fact I only ever listened to "audiobooks" by coincidence on the radio, while driving on the highway), but from what I quickly googled, the complete LOTR audiobook is about 55 hours (3300 mins) and ~ 1200 (physical) pages, so (very) roughly 3 mins/page.

So,

1 mark/ 10 secs = 19800 marks, 16.5 mpp (marks per page)
1 mark/ 30 secs = 6600 marks, 5.5 mpp
1 mark/ min = 3300 marks, 2.75 mpp
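The arithmetic behind these figures, as a small Python sketch. The 55 hours and 1200 pages are the rough values quoted above; the function name `marks` is just for illustration.

```python
AUDIO_SECONDS = 55 * 60 * 60  # about 55 hours of audio
PAGES = 1200                  # about 1200 physical pages


def marks(interval_seconds: int):
    """Return (total marks, marks per page) for one mark every N seconds."""
    total = AUDIO_SECONDS // interval_seconds
    return total, total / PAGES


if __name__ == "__main__":
    for interval in (10, 30, 60):
        total, per_page = marks(interval)
        print(f"1 mark / {interval} s = {total} marks, {per_page} mpp")
```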

I would personally think the "middle ground" is acceptable here, both in terms of "finding your way around", i.e., synchronizing what you're reading and hearing, and in terms of "picking up where you left off". It would still mean that to prepare such a book/audiobook combination, someone would have to listen AND read for 55 hours, clicking on the "I am here" word every 30 seconds. I've done similar (not identical) tasks before, and I can assure you it's extremely tedious: you are doing an extremely dumb job, yet you must be totally concentrated.

That put aside, a method of combining the abovementioned media formats must be found. Is it an MP3 file, a MOBI file, and a (for instance) SYNC file? What if any one of these files is not in sync with the other two? Is a single container file (say, .EWA -- ebook with audio --) the better way to go? Does one have to invent such a format from the ground up, or could other existing formats (like subtitles, lyrics,...) be reused or at least be used for inspiration?
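As one possible source of inspiration, here is a Python sketch of a sidecar sync file whose timestamps loosely follow the `[mm:ss.xx]` tags used in LRC lyric files. Storing a character offset into the book text after each tag is purely my assumption, as are the names `parse_sync` and `SAMPLE`; this is not an existing format.

```python
import re
from typing import List, Tuple

# One mark per line: "[mm:ss.xx] <character offset into the book text>".
# The offset field is a made-up convention for this sketch.
LINE = re.compile(r"\[(\d+):(\d{2})(?:\.(\d{1,3}))?\]\s*(\d+)$")

SAMPLE = """\
[00:00.00] 0
[00:30.00] 412
[01:00.50] 873
"""


def parse_sync(text: str) -> List[Tuple[float, int]]:
    """Parse sync marks into sorted (seconds, text offset) pairs."""
    marks = []
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip blank or malformed lines
        minutes, seconds, frac, offset = m.groups()
        t = int(minutes) * 60 + int(seconds)
        t += float("0." + frac) if frac else 0.0
        marks.append((t, int(offset)))
    marks.sort()
    return marks


if __name__ == "__main__":
    print(parse_sync(SAMPLE))
```

A plain-text sidecar like this is easy to create and fix by hand, but it does nothing to detect a mismatched MP3 or MOBI file; a container format could address that by keeping all three parts together.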

And finally: how could this be implemented, and integrated, on the Kindle at all? Would it be possible to write this once, then use everywhere (K2, K3, K5)* ? Or would one need to adapt it more or less heavily to every single model?

OK, I realize I wrote quite a bit of text. The purpose was not to intimidate you, or to slay your question. On the contrary -- as said, I do find this a very interesting topic. Otherwise, I wouldn't have spent more than an hour writing this and researching some of the background. I'm only trying to realistically answer your question about the feasibility of such a project, but I don't know how experienced you are in developing software.

So to wrap it up in one line, and from my perspective: It's indeed a very ambitious project, which will need a lot of time, a lot of smart ideas, and even more dedication. Conclusion? Go for it! Come up with ideas, proof-of-concepts, alpha versions etc. There are a lot of very smart people around in this area of mobileread, so I bet that you will find some talented folks who are interested, and willing to join in and contribute.

(*) K4 not considered because it doesn't have speakers (AFAIK). I may have gotten other models wrong as well.

noisy · 03-31-2012, 08:42 AM

You understood me very well. Of course, the number of "synchronization marks" is directly tied to the number of mp3 files. Starting a new file every 10 seconds would probably cause some delays, which is not desirable.

Quote:

Originally Posted by ixtab

1 mark/ 30 secs = 6600 marks, 5.5 mpp

That could be a good ratio. However... LOTR is not the first ebook that came to my mind when I hit on this idea.

I think this would work better for shorter texts, or even for podcasts which provide a transcript.

I assume that the original TTS mechanism takes part of the text and keeps it in some kind of buffer. The question is how to write a program which could pretend to be the TTS engine and use the same kind of buffer.

