Below is a detailed description of how I plan to add sound to my GMV (geekmaster video) file format, in a (mostly) backward-compatible way. Also described is a little bit about how to process that data to overcome potential problems caused by eink update time variability.
After the GMV files have embedded sound, I will change the gmplay 2.0 code you can see in the previous post so that it can play the new audio track on the kindle speakers (or headphones) while simultaneously playing the video on the kindle eink screen.
I think I have mostly decided what format to use for adding sound to GMV files. Currently, GMV files are raw dithered 800x600 bit-packed video frames with no metadata (60,000 bytes/frame). To add sound, I will insert 60,000-byte audio frames as needed. If a GMV file contains audio, then the first frame will be an audio frame. Audio frames will contain a frame header with a unique signature (magic number) and WAVE-style sound metadata. Because each audio frame has its own metadata, we can switch encoding methods for each audio frame. This will allow using A-Law for simple encoding, or raw 16-bit sound, or celp/speex, celt, or mp3 codecs, depending on the content at THAT POINT in the video. We could have speex for a lecture and mp3 for intro music IN THE SAME VIDEO.
The audio metadata will also say how many frames until the next audio frame, so a simple player only needs to count frames (no need to search for metadata signatures).
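To make this concrete, here is one way the audio frame header could be laid out. This is only a sketch: the magic string, field names, and field sizes are my illustrative guesses, not the finalized format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical GMV audio-frame header -- field names, sizes, and the
 * magic value are illustrative guesses, not the finalized format. */
#define GMV_AUDIO_MAGIC "GMVA"       /* unique signature marking an audio frame */

enum gmv_codec {                     /* per-frame codec selection */
    GMV_CODEC_ALAW = 0,              /* simple logarithmic compression */
    GMV_CODEC_PCM16,                 /* raw 16-bit samples */
    GMV_CODEC_SPEEX,
    GMV_CODEC_MP3
};

struct gmv_audio_hdr {
    char     magic[4];               /* "GMVA" */
    uint16_t codec;                  /* enum gmv_codec for THIS frame */
    uint16_t channels;               /* WAVE-style metadata */
    uint32_t sample_rate;            /* e.g. 8000 Hz */
    uint32_t data_bytes;             /* payload length in this 60,000-byte frame */
    uint32_t frames_to_next;         /* video frames until the next audio frame,
                                        so a simple player only counts frames */
};

/* Returns 1 if a frame buffer starts with the audio signature. */
int gmv_is_audio_frame(const uint8_t *frame)
{
    return memcmp(frame, GMV_AUDIO_MAGIC, 4) == 0;
}
```

With `frames_to_next` in every header, a player never has to scan for the magic number; the signature check is only needed for the first frame (and as a sanity check thereafter).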
The player will buffer 1MB of data before starting to play, so it can synchronize the sound and video together.
The complication is that gmplay uses 130 msec frames because that is the best AVERAGE framerate that a K3 can do. Any faster and complex scenes get nasty artifacts. Any slower and the animation gets too jerky. But that is only an AVERAGE. On the K3, the eink update does not return until the framebuffer is ready to be written with new data (130 msec, more or less). On the K4/K5, the eink update returns almost immediately and we wait the rest of the 130 msec frame time before writing to the framebuffer, so the videos play at the same average speed on all kindles.
For simple scenes, the K3 can return much faster than 130 msec, but for complex scene changes, it can take up to 300 msec to return (such as when cutting from a dark scene to a bright scene). Sometimes, multiple scene changes within one second can cause the K3 to fall up to a second behind (and occasionally more in some videos). When that happens, gmplay drops frames to keep it from falling any further behind. When the video gets back to simple scene changes, the K3 can catch back up to "real time".
If you start framedropping before the delay reaches one second, most videos suffer too many dropped frames, or can even fall into a state where every other frame gets dropped, resulting in very jerky motion.
Because the eink updates are the limiting factor and we need to allow for occasional delays of up to one second in the video, getting the audio to stay in sync adds significant complication. Because framedropping makes the video play time almost exactly match the audio play time, we can just ignore long eink updates and the following dropped frames and let the sound occasionally get up to one second AHEAD of the video (resynchronizing after framedrops), or we can make every audio buffer match the video frame just written to the framebuffer (which would sometimes require skipping or repeating one or more audio buffers). The key to making that less annoying is to make the audio buffers play for exactly 130 msec as well, and pack multiple audio buffers into each audio frame. Using variable-compression audio formats would complicate this.
Because we can use predictable A-Law logarithmic fixed compression on the audio, we can use 8000 bytes/sec audio (which is 7 seconds of audio per 60,000-byte audio frame, with room to spare). We only need to insert an audio frame every 7 seconds. We can break that up into 130 msec buffers, to match the video frames. We can play the audio buffer that matches the current video frame. If an audio buffer completes before the next video frame is ready, we can REPLAY the current audio buffer (during complex scene changes). Hopefully, this will be during silent cut scenes or violent noisy action scenes where such audio glitches may go unnoticed. When the video is "catching up" with framedrops, we can also drop the matching audio buffers.
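The framing arithmetic above checks out. A quick sketch, using only the numbers from the text (8000 bytes/sec A-Law, 60,000-byte frames, 130 msec buffers):

```c
/* Audio framing arithmetic from the text: A-Law at 8000 bytes/sec,
 * 60,000-byte audio frames, 130 msec buffers. */

/* bytes of A-Law audio in one buffer of the given duration */
int buffer_bytes(int bytes_per_sec, int buffer_ms)
{
    return bytes_per_sec * buffer_ms / 1000;   /* 8000 * 130 / 1000 = 1040 */
}

/* whole 130 msec buffers that fit in one audio frame */
int buffers_per_frame(int frame_bytes, int buf_bytes)
{
    return frame_bytes / buf_bytes;            /* 60000 / 1040 = 57 */
}
```

That is 57 buffers of 1040 bytes, or about 7.4 seconds of audio per frame, which leaves 720 bytes of slack for the header -- the "7 seconds with room to spare" figure.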
There are more complex ways of changing the audio speed using three-phase ring buffers (with single phase fades across buffer I/O crossover), but we will try the simpler way first.
EDIT: It just occurred to me that it may be simpler to let the audio run without interruption and sync the video to it. In other words, whenever the eink update returns, it just plays whatever video frame is the closest match to the audio, even if that means dropping multiple frames to get back in sync. The simplest code may be to let the AUDIO thread advance the NEXT VIDEO frame counter. Then the video can just use that frame counter (or pointer), instead of dropping frames by itself. That would guarantee sound sync, without any audible hiccups. It makes the previous idea seem complicated. Things get simpler if you just think about them enough...
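The EDIT idea above can be sketched in a few lines. All names here are hypothetical: the audio thread owns the clock and publishes which video frame SHOULD be showing, and the video thread simply reads that counter whenever the eink update returns, implicitly dropping however many frames are needed to stay in sync.

```c
#include <stdatomic.h>

/* Sketch of audio-driven sync, with hypothetical names: one 130 msec
 * audio buffer corresponds to one video frame, so the mapping is 1:1. */

static atomic_uint next_video_frame;  /* written by audio, read by video */

/* Called by the audio thread each time a 130 msec buffer finishes playing. */
void audio_buffer_done(unsigned buffers_played)
{
    atomic_store(&next_video_frame, buffers_played);
}

/* Called by the video thread when the eink update returns; returns the
 * frame index to display next. Any frames in between are simply never
 * drawn -- the video thread no longer decides when to drop frames. */
unsigned video_frame_to_show(void)
{
    return atomic_load(&next_video_frame);
}
```

Because the audio never stalls, there are no audible hiccups by construction; the video is the only thing that can fall behind, and it resynchronizes on every single eink update.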