The Apple-II just had one-bit sound (both input and output). Digitized voice was a bit rough when using only clipped zero-cross detection (significantly improved if I differentiated the signal before zero-cross detection), but it was still fully intelligible if carefully enunciated, even without differentiation. I read and recorded onto cassette tape a "prototypical" list of phoneme-bearing words, digitized the tape using the Apple-II cassette tape input, then clipped out the phonemes and stored them in an indexed table. I used a published algorithm for (mostly) acceptable English-to-phoneme generation, and I added a pre-lookup table for the obvious pronunciation errors. Anything not in the table went to the algorithm. It worked pretty well, considering the minimal technology. People have done such things recently on Arduinos, and even the lowliest Kindle is one heckuva lot more powerful than an Arduino. So anybody claiming something cannot be done only shows their complete lack of real-world knowledge and/or imagination, OR they are completely P-whipped by the legal and/or marketing departments of the folks who sign their paycheck. And folks who bend over to "authority" figures to question old hacks like us? Well, really now... Ya just gotta wonder...
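For anyone who wants to play with the one-bit trick described above, here is a rough Python sketch, not the original Apple-II code. It hard-clips a signal to one bit by its sign, optionally taking a first difference beforehand as a crude stand-in for differentiation. The function names and the two-tone test signal are made up for illustration. Differentiating boosts the higher frequencies, which is presumably why it helped: the zero crossings then follow more than just the loudest low-frequency component.

```python
import math

def one_bit(samples, differentiate=False):
    """Reduce a signal to one bit per sample by keeping only its sign."""
    if differentiate:
        # First difference: a crude discrete differentiator that
        # emphasizes high frequencies before the hard clip.
        samples = [b - a for a, b in zip(samples, samples[1:])]
    return [1 if s >= 0 else 0 for s in samples]

def count_transitions(bits):
    """Count 0->1 and 1->0 flips, i.e. the detected zero crossings."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)

def test_signal(n=800, rate=8000):
    """Hypothetical test tone: strong 200 Hz plus a weaker 2000 Hz."""
    return [
        math.sin(2 * math.pi * 200 * t / rate)
        + 0.2 * math.sin(2 * math.pi * 2000 * t / rate)
        for t in range(n)
    ]

if __name__ == "__main__":
    sig = test_signal()
    print("plain clip:", count_transitions(one_bit(sig)))
    print("differentiated:", count_transitions(one_bit(sig, differentiate=True)))
```

With this test tone, the plain clip mostly tracks the 200 Hz component, while the differentiated version picks up far more crossings from the weaker 2000 Hz component. Real speech is messier, of course, so treat this as a toy.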
Regarding driving an LED with PWM sound, I wonder if a dollar-store solar garden light photocell, lit by that LED, could drive a dollar-store headset? Pretty darned quiet at best, I think. So yeah, amplified speakers (for those folks who still have them).