Bridging the Gap — Human & AI Voice Synthesis in Speech and Music

I have long been fascinated by the promise of synthetic voice development for speech and for music. As a would-be record producer in the 1980s, putting myself through college working at Keyboard Magazine, I became enamored with tools like the Moog synthesizer, the vocoder (voice encoder) and the Talk Box* (an effects pedal made famous by Joe Walsh and Peter Frampton in the mid-70s), and later with Auto-Tune by Antares. All of these would, by degrees, change music by allowing forward-thinking artists to modulate their voices and create new effects.

The invention of the vocoder initially had nothing to do with music. Bell Labs began developing it in 1928 to compress human speech so that it could be carried over the transatlantic cable using less bandwidth, and it was later adopted as military technology to secure and encrypt voice communications and guard against code-breaking. The earliest experiments were pretty rudimentary, however, and resulted in garbled speech.

Some 50 years later, hip hop artists and other musicians embracing synth technology (like Kraftwerk, Daft Punk and Herbie Hancock) would experiment with the vocoder as an instrument in their work, as the worlds of music and computer technology began to converge.

The *Talk Box was an effects pedal that would direct sound from the instrument into the musician’s mouth by means of a plastic tube adjacent to their vocal microphone. The musician controlled the modification of the instrument’s sound by changing the shape of their mouth, “vocalizing” the instrument’s output into a microphone.

Auto-Tune is an audio processor launched in 1997 by Antares; it was developed to correct off-key pitch for professional singers. Savvy artists and producers would later play with it to distort or modulate vocals — most famously epitomized by Cher in the 1998 hit song “Believe”.
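For readers curious about the arithmetic behind pitch correction, here is a minimal Python sketch of the core idea: take a detected pitch and snap it to the nearest note of the equal-tempered scale. This is only an illustration, not Antares's actual algorithm. The function name `snap_to_semitone`, the A4 = 440 Hz reference tuning and the note-naming scheme are all my own assumptions, and a real pitch corrector would also need to detect the pitch from audio and resynthesize the corrected sound.

```python
import math

A4_HZ = 440.0  # assumed reference tuning (concert pitch)
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def snap_to_semitone(freq_hz):
    """Snap a detected pitch to the nearest equal-tempered semitone.

    Hypothetical helper for illustration only. Returns a tuple of
    (corrected frequency in Hz, note name, how many cents the input
    was above (+) or below (-) the note it was snapped to).
    """
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")

    # Distance from A4 in semitones (12 semitones per octave).
    semitones_from_a4 = 12 * math.log2(freq_hz / A4_HZ)

    # The "correction": round to the nearest whole semitone.
    nearest = round(semitones_from_a4)
    corrected_hz = A4_HZ * 2 ** (nearest / 12)

    # 100 cents per semitone.
    cents_off = 100 * (semitones_from_a4 - nearest)

    # MIDI note 69 is A4; use it to derive a readable note name.
    midi_note = 69 + nearest
    note_name = NOTE_NAMES[midi_note % 12] + str(midi_note // 12 - 1)

    return corrected_hz, note_name, cents_off


if __name__ == "__main__":
    # A singer aiming for A4 but landing slightly sharp at 450 Hz.
    hz, name, cents = snap_to_semitone(450.0)
    print(f"sung 450.0 Hz, corrected to {name} ({hz:.2f} Hz), was {cents:+.1f} cents off")
```

Applying the correction instantly, with no glide between notes, is roughly what produces the stepped, robotic vocal sound that artists later used on purpose; a gentler setting would move the pitch toward the target gradually.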

The artist, songwriter and producer Mark Ronson has a brand new series out entitled “Watch the Sound”, airing now on Apple TV+, which does a great job of diving deep into many of these technologies. I highly recommend it if you’re keen to learn more about how they relate to music production.

Voice synthesis has also played an instrumental role (pun intended!) in supporting individuals whose disabilities have affected their vocal cords and, with them, their ability to speak.

Probably the most famous example of this is Stephen Hawking, who lost his ability to speak after an emergency tracheotomy in the mid-1980s. There’s a fabulous in-depth story in Wired Magazine about how Intel Labs harnessed state-of-the-art technology to provide a breakthrough solution for Hawking, with many tweaks over several years, that ultimately allowed him to communicate far more effectively in his later years.

More recently (in fact just a few days ago!), there was a breakthrough in voice synthesis for actor Val Kilmer, who lost his voice after undergoing treatment for throat cancer several years ago. This new technology was developed by a company called Sonantic, which creates lifelike performances for films and games with fully expressive AI-generated voices.

One of the biggest challenges in voice synthesis has been replacing the robotic quality that computer-generated voices have historically had with something more lifelike: a voice with the unique inflection, pitch, tonality and emotional nuance that make the result feel much more human and compelling.

Kilmer worked closely with Sonantic, running past recordings of his voice through their simulation, and when you hear the result it’s astonishing how far the technology has come over the decades (!). Do give it a listen; it’s truly amazing.

Of course, many of us have become used to artificial voice assistants like Siri and Alexa in our everyday lives, and they will continue to evolve and seem more human as AI and machine learning progress.

If this is a topic that appeals to you, I also recommend two other articles I came across while putting this piece together that offer a deeper dive. Check out “The Exciting Future of Voice Synthesis Technology” (which, among other things, looks at some of Google’s initiatives in this space, such as WaveNet, which uses AI and machine learning), and a recent article in Scientific American entitled “Artificial Intelligence Is Now Shockingly Good at Sounding Human”, which covers this subject in great detail and is well worth a read.

KELLI RICHARDS is a seasoned ‘super-connector’, a trusted advisor & a strategic bus dev exec bridging innovators & creatives. Learn more at kellirichards.com