Photo by Dan Seifert / The Verge | The new speaking style should be arriving on Alexa-enabled devices in the coming weeks.
By James Vincent, The Verge
Amazon’s Alexa continues to learn new party tricks, with the latest being a “newscaster style” speaking voice that will be launching on enabled devices in a few weeks’ time.
You can listen to samples of the speaking style below, and the results, well, they speak for themselves. The voice can’t be mistaken for a human, but it does incorporate stresses into sentences in the same way you’d expect from a TV or radio newscaster. According to Amazon’s own surveys, users prefer it to Alexa’s regular speaking style when listening to articles (though getting news from smart speakers still has lots of other problems).
Amazon says the new speaking style is enabled by the company’s development of “neural text-to-speech” technology, or NTTS. This is the next generation of speech synthesis, which uses machine learning to generate expressive voices more quickly. Currently, Alexa uses concatenative speech synthesis, a method that’s been around for decades. This involves breaking up speech samples into distinct sounds (known as phonemes) and then stitching them back together to form new words and sentences.
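To make the concatenative approach concrete, here is a minimal toy sketch of the lookup-and-stitch idea: pre-recorded phoneme clips are fetched from a library and joined end to end. The phoneme inventory, the pronunciation lexicon, and the clip data below are invented stand-ins for illustration, not anything from Amazon’s actual system.

```python
# Toy sketch of concatenative speech synthesis.
# Each phoneme maps to a short pre-recorded "clip" (here, a fake list
# of audio samples); synthesis just stitches the clips together.

# Invented stand-in clips, one per phoneme.
PHONEME_CLIPS = {
    "HH": [0.1, 0.2],
    "EH": [0.3, 0.4, 0.3],
    "L":  [0.2, 0.1],
    "OW": [0.5, 0.4, 0.2],
}

# A tiny hand-written pronunciation lexicon (also a stand-in).
LEXICON = {"hello": ["HH", "EH", "L", "OW"]}

def synthesize(word):
    """Concatenate the stored clips for each phoneme of the word."""
    samples = []
    for phoneme in LEXICON[word]:
        samples.extend(PHONEME_CLIPS[phoneme])
    return samples

print(synthesize("hello"))
# -> [0.1, 0.2, 0.3, 0.4, 0.3, 0.2, 0.1, 0.5, 0.4, 0.2]
```

Real systems select among many candidate clips per phoneme and smooth the joins, which is exactly the part that tends to sound robotic; the neural approach instead generates the waveform directly from learned patterns.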
Here’s how the voices compare:
Concatenative speech synthesis can produce surprisingly good results, but new AI-infused methods are fast overtaking it. Last October, Google launched a new form of speech synthesis for Google Assistant that uses machine learning techniques developed by its London-based AI lab DeepMind. Amazon tells The Verge that Alexa should be switching to neural text-to-speech synthesis (complete with newscaster voice) “in the coming weeks.”
The newscaster speaking voice was created by recording audio clips from real-life news channels and then using machine learning to spot patterns in how newscasters read the text. Speaking to The Verge, Amazon’s Trevor Wood, who oversees the application of AI in text-to-speech at Amazon, said this approach more easily captures the detail in human speaking styles. “It’s difficult to describe these nuances precisely in words, and a data-driven approach can discover and generalize these more efficiently than a human,” said Wood.
Notably, Amazon says it took only a few hours of data to teach Alexa the newscaster speaking voice, suggesting that a whole range of styles could be easily incorporated in the future. Amazon has already added a whisper mode for Alexa, and after the upgrade to NTTS in the coming weeks, we can probably expect a panoply of voices in 2019.