Amazon’s Alexa is getting better at recognizing who’s speaking and what they’re speaking about, understanding words through on-device techniques, and leveraging models trained on data that has never been reviewed by humans. That’s according to automatic speech recognition head Shehzad Mevawalla, who spoke with VentureBeat ahead of a keynote address at this year’s Interspeech conference.
Alexa is now running “full-capability” speech recognition on-device, after previously relying on models many gigabytes in size that required huge amounts of memory and ran on servers in the cloud. That change was made possible by a move to end-to-end models, Mevawalla said: AI models that take acoustic speech signals as input and directly output transcribed text. Alexa’s previous speech recognizers had specialized components that processed inputs in sequence, such as an acoustic model and a language model.
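For readers unfamiliar with the architecture change, the sketch below shows what a single combined network can look like in code: one model consumes acoustic features and emits per-frame character probabilities, with no separate acoustic and language models. The layer sizes, vocabulary, and CTC-style output head are illustrative assumptions, not details Amazon has disclosed.

```python
# Minimal sketch of an end-to-end speech recognizer: one network maps
# acoustic features directly to character probabilities, replacing the
# separate acoustic and language models of a traditional pipeline.
# All dimensions and the CTC-style head are illustrative assumptions.
import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab_size=29):  # 26 letters + space + apostrophe + blank
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, vocab_size)   # per-frame character logits

    def forward(self, features):                        # features: (batch, frames, n_mels)
        encoded, _ = self.encoder(features)
        return self.head(encoded).log_softmax(dim=-1)   # (batch, frames, vocab) for CTC decoding

model = EndToEndASR()
log_probs = model(torch.randn(1, 200, 80))   # 2 seconds of 10 ms frames
print(log_probs.shape)                       # torch.Size([1, 200, 29])
```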
“With an end-to-end model, you end up getting away from having these separate pieces and end up with a combined neural network,” Mevawalla said. “You’re going from gigabytes down to less than 100MB in size. That allows us to run these things in very constrained spaces.”
Still, the offline models — which are available first for English in the U.S. — are hardware-constrained in the sense that they need an on-device accelerator to process speech at acceptable speeds. Even though the models themselves are small, they contain millions of parameters — variables internal to the models that shape their predictions — that must be computed through matrix multiplications, one of the key operations in deep neural networks. Amazon’s solution is the AZ1 Neural Edge processor, which was developed in collaboration with MediaTek and is built into the latest Echo, Echo Dot, Echo Dot with Clock, Echo Dot Kids Edition, and Echo Show 10.
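A rough calculation shows why a compact model can still need dedicated silicon. Every figure below is an illustrative assumption, not a number from Amazon.

```python
# Back-of-envelope estimate of why on-device ASR benefits from an accelerator.
# Parameter count, frame rate, and quantization level are illustrative
# assumptions, not figures disclosed by Amazon.
params = 20_000_000            # a ~20M-parameter end-to-end model
bytes_per_weight = 1           # 8-bit quantized weights

model_size_mb = params * bytes_per_weight / 1e6
print(f"model size: ~{model_size_mb:.0f} MB")            # well under 100 MB

frames_per_second = 100        # 10 ms acoustic frames
macs_per_frame = params        # each weight used roughly once per frame
print(f"compute: ~{frames_per_second * macs_per_frame / 1e9:.1f} GMAC/s")
# ~2 GMAC/s of sustained matrix multiplication: light work for a dedicated
# neural accelerator, but a heavy load for a small device CPU.
```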
“The AZ1 basically helps with those matrix multiplication operations and offloads the limited processor. We now have a model that runs on the device that actually has the same or better accuracy than what runs in the cloud,” Mevawalla said.
Alexa’s speaker ID function, which recognizes who is speaking in order to personalize responses, has also moved to an end-to-end machine learning model. It’s a two-algorithm approach that combines text-dependent and text-independent models: the text-dependent model knows ahead of time what users will say and matches their voice against that expected phrase, while the text-independent model matches voices regardless of what’s being said.
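A minimal sketch of how two such scores might be fused into a single identification decision, assuming hypothetical enrolled voice profiles, cosine-similarity scoring, and an arbitrary weighting and threshold:

```python
# Sketch of a two-algorithm speaker ID decision: a text-dependent score
# computed on the expected phrase and a text-independent score computed on
# the rest of the utterance are fused into one decision. The embeddings,
# weights, and threshold here are hypothetical.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(phrase_emb, utterance_emb, profiles, weight=0.6, threshold=0.75):
    """profiles: {name: (enrolled_phrase_emb, enrolled_voice_emb)}"""
    best_name, best_score = None, -1.0
    for name, (enrolled_td, enrolled_ti) in profiles.items():
        td_score = cosine(phrase_emb, enrolled_td)      # text-dependent: same phrase
        ti_score = cosine(utterance_emb, enrolled_ti)   # text-independent: any speech
        score = weight * td_score + (1 - weight) * ti_score
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None   # None -> unrecognized speaker

rng = np.random.default_rng(0)
profiles = {"alex": (rng.normal(size=128), rng.normal(size=128))}
print(identify_speaker(rng.normal(size=128), rng.normal(size=128), profiles))
```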
Improved speaker ID bolsters Natural Turn Taking, a feature that lets multiple people join conversations with Alexa without having to use a wake word for every utterance. Three models run in parallel to drive Natural Turn Taking, which will only be available in English when it launches next year. The first distinguishes background speech and noise from commands intended for Alexa. The second uses speech recognition to convert speech into text so it can be analyzed at the sub-word level. The third uses the signal from a device camera, if one is available, to decide whether the speech is directed at the device.
In the case of Echo devices with a camera, Natural Turn Taking can use the camera to detect where a person is looking, whether at another person or at the device. Video and speech are processed locally, and neural networks fuse the two signals to decide whether the speech was intended for Alexa. Natural Turn Taking doesn’t require a camera, but it’s more accurate on camera-equipped devices.
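Conceptually, the fusion step could look like the sketch below, which stubs out the three parallel models as probabilities and falls back to audio-only fusion when no camera is present. The weights and threshold are assumptions for illustration, not Amazon’s.

```python
# Sketch of fusing the three parallel Natural Turn Taking signals into one
# "was this meant for Alexa?" decision. The individual models are stubbed
# out as probabilities; the weights and camera fallback are assumptions.
from typing import Optional

def directed_at_device(audio_score: float,
                       text_score: float,
                       gaze_score: Optional[float],
                       threshold: float = 0.5) -> bool:
    """audio_score: P(device-directed) from the acoustic model
       text_score:  P(device-directed) from sub-word analysis of the transcript
       gaze_score:  P(looking at the device) from the camera, or None if no camera"""
    if gaze_score is None:
        fused = 0.5 * audio_score + 0.5 * text_score                    # audio-only devices
    else:
        fused = 0.4 * audio_score + 0.3 * text_score + 0.3 * gaze_score
    return fused >= threshold

# A follow-up request spoken while facing the device:
print(directed_at_device(audio_score=0.7, text_score=0.6, gaze_score=0.9))  # True
# Side chatter between two people, no one looking at the device:
print(directed_at_device(audio_score=0.3, text_score=0.4, gaze_score=0.1))  # False
```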
At a higher level, Mevawalla says Alexa’s speech recognition has become more accurate through fine-tuning. Alexa leverages a “teacher” model trained on millions of hours of data, which attunes it to a range of acoustic conditions, language variability, and accents. This model is then tailored to understand the vernacular of a particular region or language. As Mevawalla notes, different countries have different backgrounds, noise conditions, and speaking styles.
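The “teacher then tailor” pattern Mevawalla describes resembles standard fine-tuning: copy the broadly trained model and continue training it on region-specific audio at a small learning rate. The sketch below is a generic illustration of that idea, with a stand-in model and invented hyperparameters rather than Amazon’s actual training setup.

```python
# Sketch of tailoring a broadly trained "teacher" recognizer to one region:
# copy its weights and fine-tune on that region's audio. The stand-in model,
# optimizer settings, and batch format are illustrative assumptions.
import copy
import torch
import torch.nn as nn

def tailor_to_region(teacher_model: nn.Module, regional_batches, lr=1e-4, epochs=2):
    """Fine-tune a copy of the broadly trained teacher on regional data."""
    regional_model = copy.deepcopy(teacher_model)            # keep the original teacher intact
    optimizer = torch.optim.Adam(regional_model.parameters(), lr=lr)  # small LR preserves general knowledge
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, targets in regional_batches:           # region-specific accents and noise
            logits = regional_model(features)
            loss = loss_fn(logits, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return regional_model

teacher = nn.Linear(80, 29)                                  # stand-in for a real ASR network
batches = [(torch.randn(8, 80), torch.randint(0, 29, (8,)))]
tailored = tailor_to_region(teacher, batches)
```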
“Alexa has tens of millions of devices out there, and with that kind of scale, it’s definitely a challenge … The volumes of data that we can process are something we’ve enhanced over the last year,” Mevawalla said, adding that his team has measured accuracy improvements of up to 25%. “Language pooling … is another technique that we have leveraged very effectively. And that’s completely non-reviewed, unannotated data that a machine transcribed.”
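Training on “non-reviewed, unannotated data that a machine transcribed” is commonly done through self-training, or pseudo-labeling. A hedged sketch of that loop, with a hypothetical recognizer interface and an arbitrary confidence cutoff, might look like this:

```python
# Sketch of training on machine-transcribed, never-human-reviewed audio:
# an existing recognizer transcribes unlabeled clips, low-confidence
# transcriptions are discarded, and the survivors are pooled across
# languages into the training set. The recognizer interface, confidence
# threshold, and pooling scheme are hypothetical.

def pseudo_label(recognizer, unlabeled_audio, min_confidence=0.9):
    """Return (audio, machine_transcript) pairs deemed good enough to train on."""
    pseudo_labeled = []
    for clip in unlabeled_audio:
        text, confidence = recognizer.transcribe(clip)   # no human in the loop
        if confidence >= min_confidence:
            pseudo_labeled.append((clip, text))
    return pseudo_labeled

def pool_languages(datasets_by_language):
    """Language pooling: merge per-language pseudo-labeled sets so related
    languages and accents share training data."""
    pooled = []
    for language, pairs in datasets_by_language.items():
        pooled.extend((clip, text, language) for clip, text in pairs)
    return pooled
```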