“We’re trying to prove to consumers that there is something to 5G that makes it different and better than a 4G network,” said Jay Cary, vice president of 5G product and mobility innovation for AT&T. “It has massive computing power, higher speeds and lower latency. This felt like a really amazing way to bring the potential of the network and the technology to life.”
Bugs Bunny is the first animated character AT&T has brought to life with Custom Neural Voice, but it likely won’t be the last. Cary becomes quite animated himself as he talks about the possibilities: characters coming to life from the cereal box, reading you stories, watching cartoons alongside you or showing you around the neighborhood.
“We love that idea of blending the physical environment and the virtual environment,” he said.
To create the custom voice, an approved Bugs voice actor came into the studio to record about 2,000 phrases and lines, with guidance from the Microsoft team, Cary said.
The Warner Bros. team – “the Bugs Bunny experts,” Cary calls them – then worked with the Microsoft team to iterate on the voice, making sure it accurately reflects Bugs Bunny’s personality and all his inflections.
“We wanted to make sure it really represented what Bugs felt like in the real world,” Cary said. “It feels like a natural speed, real-life conversation you might have with a friend. It feels very real.”
Unreal transparency
A conversation with Bugs Bunny might feel real, but everyone knows that it isn’t – because Bugs is a fictional character. That’s an important distinction, and one that Microsoft is careful to protect in every application of the technology. That’s a key reason Custom Neural Voice is offered under limited access, meaning interested customers must apply and be approved by Microsoft before they can use the technology. In this case, general availability means it is ready for production and available in more Azure cloud regions, not that it is available to the general public.
While many uses for Custom Neural Voice involve a fictional character, sometimes a customer wants the voice to be a real person, such as an author reading their own book. Even in those cases, it is important that people know the voice is synthetic, which is why Microsoft includes a disclosure requirement in its contract.
“We require customers to make very clear it’s a synthetic voice or, when it’s not immediately obvious in context, that they explicitly disclose it’s synthetic in a way that’s perceivable by users and not buried in terms,” said Sarah Bird, Responsible AI lead for Cognitive Services within Azure AI.
Another fictional voice that neural text-to-speech is bringing to life is Flo, the longtime brand icon for Progressive Insurance.
A few years ago, the company launched a Flo chatbot in Facebook Messenger, complete with the sunny personality and quirky witticisms that customers have come to expect from the salesperson character played by Stephanie Courtney in TV ads since 2008. When the company started to explore the potential of using a voice conversation to interact with customers, Flo was the natural choice.
“One of Progressive’s core interest areas is we want to make our brand and products available wherever and whenever people want,” said Matt White, technology and innovation manager in Progressive’s acquisition experience group. “That’s why we put Flo in Facebook Messenger, and that’s why we started to explore what’s possible with voice and smart speakers.”
Progressive was already using Azure AI technology to power the chatbot, and it made sense to layer the neural text-to-speech service on top, White said.
The general availability of Custom Neural Voice includes technical controls to help prevent misuse of the service. As part of the voice recording script a customer submits to create the custom voice, the voice actor makes a statement acknowledging that they understand the technology and are aware that the customer is having a Custom Neural Voice made. That recording is compared with the training data using speaker verification technology to make sure the voices match before a customer can begin training the voice. Microsoft also contractually requires customers to get consent from voice talent.
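The matching step can be pictured as comparing voice fingerprints. What follows is a minimal Python sketch, assuming each recording has already been reduced to a fixed-length embedding vector by some speaker model. The vectors, the `same_speaker` helper and the 0.75 threshold are hypothetical values chosen for illustration; this is not Microsoft's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def same_speaker(consent_embedding, training_embeddings, threshold=0.75):
    """Return True only if every training clip is close to the consent clip.

    The threshold is an assumed value, not a documented one.
    """
    if not training_embeddings:
        raise ValueError("at least one training embedding is required")
    return all(
        cosine_similarity(consent_embedding, e) >= threshold
        for e in training_embeddings
    )

# Made-up embeddings: the acknowledgment statement, matching training
# clips, and a clip from a different (hypothetical) speaker.
consent = [0.9, 0.1, 0.3]
training = [[0.88, 0.12, 0.31], [0.91, 0.08, 0.29]]
impostor = [[0.1, 0.9, 0.2]]

print(same_speaker(consent, training))
print(same_speaker(consent, impostor))
```

In this toy version, training would be allowed to start only when the first check passes. A production system would work from audio rather than hand-written vectors.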
“We did a number of studies and had interactions with the voice acting industry and ethicists in the field to come up with sets of guidelines and ways we want to make sure this technology is used,” Boyd said.
A commitment to responsibility
Contractual terms, limiting access to approved customers and performing speaker verification on audio files are three ways Microsoft is safeguarding against misuse of the technology. Bird’s role within Microsoft is to help develop protocols and support teams in responsibly developing features and products within Azure Cognitive Services, as well as empowering customers to use them responsibly.
“We really want to demonstrate how we can create these technologies that have this positive impact while making sure that we’re not causing harm in the world,” Bird said.
Microsoft conducts impact assessments to determine potential risks. Once risks have been identified, features and processes are created to address them. In the case of Custom Neural Voice, such safeguards include the review process for each potential use case, a code of conduct and the verification comparing voice talent acknowledgement files against training audio files.
Bird said the team is also working on a way to embed a digital watermark within a synthetic voice to indicate that the content was created with an Azure Custom Neural Voice.
Such technical and policy features are in line with Microsoft’s commitment to responsible AI. That commitment includes Transparency Notes, which communicate the purposes, capabilities and limitations of an AI system.
“As creators of this technology, we have an obligation to make sure it’s used responsibly,” Boyd said. “We take responsible AI very seriously; it’s one of our core tenets. And we’re careful with the partners we work with in making sure they follow the guidelines.”
Building a custom voice
So how do a bunch of recorded phrases become a natural-sounding voice that can say anything?
Recordings are used to create a font of sounds, or phonemes. It’s somewhat similar to a font on a computer containing letters and characters that you combine to make words and sentences.
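To make the font analogy concrete, here is a toy Python sketch. The `SOUND_FONT` table, its clip names and the `render` helper are invented for illustration; a real voice font stores audio rather than file-name strings.

```python
# Hypothetical "font": each phoneme symbol maps to a recorded sound unit.
SOUND_FONT = {
    "HH": "hh.wav",
    "AH": "ah.wav",
    "L": "l.wav",
    "OW": "ow.wav",
}

def render(phonemes):
    """Look up each phoneme in the font and return the clips in order."""
    missing = [p for p in phonemes if p not in SOUND_FONT]
    if missing:
        raise KeyError(f"no recording for phonemes: {missing}")
    return [SOUND_FONT[p] for p in phonemes]

# "hello" written as an ARPAbet-style phoneme sequence
hello_clips = render(["HH", "AH", "L", "OW"])
print(hello_clips)
```

Just as letters from a font can spell any word, the same handful of clips can be reordered to say words that were never recorded.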
But neural text-to-speech goes way beyond piecing together sounds to form words.
“The real technology breakthrough is the efficient use of deep learning to process the text to make sure the prosody and pronunciation is accurate,” said Xuedong Huang, a Microsoft technical fellow and the chief technology officer of Azure AI Cognitive Services. “The prosody is what the tone and duration of each phoneme should be. We combine those in a seamless way so they can reproduce the voice that sounds like the original person.”
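Huang's description of prosody, a tone and a duration for each phoneme, maps naturally onto a small data structure. The Python sketch below is illustrative only: the `PhonemeTarget` type and every number in it are made up, and in a real system a neural model would predict these values from the text.

```python
from dataclasses import dataclass

@dataclass
class PhonemeTarget:
    phoneme: str
    pitch_hz: float    # the "tone" the phoneme should be spoken at
    duration_ms: int   # how long the phoneme should last

def total_duration_ms(targets):
    """Sum the planned duration of every phoneme in an utterance."""
    return sum(t.duration_ms for t in targets)

# The same four phonemes ("hello") with two different prosody plans.
statement = [
    PhonemeTarget("HH", 120.0, 60),
    PhonemeTarget("AH", 120.0, 80),
    PhonemeTarget("L", 115.0, 70),
    PhonemeTarget("OW", 105.0, 150),  # pitch falls at the end
]
question = [
    PhonemeTarget("HH", 120.0, 60),
    PhonemeTarget("AH", 125.0, 80),
    PhonemeTarget("L", 140.0, 70),
    PhonemeTarget("OW", 180.0, 220),  # pitch rises, final vowel stretches
]

print(total_duration_ms(statement), total_duration_ms(question))
```

The sounds are identical in both plans; only pitch and timing differ, which is the part of the problem the quote attributes to deep learning.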