Can you tell a human from a bot? In one survey, AI voice services creator Podcastle found that two out of three people incorrectly guessed whether a voice was human or AI-generated. In other words, AI voices are becoming harder and harder to distinguish from the voices of real people.
Also: How do AI checkers actually work?
For businesses that might want to rely on artificial voice generation, that’s promising. For the rest of us, it’s a bit terrifying.
Voice synthesis isn’t new
Many AI technologies date back decades. But in the case of voice, we’ve had speech synthesis for centuries. Yeah. This ain’t new.
For example, I invite you to check out Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine (Mechanism of Human Speech, with a Description of His Speaking Machine) from 1791. That work documented how Johann Wolfgang Ritter von Kempelen de Pázmánd used bellows to build a working speaking machine. Von Kempelen is the same inventor behind the famous chess-playing automaton hoax, The Turk, which gave us the term “mechanical turk.”
Also: AI Engineering is the next frontier for technological advances: What to know
One of the most famous synthesized voices of all time was WOPR, the computer from the 1983 movie WarGames. Of course, that wasn’t actually computer-synthesized. In the movie’s audio commentary, director John Badham said that actor John Wood read the script backward to reduce inflection, and then the resulting recording was post-processed in the studio to give it a synthetic sound. “Shall. We. Play. A. Game?”
A real text-to-speech computer-synthesized voice gave physicist Stephen Hawking the voice the world came to know. It was built using a 1986 desktop computer fastened to his wheelchair, and he never swapped it for something more modern. As he put it, “I keep it because I have not heard a voice I like better and because I have identified with it.”
Speech synthesis chips and software are also not new. The 1980s TI-99/4A offered speech through an add-on synthesizer module used by some game cartridges. Mattel had Intellivoice on its Intellivision game console back in 1982. Early Mac fans will probably remember MacinTalk, although even the Apple II had speech synthesis earlier.
Also: How I used ChatGPT to scan 170k lines of code in seconds and save me hours of detective work
Most of these implementations, and most that followed until the mid-2010s, used basic phonemes to create speech. English words can be broken down into roughly 44 phonemes: about 24 consonant sounds and about 20 vowel sounds. Those sounds were synthesized or recorded, and when a word needed to be “spoken,” the phonemes were assembled in sequence and played back.
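For the curious, here’s a minimal Python sketch of that concatenative approach. The tiny pronunciation dictionary and the per-phoneme WAV clips (phonemes/HH.wav and so on) are hypothetical stand-ins for illustration, not data from any shipping product:

```python
# A minimal sketch of 1980s-style concatenative speech synthesis.
# The pronunciation dictionary and the per-phoneme WAV clips are hypothetical.
import wave

# Tiny hand-built pronunciation dictionary (word -> ARPAbet-style phonemes)
PRONUNCIATIONS = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def speak(text: str, out_path: str = "speech.wav") -> None:
    """Concatenate pre-recorded phoneme clips into one audio file."""
    phonemes = [p for word in text.lower().split()
                  for p in PRONUNCIATIONS.get(word, [])]
    with wave.open(out_path, "wb") as out:
        for i, ph in enumerate(phonemes):
            with wave.open(f"phonemes/{ph}.wav", "rb") as clip:
                if i == 0:
                    out.setparams(clip.getparams())  # copy sample rate, channels, etc.
                out.writeframes(clip.readframes(clip.getnframes()))

speak("hello world")
```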
It worked, it was reliable, and it was effective. It just didn’t sound like Alexa or Siri.
Today’s AI voices
Now, with the addition of AI technologies and far greater processing power, voice synthesis can sound like actual voices. In fact, today’s AI voice generation can create voices that sound like people we know, which could be a good or bad thing. Let’s take a look at both.
1. Voice scams
In January, a voice service telecom provider transmitted thousands of fraudulent robocalls using an AI-generated voice that sounded like President Joe Biden. The voice told New Hampshire voters that if they voted in the state’s then-upcoming primary, they wouldn’t be allowed to vote in the November general election.
Also: The AI scams infiltrating the knitting and crochet world – and why it matters for everyone
The FCC was not amused. This kind of misrepresentation is illegal, and the voice service provider has agreed to pay $1 million to the government in fines. In addition, the political operative who set up the scam is facing a court case that could result in him owing $6 million to the government.
2. Content creation (and more voice scams)
Creating an AI voice that mimics a specific person is called voice cloning, and it has both practical and nefarious applications. For example, video-editing service Descript has an overdub capability that clones your voice. Then, if you edit a video, it can dub your voice over your changes, so you don’t have to go back and re-record anything.
Descript’s software will even sync your lip movements to the generated words, so it looks like you’re saying what you type into the editor.
Also: The best AI chatbots of 2024: ChatGPT, Copilot, and worthy alternatives
As someone who spends way too much time editing and re-shooting video mistakes, I can see the benefit. But I can’t help but picture the evil this technology can also foster. The FTC has a page detailing how scammers use fake text messages to perpetrate family emergency scams.
But with voice cloning and generative AI, Mom might get a call from Jane, and it really sounds like Jane. After a short conversation, Mom learns that Jane is stranded in Mexico or Muncie and needs a few thousand dollars to get home. It’s Jane’s voice, so Mom sends the cash. As it turns out, Jane is just fine and completely unaware of the scam targeting her mother.
Now, add in lip-synching. You can predict the rise in fake kidnapping scams demanding ransom payments. I mean, why take the risk of actually abducting a student traveling abroad (especially since so many students post to social media while traveling) when a completely fake video would do the trick?
Also: Did you get a fake McAfee or Norton invoice? How the scam works
Does it work all the time? No. But it doesn’t have to. It’s still scary.
3. Accessibility aids
But it’s not all doom and gloom. While nuclear research brought about the bomb, it also paved the way for nuclear medicine, which has helped save countless lives.
Also: 7 Android accessibility features that can make your life easier
Just as that old 1986 desktop computer gave Professor Hawking his voice, modern AI-based voice generation is helping patients today. NBC has a report on technology being developed at UC Davis that is helping give an ALS patient the ability to speak.
The project uses a range of technologies, including brain implants that process neural patterns, AI that converts those patterns into the words the patient wants to say, and an AI voice generator that speaks in the actual voice of the patient. The ALS patient’s voice was cloned from recordings that were made of his voice before the disease took away his ability to speak.
4. Voice agents for customer service
AI in call centers is a very fraught topic. Heck, the very topic of call centers is fraught. There’s the impersonal feeling you get when you have to work your way through a “press 1 for whatever” call tree. There’s the frustration of waiting another 40 minutes to reach an agent.
Then there’s the frustration of dealing with an agent who is clearly not trained or is working from a script that doesn’t address your issue. There’s also the frustration that arises when you and the agent can’t understand each other because of accents or differences in language fluency.
Also: The best free AI courses in 2024
And how many times have you been disconnected when a first-level agent couldn’t successfully transfer you to a manager?
AI in call centers can help. I was recently dumped into an AI when I needed to solve a technical problem. I’d already filed a help ticket — and waited a week for a fairly unhelpful response. Human voice support wasn’t available. Out of frustration and a tiny bit of curiosity, I finally decided to click the “AI Help” button.
As it turns out, it was a very well-trained AI, able to answer fairly complex technical questions and understand and implement the configuration changes my account needed. There was no waiting, and my issue, which had festered for more than a week, was solved in about 15 minutes.
Another example is Fair Square Medicare. The company uses voice assistants to help seniors choose the right Medicare plan. Medicare is complex, and the choices are not obvious. Seniors are often overwhelmed by their options and struggle with impatient agents. But Fair Square has built a generative AI voice platform on top of GPT-4 that can guide seniors through the process, often without long waits.
Also: How I test an AI chatbot’s coding ability – and you can, too
Sure, it’s sometimes nice to be able to talk to a human. But if you’re unable to get connected to a knowledgeable and helpful human, an AI may well be a viable alternative.
5. Intelligent assistants
Next up are the intelligent assistants: Alexa, Google Assistant, and Siri. For these assistants, voice essentially is the product. Siri, when it first hit the market in 2011, was amazing in terms of what it could do. Alexa, back in 2014, was also impressive.
While both products have evolved, improvements have been incremental over the years. Both added some level of scripting and home control, but the AI elements seem to have stagnated.
Also: This AI model lets you generate videos using only your photos
Neither can match ChatGPT‘s voice chat capabilities, especially when running ChatGPT Plus and GPT-4o. Siri and Alexa both offer home automation and standalone devices that can be used without a smartphone, but ChatGPT’s voice assistant is astonishing in conversation.
It can maintain full conversations, pull up answers (albeit sometimes made up) that go beyond the stock “According to an Alexa Answers contributor,” and follow conversational guidelines.
While Alexa’s (and, to a lesser extent, Siri and Google Assistant’s) voice quality is good, ChatGPT’s vocal intonations are more nuanced. That said, I personally find ChatGPT almost too friendly and cheerful, but that could be just me.
Also: Midjourney’s AI-image generator website is now officially open to everyone – for free
Of course, one other standout capability of voice assistants is voice recognition. These devices have an array of microphones that allow them to not only distinguish human voices from background noise but also to hear and process human speech, at least enough to create responses.
How AI voice generation works
Fortunately, most programmers don’t have to develop their own voice generation technology from scratch. Most of the major cloud players offer AI voice generation services that your application can call as a microservice or through an API. These include Google Cloud Text-to-Speech, Amazon Polly, Microsoft’s Azure AI Speech, Apple’s speech framework, and more.
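To give you a sense of how simple these services are to call, here’s a minimal sketch using Amazon Polly through the boto3 library. It assumes AWS credentials are already configured; the voice ID and file name are just examples:

```python
# Minimal sketch: cloud text-to-speech with Amazon Polly via boto3.
# Assumes AWS credentials are already configured; voice and filename are examples.
import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Shall we play a game?",
    VoiceId="Joanna",        # one of Polly's built-in voices
    OutputFormat="mp3",
)

# The audio comes back as a stream; save it to a playable file.
with open("speech.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```

The other cloud services follow the same general shape: send text, get audio back.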
Also: How does ChatGPT actually work?
In terms of functionality, speech generators start with text. That text might be written by a human or generated by an AI like ChatGPT. The text is then converted into spoken audio, which is fundamentally a set of sound waves that can be heard by the human ear and picked up by microphones.
We talked about phonemes earlier. The AI processes the generated text and performs phonetic analysis, mapping the words to the speech sounds that represent them.
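If you want to see what that phonetic analysis looks like in practice, here’s a small sketch that looks words up in the CMU Pronouncing Dictionary using the third-party pronouncing package. It illustrates the text-to-phoneme step; it isn’t the code any particular AI service uses:

```python
# Sketch: the phonetic-analysis step, using the CMU Pronouncing Dictionary
# via the third-party "pronouncing" package (pip install pronouncing).
import pronouncing

for word in "shall we play a game".split():
    phones = pronouncing.phones_for_word(word)
    # phones_for_word returns a list of possible pronunciations; take the first.
    print(word, "->", phones[0] if phones else "?")
```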
Neural networks (code that processes patterns of information) use deep learning models trained on huge datasets of human speech. From those millions of speech examples, the AI learns to modify the basic word sounds to reflect intonation, stress, and rhythm, making the output sound more natural and lifelike.
Some AI voice generators then personalize the output further, adjusting pitch and tone to represent different voices and even applying accents that reflect speech coming from a particular region. Right now, that’s beyond ChatGPT’s smartphone app, but you can ask Siri and Alexa to use different voices or voices from various regions.
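With cloud services, those adjustments are usually expressed through SSML markup rather than by editing audio. Here’s a hedged sketch, again using Amazon Polly as the example; the pitch and rate values are arbitrary, and Brian is just one of Polly’s British-English voices, chosen to illustrate the regional-accent idea:

```python
# Sketch: adjusting pitch, speaking rate, and accent with SSML via Amazon Polly.
# The pitch/rate values are arbitrary examples, not recommendations.
import boto3

polly = boto3.client("polly")

ssml = """
<speak>
  <prosody pitch="-10%" rate="90%">
    I keep this voice because I have identified with it.
  </prosody>
</speak>
"""

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",          # tell Polly the input is SSML, not plain text
    VoiceId="Brian",          # a British-English voice, as a regional-accent example
    OutputFormat="mp3",
)

with open("speech_adjusted.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```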
Speech recognition works in reverse. It captures sounds and turns them into text that can then be fed into a processing technology like ChatGPT or Alexa’s back end. As with voice generation, cloud services offer speech recognition capabilities. Microsoft’s and Google’s cloud platforms mentioned above also offer speech recognition alongside text-to-speech, while Amazon handles recognition in a separate service, Amazon Transcribe.
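Here’s what that looks like in practice with Amazon Transcribe, again via boto3. The S3 URI and job name are placeholders, and the polling loop is simplified for illustration:

```python
# Sketch: speech-to-text with Amazon Transcribe via boto3.
# The S3 URI and job name are placeholders; AWS credentials are assumed to be configured.
import time
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="demo-job",
    LanguageCode="en-US",
    MediaFormat="mp3",
    Media={"MediaFileUri": "s3://my-bucket/recording.mp3"},
)

# Poll until the asynchronous job finishes, then print the transcript location.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="demo-job")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

if status == "COMPLETED":
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```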
The first stage of voice recognition is sound wave analysis. Here, sound waves captured by a microphone are converted into digital signals, roughly the equivalent of glorified WAV files.
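If you’re curious what that “glorified WAV file” looks like to a program, this small sketch reads one into an array of numeric samples. The file name is a placeholder for any 16-bit mono recording:

```python
# Sketch: turning captured audio into digital samples a recognizer can work with.
# "recording.wav" is a placeholder for a 16-bit mono WAV file.
import wave
import numpy as np

with wave.open("recording.wav", "rb") as wav:
    sample_rate = wav.getframerate()
    frames = wav.readframes(wav.getnframes())

# Interpret the raw bytes as 16-bit signed integers, one value per sample.
samples = np.frombuffer(frames, dtype=np.int16)
print(f"{len(samples)} samples at {sample_rate} Hz "
      f"({len(samples) / sample_rate:.2f} seconds of audio)")
```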
Also: The best AI image generators of 2024
That digital signal then goes through a preprocessing stage, where background noise is removed and any recognizable audio is split into phonemes. The AI also performs feature extraction, identifying characteristics like frequency and pitch, which help it confirm which sounds it thinks are phonemes.
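One common way to do that feature extraction is to compute mel-frequency cepstral coefficients (MFCCs) plus a pitch estimate. Here’s a sketch using the open-source librosa library; the file name is a placeholder, and the parameter values are typical choices rather than requirements:

```python
# Sketch: feature extraction on preprocessed audio using the librosa library.
# "recording.wav" is a placeholder; parameter values are typical, not required.
import librosa

y, sr = librosa.load("recording.wav", sr=16000)

# Mel-frequency cepstral coefficients: a compact summary of the sound's frequency content.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# A rough pitch track, estimated with the YIN algorithm.
pitch = librosa.yin(y, fmin=80, fmax=400, sr=sr)

print(mfccs.shape, pitch.shape)
```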
Next comes the model matching phase, where the AI uses large trained datasets to match the extracted sound segments against known speech patterns. These speech patterns then go through language processing, where the AI pulls together all the data it can find to convert the sounds into text-based words and sentences. It also uses grammar models to help arbitrate questionable sounds, composing sentences that make linguistic sense.
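To see why that grammar step matters, here’s a toy sketch. Given two acoustically similar candidate transcriptions, a simple word-pair (bigram) score picks the one that makes linguistic sense. The candidate sentences and probabilities are made up purely for illustration:

```python
# Toy sketch: using a made-up bigram language model to arbitrate between
# acoustically similar candidate transcriptions.
from itertools import pairwise  # Python 3.10+

# Hypothetical bigram probabilities learned from a text corpus.
BIGRAM_PROB = {
    ("recognize", "speech"): 0.02,
    ("wreck", "a"): 0.001,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.005,
}

def score(sentence: str) -> float:
    """Multiply bigram probabilities; unseen pairs get a small smoothing value."""
    prob = 1.0
    for pair in pairwise(sentence.lower().split()):
        prob *= BIGRAM_PROB.get(pair, 1e-6)
    return prob

candidates = ["recognize speech", "wreck a nice beach"]
print(max(candidates, key=score))  # the more sensible sentence wins
```

Real systems use far larger language models and normalize for sentence length, but the arbitration idea is the same.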
And then, all of that is converted into text that’s used either as input for additional systems or transcribed and displayed on screen.
So there you go. Did that answer your questions about AI voice generation, how it’s used and how it works? Do you have additional questions? Do you expect to use AI voice generation either in your normal workflow or your own applications? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.