Facebook Wav2vec-U learns to recognize speech from unlabeled data

May 21, 2021

Facebook today announced that it trained an AI model to build speech recognition systems that don’t require transcribed data. The company, which trained systems for Swahili, Tatar, Kyrgyz, and other languages, claims that the model, wav2vec Unsupervised (Wav2vec-U), is an important step toward building machines that can solve a range of tasks by learning from their observations.

AI-powered speech transcription platforms are a dime a dozen in a market estimated to be worth over $1.6 billion. Deepgram and Otter.ai build voice recognition models for cloud-based real-time processing, while Verbit offers tech not unlike that of Oto, which combines intonation with acoustic data to bolster speech understanding. Amazon, Google, Facebook, and Microsoft offer their own speech transcription services.

But the dominant form of AI for speech recognition falls into a category known as supervised learning. Supervised learning is defined by its use of labeled datasets to train algorithms to classify data and predict outcomes, which, while effective, is time-consuming and expensive. Companies have to obtain tens of thousands of hours of audio and recruit human teams to manually transcribe the data. And this same process has to be repeated for each language.

Unsupervised speech recognition

Facebook’s Wav2vec-U aims to address the challenges of supervised learning by taking a self-supervised (often described as unsupervised) approach. With unsupervised learning, Wav2vec-U is fed “unknown” data for which no previously defined labels exist. The system must teach itself to classify the data, processing it to learn from its structure.

While relatively underexplored in the speech domain, a growing body of research demonstrates the potential of learning from unlabeled data. Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. More recently, Facebook itself announced SEER, an unsupervised model trained on a billion images that achieves state-of-the-art results on a range of computer vision benchmarks.

Wav2vec-U learns purely from recorded speech and unpaired text, eliminating the need for transcriptions. Using Facebook’s self-supervised wav2vec 2.0 model together with a clustering method, Wav2vec-U segments recordings into units that loosely correspond to particular sounds.
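The segmentation step can be pictured with a minimal sketch. It assumes each audio frame has already been turned into a feature vector (in the real system these would come from wav2vec 2.0; here they are made-up 2-D points), and it uses a toy k-means as the clustering method, which is an assumption for illustration. The function names `kmeans` and `segment` are hypothetical and are not part of Facebook's released code.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def kmeans(points, k, iters=20, seed=0):
    """Toy k-means over lists of floats; returns one cluster ID per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input frames
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every frame to its nearest centroid.
        assign = [nearest(p, centroids) for p in points]
        # Move each centroid to the mean of the frames assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign


def segment(cluster_ids):
    """Merge runs of identical cluster IDs into (start, end, id); end is exclusive."""
    segments = []
    start = 0
    for i in range(1, len(cluster_ids) + 1):
        if i == len(cluster_ids) or cluster_ids[i] != cluster_ids[start]:
            segments.append((start, i, cluster_ids[start]))
            start = i
    return segments


# Made-up frame features: two frames of one "sound", two of another, then the first again.
frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [0.0, 0.0]]
unit_ids = kmeans(frames, k=2)
print(segment(unit_ids))
```

Consecutive frames that land in the same cluster are merged into one segment, so a boundary appears wherever the cluster ID changes. The published system is more involved than this sketch.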

To learn to recognize words in a recording, Facebook trained a generative adversarial network (GAN) consisting of a generator and a discriminator. The generator takes each audio segment and predicts the phoneme (i.e., a unit of sound in a language) it corresponds to. It’s trained by trying to fool the discriminator, which assesses whether the predicted phoneme sequences seem realistic. The discriminator, in turn, is fed both the generator’s output and real text that has been “phonemized,” and learns to tell the two apart.

While the GAN’s transcriptions are initially poor in quality, they improve with the feedback of the discriminator.
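The tug-of-war between the two networks can be illustrated with the textbook GAN loss terms. The sketch below does no training and is not Facebook's implementation: `toy_discriminator` is a hypothetical stand-in that scores a phoneme sequence by how many of its adjacent phoneme pairs also occur in a tiny, made-up "phonemized" text, and the loss functions are the standard formulation, which the published method may extend with additional terms.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-9):
    """Standard GAN discriminator loss: low when D(real) is near 1 and D(fake) near 0."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))


def generator_loss(d_fake, eps=1e-9):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(d_fake + eps)


def toy_discriminator(seq, real_bigrams):
    """Hypothetical scorer: share of adjacent phoneme pairs also seen in real text."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.5
    hits = sum(1 for pair in pairs if pair in real_bigrams)
    # Squash into [0.05, 0.95] so the log terms above stay finite.
    return 0.05 + 0.9 * hits / len(pairs)


# Made-up phonemized text that the discriminator treats as "real".
real_text = [["k", "ae", "t"], ["b", "ae", "t"]]
real_bigrams = {pair for seq in real_text for pair in zip(seq, seq[1:])}

plausible = toy_discriminator(["k", "ae", "t"], real_bigrams)    # looks like real text
implausible = toy_discriminator(["t", "k", "ae"], real_bigrams)  # odd phoneme order

print(generator_loss(plausible), generator_loss(implausible))
print(discriminator_loss(toy_discriminator(real_text[0], real_bigrams), implausible))
```

A generator whose output resembles real phoneme sequences gets a lower loss, which is the feedback signal that gradually improves the transcriptions.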

“It takes about half a day — roughly 12 to 15 hours on a single GPU — to train an average Wav2vec-U model. This excludes self-supervised pre-training of the model, but we previously made these models publicly available for others to use,” Facebook AI research scientist manager Michael Auli told VentureBeat via email. “Half a day on a single GPU is not very much, and this makes the technology accessible to a wider audience to build speech technology for many more languages of the world.”

To get a sense of how well Wav2vec-U works in practice, Facebook says it evaluated it first on a benchmark called TIMIT. Trained on as little as 9.6 hours of speech and 3,000 sentences of text data, Wav2vec-U reduced the error rate by 63% compared with the next-best unsupervised method.
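For readers unfamiliar with how such error rates are computed, here is a minimal sketch. It assumes the metric is an edit-distance-based phoneme error rate, the usual measure on TIMIT, and that "reduced by 63%" refers to a relative reduction. The phoneme sequences are toy values invented for illustration and are not results from the study.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: substitutions, insertions and deletions between sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance divided by the length of the reference sequence."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(old_rate, new_rate):
    """Fraction by which new_rate improves on old_rate."""
    return (old_rate - new_rate) / old_rate


# Toy sequences only; these are not figures from the study.
reference = ["k", "ae", "t", "s"]
older_system = ["g", "ae", "d", "s"]  # two substitutions
newer_system = ["k", "ae", "d", "s"]  # one substitution

old_per = phoneme_error_rate(reference, older_system)
new_per = phoneme_error_rate(reference, newer_system)
print(old_per, new_per, relative_reduction(old_per, new_per))
```

In this toy case, halving the number of errors is a 50% relative reduction, even though the absolute error rate drops by only 25 percentage points.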

Wav2vec-U was also as accurate as the state-of-the-art supervised speech recognition method from only a few years ago, which was trained on hundreds of hours of speech data.

Future work

AI has a well-known bias problem, and unsupervised learning doesn’t eliminate the potential for bias in a system’s predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Some experts, including Facebook chief scientist Yann LeCun, theorize that removing these biases might require a specialized training of unsupervised models with additional, smaller datasets curated to “unteach” specific biases.

Facebook acknowledges that more research must be done to figure out the best way to address bias. “We have not yet investigated potential biases in the model. Our focus was on developing a method to remove the need for supervision,” Auli said. “A benefit of the self-supervised approach is that it may help avoid biases introduced through data labeling, but this is an important area that we are very interested in.”

In the meantime, Facebook is releasing the code for Wav2vec-U in open source to enable developers to build speech recognition systems using unlabeled speech audio recordings and unlabeled text. While Facebook didn’t use user data for the study, Auli says that there’s potential for the model to support future internal and external tools, like video transcription.

“AI technologies like speech recognition should not benefit only people who are fluent in one of the world’s most widely spoken languages. Reducing our dependence on annotated data is an important part of expanding access to these tools,” Facebook wrote in a blog post. “People learn many speech-related skills just by listening to others around them. This suggests that there is a better way to train speech recognition models, one that does not require large amounts of labeled data.”

