Meta, owner of Facebook, Instagram, and WhatsApp, on Tuesday unveiled its latest effort in machine translation, this one geared toward speech translation.
The program, SeamlessM4T, surpasses existing models that are trained specifically for speech-to-speech translation between languages, as well as models that convert between speech and text in multiple language pairs. SeamlessM4T is thus an example not just of generality but of what is called multi-modality: the ability of one program to operate on multiple data types, in this case both speech and text.
Also: Meta to release open-source commercial AI model to compete with OpenAI and Google
Previously, Meta focused on large language models that can translate text between 200 different languages. That focus on text is a problem, say lead author Loïc Barrault and colleagues at both Meta and the University of California, Berkeley.
“While single, unimodal models such as No Language Left Behind (NLLB) push text-to-text translation (T2TT) coverage to more than 200 languages, unified S2ST [speech-to-speech translation] models are far from achieving similar scope or performance,” write Barrault and team.
The formal paper, “SeamlessM4T — Massively Multilingual & Multimodal Machine Translation,” is posted on Meta’s dedicated site for the overall project, Seamless Communication. There is also a companion GitHub site.
Speech has been left behind partly because less speech data is readily available in the public domain to train neural networks, write the authors. But there’s a deeper point: Speech data is fundamentally richer as a signal for neural networks.
“The very challenge around why speech is harder to tackle from a machine translation standpoint — that it encodes more information and expressive components — is also why it is superior at conveying intent and forging stronger social bonds between interlocutors,” they write.
The goal of SeamlessM4T is to create one program that is trained on both speech data and text data at the same time. The “M4T” stands for “Massively Multilingual & Multimodal Machine Translation.” Multi-modality is an explicit part of the program.
Also: Meta’s latest AI model will make content available in hundreds of languages
Such a program is sometimes referred to as an "end-to-end" program because it doesn't break the parts that are about text and the parts that are about speech into separate functions, as in the case of "cascaded models," where the program is first trained on one step, such as speech-to-text, and then on another, such as text-to-speech.
As the program’s authors put it, “most S2ST [speech-to-speech translation] systems today rely heavily on cascaded systems composed of multiple subsystems that perform translation progressively — e.g., from automatic speech recognition (ASR) to T2TT [text-to-text translation], and subsequently text-to-speech (TTS) synthesis in a 3-stage system.”
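To make the contrast concrete, here is a minimal sketch, in Python, of the two designs: a three-stage cascade that hands its output from one subsystem to the next, versus a single end-to-end model that maps source speech to target speech directly. The callables passed in are placeholders for the subsystems, not SeamlessM4T's actual API.

```python
from typing import Callable

# A minimal sketch of the contrast described above. The callables passed in
# (asr, t2tt, tts, model) stand in for real subsystems; their names and
# signatures are assumptions for illustration, not SeamlessM4T's API.

def cascaded_s2st(speech: bytes,
                  asr: Callable[[bytes], str],
                  t2tt: Callable[[str], str],
                  tts: Callable[[str], bytes]) -> bytes:
    """Three-stage cascade: translation happens progressively, one subsystem at a time."""
    text_src = asr(speech)     # 1. automatic speech recognition
    text_tgt = t2tt(text_src)  # 2. text-to-text translation
    return tts(text_tgt)       # 3. text-to-speech synthesis

def end_to_end_s2st(speech: bytes,
                    model: Callable[[bytes], bytes]) -> bytes:
    """End-to-end: a single jointly trained model maps source speech to target speech."""
    return model(speech)
```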
Instead, the authors built a program that combines multiple existing parts trained together. They included “SeamlessM4T-NLLB, a massively multilingual T2TT model,” plus a program called w2v-BERT 2.0, “a speech representation learning model that leverages unlabeled speech audio data,” plus T2U, “a text-to-unit sequence-to-sequence model,” and multilingual HiFi-GAN, a “unit vocoder for synthesizing speech from units.”
Also: Meta’s ‘data2vec’ is a step toward One Neural Network to Rule Them All
All four components are plugged together, Lego-like, into a single framework called UnitY, also introduced by Meta this year, which is described as “a two-pass modeling framework that first generates text and subsequently predicts discrete acoustic units.”
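The following sketch, again using placeholder names rather than the real interfaces, shows how such a two-pass composition fits together: a speech encoder in the style of w2v-BERT 2.0 feeds a text decoder (first pass), whose output is converted into discrete acoustic units (second pass) and finally rendered as a waveform by a HiFi-GAN-style vocoder.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative sketch of a UnitY-style two-pass pipeline as described above.
# Every class and field name here is a placeholder standing in for the real
# components (w2v-BERT 2.0, an NLLB-style text decoder, T2U, HiFi-GAN); it is
# not their actual interface.

@dataclass
class TwoPassS2ST:
    speech_encoder: Callable[[bytes], List[float]]   # w2v-BERT 2.0-style speech representations
    text_decoder: Callable[[List[float], str], str]  # pass 1: generate target-language text
    text_to_unit: Callable[[str], List[int]]         # pass 2: text -> discrete acoustic units (T2U)
    vocoder: Callable[[List[int]], bytes]            # HiFi-GAN-style unit vocoder -> waveform

    def translate(self, speech: bytes, tgt_lang: str) -> bytes:
        features = self.speech_encoder(speech)
        text = self.text_decoder(features, tgt_lang)  # first pass: text
        units = self.text_to_unit(text)               # second pass: acoustic units
        return self.vocoder(units)                    # synthesize target speech
```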
The overall architecture is shown in the diagram below.
The program does better than several other kinds of programs on tests of speech recognition, speech-to-text translation, and speech-to-speech translation, the authors report. That includes beating rival programs that are also end-to-end, as well as cascaded programs designed explicitly for speech:
We find that SeamlessM4T-Large, the larger model of the two we release, outperforms the previous state-of-the-art (SOTA) end-to-end S2TT model (AudioPaLM-2-8B-AST [Rubenstein et al., 2023]) by 4.2 BLEU points on Fleurs [Conneau et al., 2022] when translating into English (i.e., an improvement of 20%). Compared to cascaded models, SeamlessM4T-Large improves translation accuracy by over 2 BLEU points. When translating from English, SeamlessM4T-Large improves on the previous SOTA (XLS-R-2B-S2T [Babu et al., 2022]) by 2.8 BLEU points on CoVoST 2 [Wang et al., 2021c], and its performance is on par with cascaded systems on Fleurs. On the S2ST task, SeamlessM4T-Large outperforms strong 3-stage cascaded models (ASR, T2TT and TTS) by 2.6 ASR-BLEU points on Fleurs. On CVSS, SeamlessM4T-Large outperforms a 2-stage cascaded model (Whisper-Large-v2 + YourTTS [Casanova et al., 2022]) by a large margin of 8.5 ASR-BLEU points (a 50% improvement). Preliminary human evaluations of S2TT outputs evinced similarly impressive results. For translations from English, XSTS scores for 24 evaluated languages are consistently above 4 (out of 5); for into English directions, we see significant improvement over Whisper-Large-v2’s baseline for 7 out of 24 languages.
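The ASR-BLEU figures quoted above are obtained, roughly speaking, by transcribing the model's generated speech with an automatic speech recognition system and then scoring the transcripts against reference translations with BLEU. Here is a minimal sketch, assuming a placeholder `transcribe` function for the ASR step and using the sacrebleu package for scoring:

```python
from typing import Callable, List
import sacrebleu  # pip install sacrebleu

# Rough sketch of an ASR-BLEU computation: transcribe the generated target-language
# speech with an ASR system, then score the transcripts against reference
# translations with corpus-level BLEU. The `transcribe` callable is a placeholder
# for whichever ASR model is used; it is an assumption made for illustration.

def asr_bleu(generated_speech: List[bytes],
             references: List[str],
             transcribe: Callable[[bytes], str]) -> float:
    hypotheses = [transcribe(wav) for wav in generated_speech]
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```

As a quick sanity check on the arithmetic, an 8.5-point gain described as a 50% improvement implies the two-stage cascaded baseline scored roughly 17 ASR-BLEU on CVSS.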
Also: Google’s ‘translation glasses’ were actually at I/O 2023, and right in front of our eyes
The companion GitHub site offers not just the program code but also SONAR, a new technology for “embedding” multi-modal data, and BLASER 2.0, a new version of a metric for automatically evaluating multi-modal translation tasks.
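The idea behind a shared multimodal embedding space such as SONAR's is that a speech clip and a text sentence land in the same vector space, so their semantic closeness can be measured directly. The sketch below illustrates that idea with generic encoder placeholders and cosine similarity; it does not use SONAR's actual interface.

```python
from typing import Callable, Sequence
import numpy as np

# Generic sketch of what a shared speech/text embedding space enables: a speech
# clip and a sentence are mapped into the same vector space, so semantic closeness
# can be measured directly. The two encoder callables are placeholders, not
# SONAR's actual interface.

def cross_modal_similarity(speech: bytes,
                           sentence: str,
                           embed_speech: Callable[[bytes], Sequence[float]],
                           embed_text: Callable[[str], Sequence[float]]) -> float:
    u = np.asarray(embed_speech(speech), dtype=np.float64)
    v = np.asarray(embed_text(sentence), dtype=np.float64)
    # Cosine similarity: values near 1.0 mean the speech and the text are close in meaning.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```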