Researchers find high error rates in commercial speech recognition systems

October 22, 2020

Some automatic speech recognition (ASR) systems might be less accurate than previously assumed. That’s the top-level finding of a recent study by researchers at Johns Hopkins University, the Poznan University of Technology in Poland, the Wrocław University of Science and Technology, and startup Avaya, which benchmarked commercial speech recognition models on an internally created dataset. The coauthors claim that the word error rates (WER) — a common speech recognition performance metric — were significantly higher than the best-reported results and that this could indicate a wider-ranging problem in the field of natural language processing (NLP).

ASR has become ubiquitous; it transcribes meetings, takes dictation for emails, helps to manage smart appliances, and more. A comprehensive benchmark of ASR models cites WER as low as 2% to 3% on standard corpora, but the coauthors of this latest report argue that figure doesn't reflect real-world use. The majority of interactions with ASR systems happen in the context of “chatbot-like interactions,” they claim, where people are aware they’re conversing with a machine and thus simplify their commands to short, well-structured phrases, in contrast to the disfluencies that are the hallmark of natural conversation.

The coauthors evaluated several ASR systems on a dataset of 50 call center conversations from 1,595 agents and 1,261 customers, which spanned 8.5 hours in length — 2.2 hours of which was speech. Depending on the dataset, the ASR systems’ previously published error rates didn’t exceed 15% and dropped as low as 2%. This was in contrast with the study’s findings; tested across recorded phone conversations about finance, insurance, telecom, and booking, the coauthors observed WER as high as 23.31%. The highest rates were on the booking and telecom calls, perhaps because the conversations referred to specific dates and times, money, places, and product and company names. But WER was above 13.73% in every domain.
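For readers unfamiliar with the metric, WER is conventionally computed as the number of word-level substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; the function name and sample sentences are made up for illustration, and the article doesn't describe the text normalization or scoring tools the researchers actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative only: lowercasing and whitespace splitting are assumed
    normalization steps, not the study's documented procedure.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in hypothesis (insertion)
                prev[j - 1] + cost,  # match, or substitution when the words differ
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical booking-style utterance: one dropped word and one misheard place name.
reference = "book a flight to boston on may fifth"
hypothesis = "book flight to austin on may fifth"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # 2 errors / 8 words = 25.00%
```

In this made-up example, a single misrecognized place name and one dropped word in an eight-word request are enough to produce a 25% WER, which shows how entity-heavy calls like the booking and telecom conversations can push error rates up quickly.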

The researchers attribute the disparity to the simplicity of frequently used benchmarks like Librispeech (1,000 hours of English audiobook recordings), WSJ (dictations and conversations from journalists), and Switchboard (phone exchanges), which they say might be too simple to truly challenge ASRs. Even more holistic benchmarks suffer from the “domain adaptation problem” — while they attempt to mimic real, spontaneous conversations, they’re inherently artificial because they involve pairs of voice actors having a conversation on subjects drawn from agreed-upon topics. Other benchmark datasets come from scripted or semi-scripted conversations like TED Talks. Moreover, the datasets tend to be homogeneous with respect to voice actor demographics. Non-native language speakers are virtually absent from benchmark datasets, and factors like pronunciation, linguistics, and gender often aren’t accounted for.

“Benchmark datasets do not represent the true diversity of real-world conversations, both at input signal characteristics and conversation semantics levels,” the coauthors wrote. “The domain of application imposes strict constraints on the vocabulary and the form of the conversations … There are consequential differences between scripted and spontaneous conversations and they affect the results of the ASR evaluation.”

As a remedy, the researchers suggest the ASR and NLP communities collect and annotate audio datasets better aligned with contemporary applications of ASR systems. They also call for work on extended and more inclusive acoustic models representing a broader spectrum of dialects, as well as models that account for technological advances that influence physical properties of processed audio signals.

“These problems are not insurmountable. A thoughtful collaboration between academia and industry partners can lead to the creation of high-quality training and testing datasets,” the researchers continued. “We believe that the overly optimistic perception of ASR accuracy is detrimental to the development of conversational natural language processing downstream applications.”


Source: VentureBeat
