Natural Language Processing (NLP) and ambient sound processing have traditionally been considered cloud-exclusive technologies, which has restricted their adoption in markets where security, privacy, and service continuity are critical requirements for deployment. However, advances in deep learning model compression and edge Artificial Intelligence (AI) chipsets now allow these technologies to be integrated at the end-device level, mitigating security and privacy concerns while ensuring that the enabled services can be delivered consistently and without interruption. ABI Research, a global tech market advisory firm, estimates that over 2 billion end devices will ship with a dedicated chipset for ambient sound or natural language processing by 2026.
“NLP and ambient sound processing will follow the same cloud-to-edge evolutionary path as machine vision. Through efficient hardware and model compression technologies, these workloads now require fewer resources and can be fully embedded in end devices,” says Lian Jye Su, Principal Analyst, Artificial Intelligence and Machine Learning at ABI Research. “At the moment, most implementations focus on simple tasks, such as wake word detection, scene recognition, and voice biometrics. Moving forward, however, AI-enabled devices will feature more complex audio and voice processing applications.”
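To make the "simple task" category concrete, the sketch below shows a deliberately simplified wake-word spotter that matches incoming audio frames against a stored spectral template. Commercial products instead run compressed neural networks on dedicated silicon; the frame size, threshold, template, and function names here are purely illustrative assumptions.

```python
# Toy sketch of on-device wake-word spotting: compare incoming audio frames against
# a stored keyword template using cosine similarity. All constants are assumptions.
import numpy as np

FRAME = 400        # samples per frame (25 ms at 16 kHz) - illustrative choice
THRESHOLD = 0.85   # similarity required to trigger - illustrative choice

def spectral_features(frame):
    """Magnitude spectrum of one windowed frame, normalized to unit length."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    return spec / (np.linalg.norm(spec) + 1e-9)

rng = np.random.default_rng(42)
# A stored "wake word" template (in a real system this would be learned offline).
template = spectral_features(rng.normal(size=FRAME))

def detect(audio):
    """Scan an audio buffer frame by frame and report whether the keyword fired."""
    for start in range(0, len(audio) - FRAME, FRAME):
        feats = spectral_features(audio[start:start + FRAME])
        if float(feats @ template) >= THRESHOLD:
            return True
    return False

print("triggered:", detect(rng.normal(size=16000)))  # one second of simulated audio
```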
The popularity of Alexa, Google Assistant, Siri, and various chatbots in the enterprise sector has fueled the boom in voice user interfaces. In June 2021, Apple announced that Siri would process certain requests and actions offline. This implementation frees Siri from the need for constant internet connectivity and significantly improves the iPhone user experience. ABI Research expects Apple’s competitors, especially Google with its latest Tensor System-on-a-Chip (SoC), to follow suit and offer similar support on the Android operating system, which currently runs on billions of consumer and connected devices.
In the enterprise sector, edge-based ambient sound processing remains at a nascent stage, with Infineon being one of the earliest suppliers of this technology. Increasingly, other sensor vendors are trialing the analysis of machine sound for uptime tracking, predictive maintenance, and machinery analytics. Combining machine sound with other readings, including temperature, pressure, and torque, can yield accurate predictions of machine health and remaining lifespan.
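As a rough illustration of how such sensor fusion might look in practice, the following sketch trains a classifier on simulated windows of audio features combined with temperature, pressure, and torque readings. The feature set, simulated data, labels, and model choice are assumptions made for illustration and do not reflect any vendor's actual implementation.

```python
# Hypothetical sketch: fusing machine-sound features with other sensor readings
# to estimate machine health. All data here is simulated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Simulated training data: each row is one time window of fused sensor readings.
# Columns: [audio_rms, audio_spectral_centroid_hz, temperature_c, pressure_kpa, torque_nm]
X = rng.normal(loc=[0.2, 1500.0, 60.0, 300.0, 45.0],
               scale=[0.05, 300.0, 8.0, 25.0, 5.0],
               size=(500, 5))
# Simulated labels: 0 = healthy, 1 = degraded (an illustrative rule, not real physics).
y = (X[:, 0] + 0.002 * X[:, 2] > 0.35).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Inference on a new window of fused sensor data.
window = np.array([[0.31, 1800.0, 72.0, 310.0, 47.0]])
print("degradation probability:", model.predict_proba(window)[0, 1])
```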
Recognizing the importance of edge-based NLP and ambient sound processing, chipset vendors are actively forming partnerships to boost their capabilities. For example, Qualcomm has been working closely with prominent NLP and audio AI startups, including Audio Analytic and Hugging Face. CEVA, a chipset IP vendor, announced a partnership with Fluent.ai to offer multilingual speech recognition technology in low-power audio devices. The recent collaboration between Syntiant and Renesas aims to provide a multimodal AI platform that combines deep learning-based visual and audio processing.
“Aside from dedicated hardware, machine learning developers are also looking to leverage various novel machine learning techniques such as multimodal learning and federated learning. Through multimodal learning, edge AI systems can become smarter and more secure if they combine insights from multiple data sources. With federated learning, end users can personalize voice AI in end devices, as edge AI can improve based on learning from their unique local environments,” concludes Su.
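The federated learning approach Su describes can be sketched as follows: each device fine-tunes a shared model on its own locally captured audio, and only the resulting weight updates are aggregated centrally, so raw recordings never leave the device. The model size, training rule, and simulated data below are illustrative assumptions, not a description of any specific product.

```python
# Minimal federated-averaging sketch: several devices fine-tune a shared model
# locally and only send updated weights, never raw audio. Everything is simulated.
import numpy as np

rng = np.random.default_rng(1)
global_weights = np.zeros(8)  # toy linear model over 8 audio features

def local_update(weights, features, labels, lr=0.1, epochs=5):
    """One device's local training on its private data (simple logistic regression)."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-(features @ w)))
        grad = features.T @ (preds - labels) / len(labels)
        w -= lr * grad
    return w

# Simulate three devices, each with its own locally recorded examples.
device_updates = []
for _ in range(3):
    feats = rng.normal(size=(40, 8))
    labs = (feats[:, 0] > 0).astype(float)  # toy "wake word present" labels
    device_updates.append(local_update(global_weights, feats, labs))

# The server aggregates only the weights (federated averaging); audio stays on-device.
global_weights = np.mean(device_updates, axis=0)
print("updated global weights:", np.round(global_weights, 3))
```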
These findings are from ABI Research’s Deep Learning-Based Ambient Sound and Language Processing: Cloud to Edge application analysis report.