Google’s AI lets users search language-agnostic knowledge bases in their native tongue

Entity linking fulfills a key role in grounded language understanding. Given a text mention of an entity (e.g., an ambiguous name like "Washington," which could refer to a state, a city, or a person), an algorithm identifies the entity's corresponding entry in a knowledge base (such as a Wikipedia article). To extend the technique's usefulness, researchers at Google propose a new formulation in which language-specific mentions resolve to a language-agnostic knowledge base. They describe a single entity retrieval model that covers over 100 languages and 20 million entities while ostensibly outperforming results from more limited cross-lingual tasks.
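To make the setup concrete, here is a deliberately naive sketch of the task (not Google's system): mentions written in different languages all resolve to the same language-agnostic entity ID. The QIDs are real Wikidata identifiers, but the alias lookup is a toy stand-in for the learned retrieval model described below.

```python
# Toy illustration of multilingual entity linking: mentions in different
# languages resolve to one language-agnostic entity ID. The QIDs are real
# Wikidata identifiers; the alias lookup is a naive stand-in for a learned
# retrieval model.

KNOWLEDGE_BASE = {
    "Q90": {"description": "capital of France",
            "aliases": {"Paris", "París", "パリ"}},
    "Q60": {"description": "most populous city in the United States",
            "aliases": {"New York City", "NYC", "Nueva York"}},
}

def link_mention(mention):
    """Return the entity ID whose aliases contain the mention, else None."""
    for entity_id, entry in KNOWLEDGE_BASE.items():
        if mention in entry["aliases"]:
            return entity_id
    return None

print(link_mention("París"))  # Q90: the Spanish mention lands on the same entry
print(link_mention("パリ"))    # Q90: as the Japanese one
```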

Multilingual entity linking involves linking a text snippet, in some context, to the corresponding entity in a language-agnostic knowledge base. Knowledge bases are essentially databases of information about entities: people, places, and things. In 2012, Google launched its own knowledge base, the Knowledge Graph, to enhance search results with hundreds of billions of facts gathered from sources including Wikipedia, Wikidata, and the CIA World Factbook. Microsoft, for its part, maintains a knowledge base with over 150,000 articles created by support professionals who have resolved issues for its customers.

Knowledge bases in multilingual entity linking may include textual information, like names and descriptions, about each entity in one or more languages. Crucially, though, the task makes no prior assumption about the relationship between the languages of the knowledge base and the language of the mention.

The Google researchers used what are called enhanced dual encoder retrieval models, with Wikidata serving as the knowledge base because it covers a large and diverse set of entities. Wikidata itself contains names and short descriptions, but through its close integration with all Wikipedia editions, it also connects entities to richer descriptions (and other features) drawn from the corresponding language-specific Wikipedia pages.
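A dual encoder maps mentions and entities into the same vector space, so that linking reduces to nearest-neighbor search. The sketch below shows that structure, assuming a toy hashed bag-of-words encoder in place of the trained neural encoders the paper actually uses; the entity descriptions are paraphrased for brevity.

```python
# Minimal dual encoder retrieval sketch. Assumption: a hashed bag-of-words
# encoder stands in for the paper's trained neural encoders. The structure
# is the point: mentions and entities share one vector space, and linking
# is nearest-neighbor search.
import numpy as np

DIM = 256

def encode(text):
    """Embed text as an L2-normalized hashed bag-of-words vector."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Entity side: embed every knowledge base entry once, offline.
entities = {
    "Q90": "Paris capital and largest city of France",
    "Q84": "London capital of England and the United Kingdom",
}
entity_ids = list(entities)
entity_matrix = np.stack([encode(desc) for desc in entities.values()])

# Mention side: embed the mention in its context, then retrieve the
# highest-scoring entity by dot product (cosine, since vectors are unit norm).
mention_vec = encode("she moved to Paris after university")
scores = entity_matrix @ mention_vec
print(entity_ids[int(np.argmax(scores))])  # Q90
```

Precomputing the entity-side embeddings offline is what makes this design scale to 20 million entities: at linking time, only the mention needs to be encoded.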


The researchers extracted a large-scale dataset of 684 million mentions in 104 languages linked to Wikidata entities, which they say is at least six times larger than the datasets used in prior English-only linking work. The coauthors also created a matching evaluation set, Mewsli-9, that spans a diverse range of languages and entities: 289,087 entity mentions appearing in 58,717 news articles from Wikinews. Notably, 11% of the 82,162 distinct target entities in Mewsli-9 have no English Wikipedia page, which means systems restricted to English Wikipedia entities could reach at most 89% recall on this set.

The researchers say the results show that this framing of entity linking better reflects the real-world challenges posed by rare entities and low-resource languages. "Operationalized through Wikipedia and WikiData, our experiments using enhanced dual encoder retrieval models and frequency-based evaluation provide compelling evidence that it is feasible to perform this task with a single model covering over 100 languages," they wrote. "Our automatically extracted Mewsli-9 dataset serves as a starting point for evaluating entity linking beyond the entrenched English benchmarks and under the expanded multilingual setting."
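The "frequency-based evaluation" the authors mention breaks results down by how often each gold entity was seen during training, which surfaces performance on rare and unseen ("tail") entities instead of hiding it in an aggregate score. Here is a small sketch of that idea; the bucket boundaries, entity IDs, and example predictions are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of frequency-based evaluation: recall@k is reported per bucket of
# how often the gold entity appeared in training data. Bucket boundaries,
# entity IDs, and the example predictions below are illustrative.
from collections import defaultdict

def bucket(count):
    if count == 0:
        return "unseen"
    return "rare (1-9)" if count < 10 else "frequent (10+)"

def recall_at_k_by_bucket(predictions, gold, train_counts, k=10):
    """predictions: mention ID -> ranked entity IDs; gold: mention ID -> entity ID."""
    hits, totals = defaultdict(int), defaultdict(int)
    for mention_id, gold_entity in gold.items():
        b = bucket(train_counts.get(gold_entity, 0))
        totals[b] += 1
        hits[b] += gold_entity in predictions[mention_id][:k]
    return {b: hits[b] / totals[b] for b in totals}

# The model retrieves the frequent entity but misses the unseen one, and the
# per-bucket breakdown makes that failure visible.
gold = {"m1": "Q90", "m2": "Q42"}
predictions = {"m1": ["Q90", "Q84"], "m2": ["Q60"]}
print(recall_at_k_by_bucket(predictions, gold, train_counts={"Q90": 500}))
# -> {'frequent (10+)': 1.0, 'unseen': 0.0}
```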

It's unclear, however, whether the researchers' model exhibits demographic bias. In a paper published earlier this year, Twitter researchers claimed to have found evidence of prejudice in popular named entity recognition models, particularly with respect to Black and other "non-white" names. For their part, the Google coauthors leave the door open to using non-expert human raters to improve the quality of the training dataset and to incorporating relational knowledge.

