In a post on its AI research blog, Microsoft today detailed a new spelling correction system, Speller100, that the company claims is one of the most comprehensive ever made in terms of linguistic coverage and accuracy. Comprising a number of AI models that collectively correct spelling in more than 100 languages, Speller100 now powers all spelling correction on Bing, which previously supported spell check for only about two dozen languages.
For languages with little web presence, it’s challenging to collect enough data to train a model. Moreover, models can’t rely solely on training data to learn the spelling of a language. At its core, spelling correction is about building an error model and a language model, and not all errors are the same. For example, non-word errors occur when a word isn’t in the vocabulary for a given language, while real-word errors occur when the word exists but doesn’t fit the larger context.
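The distinction between the two error types can be illustrated with a toy sketch. This is not Microsoft's method: the vocabulary, the bigram set, and the function name `classify_errors` are all made up for illustration, and a production system would use learned models rather than lookups.

```python
# Toy vocabulary and set of word pairs seen in "correct" text (both assumed).
VOCAB = {"i", "want", "a", "piece", "peace", "of", "cake"}
SEEN_BIGRAMS = {
    ("i", "want"), ("want", "a"), ("a", "piece"),
    ("piece", "of"), ("of", "cake"),
}


def classify_errors(tokens):
    """Return (token, kind) pairs, where kind is 'non-word' or 'real-word'."""
    findings = []
    for i, tok in enumerate(tokens):
        if tok not in VOCAB:
            # The word does not exist in the language's vocabulary.
            findings.append((tok, "non-word"))
            continue
        # Collect the bigrams this token forms with its neighbors.
        contexts = []
        if i > 0:
            contexts.append((tokens[i - 1], tok))
        if i < len(tokens) - 1:
            contexts.append((tok, tokens[i + 1]))
        # A valid word that fits none of its contexts is a real-word suspect.
        if contexts and not any(c in SEEN_BIGRAMS for c in contexts):
            findings.append((tok, "real-word"))
    return findings


print(classify_errors("i want a pice of cake".split()))
print(classify_errors("i want a peace of cake".split()))
```

In this sketch the vocabulary lookup stands in for detecting non-word errors, and the neighbor check stands in for the language model that decides whether a valid word fits its context.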
Speller100 is built around the concept of language families, or larger groups of languages based on similarities that multiple languages share. It also leverages zero-shot learning, a technique that allows a model to learn and correct spelling without additional language-specific labeled training data.
To scale Speller100 to over 100 languages, Microsoft says it developed a spelling correction pretraining approach that relies on noise functions, which take text extracted from web pages and generate errors like deletion, addition, rotation, and replacement. This eliminated the need for a large dataset of misspelled searches, enabling Speller100 to reach 50% correction recall for top candidates in languages for which zero training data existed. Deployed as-is on Bing, where about 15% of searches are misspelled, it would’ve fixed the misspellings in roughly 7.5% of all searches (half of that 15%).
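A minimal sketch of what such error-generating functions might look like follows. The function names, the Latin-only alphabet, and the reading of "rotation" as swapping two adjacent characters are assumptions made for illustration; the article does not describe Microsoft's actual implementation.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed; real systems are per-script


def delete_char(word, rng):
    """Deletion: drop one character (words shorter than 2 are left alone)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def add_char(word, rng):
    """Addition: insert one random character at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(ALPHABET) + word[i:]


def rotate_chars(word, rng):
    """Rotation (assumed meaning): swap two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def replace_char(word, rng):
    """Replacement: overwrite one character with a random one."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def make_training_pair(word, rng):
    """Return (noisy_input, clean_target) built from a clean word."""
    noise = rng.choice([delete_char, add_char, rotate_chars, replace_char])
    return noise(word, rng), word


rng = random.Random(0)
for clean in ["search", "spelling", "language"]:
    print(make_training_pair(clean, rng))
```

Each pair gives a model a corrupted input and its clean target, so training data can be synthesized from ordinary web text without a log of real misspelled queries.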
To improve performance even further, Microsoft leveraged the orthographic, morphological, and semantic similarity between languages in the same group to build a dozen or so language family-based models, which maximized the zero-shot benefit and kept the models compact enough for runtime. This made Speller100 well-suited to spelling correction for languages with relatively little training data, like Afrikaans and Luxembourgish.
Microsoft says that to date on Bing, Speller100 has reduced the number of pages with no results by up to 30% and the number of times users had to manually reformulate their query by 5%. It has also increased the rate at which users clicked on Bing’s spelling suggestion from 8% to 67%.
Microsoft says it plans to implement Speller100 in more of its products going forward.
“Spelling correction is the very first component in the Bing search stack because searching for the correct spelling of what users mean improves all downstream search components,” principal applied science manager Jingwen Lu, principal applied software engineering manager Jidong Long, and vice president Rangan Majumder said. “Our spelling correction technology powers several product experiences across Microsoft. Since it is important to us to provide all customers with access to accurate, state-of-the-art spelling correction, we are improving search so that it is inclusive of more languages from around the world with the help of large-scale AI.”