There’s been an explosion in recent years of natural language processing (NLP) datasets aimed at testing various AI capabilities. Many of these datasets have accompanying leaderboards, which provide a means of ranking and comparing models. But the adoption of leaderboards has thus far been limited to setups with automatic evaluation, like classification and knowledge retrieval. Open-ended natural language generation tasks such as translation, where there are often many correct outputs, still lack techniques that can reliably evaluate a model’s quality automatically.
To remedy this, researchers at the Allen Institute for Artificial Intelligence, the Hebrew University of Jerusalem, and the University of Washington created GENIE, a leaderboard for human-in-the-loop evaluation of text generation. GENIE posts model predictions to a crowdsourcing platform (Amazon Mechanical Turk), where human annotators evaluate them according to predefined, dataset-specific guidelines for fluency, correctness, conciseness, and more. In addition, GENIE incorporates automatic metrics for machine translation, question answering, summarization, and common-sense reasoning, including BLEU and ROUGE, to show how well they correlate with the human assessment scores.
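As a rough illustration of that last idea, the sketch below scores a handful of model predictions with BLEU and ROUGE-L and then measures how well those automatic metrics track human ratings using Spearman rank correlation. The data, field names, and scoring scale are hypothetical assumptions for the example, not GENIE’s actual schema or code.

```python
# Hypothetical sketch: correlate automatic metrics (BLEU, ROUGE-L) with
# crowdsourced human ratings. The records and field names are illustrative
# assumptions, not GENIE's actual data format.
import sacrebleu
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Each record: a model prediction, a reference text, and a human score
# (e.g., a mean annotator rating on a 1-5 fluency/correctness scale).
records = [
    {"prediction": "The cat sat on the mat.", "reference": "A cat was sitting on the mat.", "human_score": 4.5},
    {"prediction": "Cat mat sit.",            "reference": "A cat was sitting on the mat.", "human_score": 2.0},
    {"prediction": "The dog barked loudly.",  "reference": "The dog was barking loudly.",   "human_score": 4.0},
]

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

bleu_scores, rouge_scores, human_scores = [], [], []
for r in records:
    # Sentence-level BLEU against a single reference.
    bleu_scores.append(sacrebleu.sentence_bleu(r["prediction"], [r["reference"]]).score)
    # ROUGE-L F-measure between reference and prediction.
    rouge_scores.append(rouge.score(r["reference"], r["prediction"])["rougeL"].fmeasure)
    human_scores.append(r["human_score"])

# Rank correlation between each automatic metric and the human ratings.
for name, scores in [("BLEU", bleu_scores), ("ROUGE-L", rouge_scores)]:
    corr, p = spearmanr(scores, human_scores)
    print(f"{name} vs. human: Spearman rho = {corr:.2f} (p = {p:.2f})")
```

On a real leaderboard, such a correlation would be computed over many submissions and thousands of annotated outputs per task, rather than a toy list of three examples.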
As the researchers note, human-evaluation leaderboards raise a couple of novel challenges, first and foremost potentially high crowdsourcing fees. To avoid deterring submissions from researchers with limited resources, GENIE aims to keep submission costs around $100, with initial submissions to be paid by academic groups. In the future, the coauthors plan to explore other payment models including requesting payment from tech companies while subsidizing the cost for smaller organizations.
“Evaluating generated text is hard! No automated method so far is up to the challenge. Today we’re announcing GENIE 🧞‍♂️, a human-in-the-loop leaderboard for streamlining text evaluation. Learn more in today’s post on the AI2 Blog from @DanielKhashabi: https://t.co/I8v5egi9J7” — Allen Institute for AI (@allen_ai), January 19, 2021
To mitigate another potential issue — the reproducibility of human annotations over time across various annotators — the researchers use techniques including estimating annotator variance and spreading the annotations over several days. Experiments show that GENIE achieves “reliable scores” on the included tasks, they claim.
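The paper does not prescribe a single estimator here, but one simple way to gauge how reproducible an aggregate human score is involves checking how much it moves when the pool of annotators changes, for instance by bootstrapping over annotators. The sketch below is an illustrative assumption of that approach, not GENIE’s exact procedure, and the ratings are made up.

```python
# Hypothetical sketch: estimate how sensitive a model's aggregate human score
# is to the particular set of annotators, via a bootstrap over annotators.
# This is an illustrative approach, not the exact procedure used by GENIE.
import numpy as np

rng = np.random.default_rng(seed=0)

# scores_by_annotator[annotator_id] = ratings that annotator gave to the model's outputs.
scores_by_annotator = {
    "a1": [4, 5, 4, 3, 5],
    "a2": [3, 4, 4, 3, 4],
    "a3": [5, 5, 4, 4, 5],
    "a4": [2, 3, 3, 2, 3],
}

annotators = list(scores_by_annotator)
annotator_means = np.array([np.mean(scores_by_annotator[a]) for a in annotators])

# Between-annotator variance: how much individual annotators disagree on average.
print("Per-annotator means:", dict(zip(annotators, annotator_means.round(2))))
print("Between-annotator variance:", annotator_means.var(ddof=1).round(3))

# Bootstrap over annotators: resample annotators with replacement and recompute
# the model's overall score, yielding a confidence interval for reproducibility.
boot = []
for _ in range(10_000):
    sample = rng.choice(len(annotators), size=len(annotators), replace=True)
    boot.append(annotator_means[sample].mean())
low, high = np.percentile(boot, [2.5, 97.5])
print(f"Model score: {annotator_means.mean():.2f} (95% bootstrap CI: {low:.2f} to {high:.2f})")
```

A tight interval suggests the aggregate score would be similar with a different annotator pool; a wide one flags that disagreement between annotators, rather than model quality, may be driving the ranking.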
“[GENIE] standardizes high-quality human evaluation of generative tasks, which is currently done in a case-by-case manner with model developers using hard-to-compare approaches,” Daniel Khashabi, a lead developer on the GENIE project, explained in a Medium post. “It frees model developers from the burden of designing, building, and running crowdsourced human model evaluations. [It also] provides researchers interested in either human-computer interaction for human evaluation or in automatic metric creation with a central, updating hub of model submissions and associated human-annotated evaluations.”
The coauthors believe that the GENIE infrastructure, if widely adopted, could alleviate the evaluation burden for researchers while ensuring high-quality, standardized comparison against previous models. Moreover, they anticipate that GENIE will facilitate the study of human evaluation approaches, addressing challenges like annotator training, inter-annotator agreement, and reproducibility — all of which could be integrated into GENIE to compare against other evaluation metrics on past and future submissions.
“We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation,” the coauthors wrote in a paper describing their work. “This is a novel deviation from how text generation is currently evaluated, and we hope that GENIE contributes to further development of natural language generation technology.”