AI Weekly: The challenges of creating open source AI training datasets

February 20, 2021

In January, AI research lab OpenAI released DALL-E, a machine learning system capable of creating images to fit any text caption. Given a prompt, DALL-E generates images for a range of concepts, including cats, logos, and glasses.

The results are impressive, but training DALL-E required building a large-scale dataset that OpenAI has so far opted not to make public. Work is ongoing on an open source implementation, but according to Connor Leahy, one of the data scientists behind the effort, development has stalled because of the challenges of compiling a corpus that respects both moral and legal norms.

“There’s plenty of not-legal-to-scrape data floating around that isn’t [fair use] on platforms like social media, Instagram first and foremost,” Leahy, who’s a member of the volunteer AI research effort EleutherAI, told VentureBeat. “You could scrape that easily at large scale, but that would be against the terms of service, violate people’s consent, and probably scoop up illegal data both due to copyright and other reasons.”

Indeed, creating AI training datasets in a privacy-preserving, ethical way remains a major blocker for researchers in the AI community, particularly those who specialize in computer vision. In January 2019, IBM released a corpus designed to mitigate bias in facial recognition algorithms that contained nearly a million photos of people from Flickr. But neither the photographers nor the subjects of the photos were notified by IBM that their work would be included. Separately, an earlier version of ImageNet, a dataset used to train AI systems around the world, was found to contain photos of naked children, porn actresses, college parties, and more — all scraped from the web without those individuals’ consent.

“There are real harms that have emerged from casual repurposing, open-sourcing, collecting, and scraping of biometric data,” said Liz O’Sullivan, cofounder and technology director at the Surveillance Technology Oversight Project, a nonprofit organization litigating and advocating for privacy. “[They] put people of color and those with disabilities at risk of mistaken identity and police violence.”

Techniques that rely on synthetic data to train models might lessen the need to create potentially problematic datasets in the first place. According to Leahy, while there’s usually a minimum dataset size needed to achieve good performance on a task, it’s possible, to a degree, to “trade compute for data” in machine learning. In other words, simulation and synthetic data, like AI-generated photos of people, could take the place of real-world photos from the web.

“You can’t trade infinite compute for infinite data, but compute is more fungible than data,” Leahy said. “I do expect for niche tasks where data collection is really hard, or where compute is super plentiful, simulation to play an important role.”

O’Sullivan is more skeptical that synthetic data will generalize well from lab conditions to the real world, pointing to existing research on the topic. In a study last January, researchers at Arizona State University showed that when an AI system trained on a dataset of images of engineering professors was tasked with creating faces, 93% were male and 99% white. The system appeared to have amplified the dataset’s existing biases — 80% of the professors were male and 76% were white.
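
To put the amplification in concrete terms, the short Python sketch below compares each group's share of the training data with its share of the generated faces, using only the four percentages reported above. The function name `amplification` and the dictionary layout are illustrative assumptions, not part of the study.

```python
# Shares reported in the article; variable and function names are hypothetical.
training_share = {"male": 0.80, "white": 0.76}
generated_share = {"male": 0.93, "white": 0.99}


def amplification(train, generated):
    """Return the change in share, in percentage points, for each group."""
    return {group: round((generated[group] - train[group]) * 100, 1) for group in train}


gaps = amplification(training_share, generated_share)
print(gaps)
```

By this measure, the generated faces skew 13 percentage points more male and 23 points more white than the training set.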

On the other hand, startups like Hazy and Mostly AI say that they’ve developed methods for controlling the biases of data in ways that actually reduce harm. A recent study published by a group of Ph.D. candidates at Stanford claims the same — the coauthors say their technique allows them to weight certain features as more important in order to generate a diverse set of images for computer vision training.

Ultimately, even where synthetic data might come into play, O’Sullivan cautions that any open source dataset could put people in that set at greater risk. Piecing together and publishing a training dataset is a process that must be undertaken thoughtfully, she says — or not at all, where doing so might result in harm.

“There are significant worries about how this technology impacts democracy and our society at large,” O’Sullivan said.

For AI coverage, send news tips to Khari Johnson and Kyle Wiggers and AI editor Seth Colaner — and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thanks for reading,

Kyle Wiggers

AI Staff Writer

VentureBeat


