
OpenAI and Google reportedly used YouTube transcripts to train their AI models

April 8, 2024

Image: YouTube on iPhone. Credit: Maria Diaz/ZDNET

Training artificial intelligence models requires a lot of data to help them better understand the context of queries and ultimately provide better responses. In the constant search for more data, both OpenAI and Google have turned to using YouTube videos, created by others, to train their large language models (LLMs), The New York Times reported over the weekend, citing people who claim to have knowledge of the companies’ activities.

In 2021, OpenAI developed Whisper, a speech recognition tool, and used it to transcribe the audio from more than 1 million YouTube videos. The resulting text was then used to train GPT-4, according to the Times’ sources.

Google, meanwhile, also transcribed YouTube videos, according to the report. What’s more, the search giant changed its terms of service in 2023 to make it easier to sweep up public Google Docs, Google Maps restaurant reviews, and other publicly available content for use in its AI models, according to the Times.


It’s no secret that AI models require significant troves of data to perform well. More data, including text, audio, and video, gives models the ability to understand human context, human interaction, and other critical communication details that make them more effective.

However, there’s increasing tension between the companies developing those models and the content creators. What content, if any, should be permissible to use in training AI models? In a growing number of cases, news outlets, websites, and content creators themselves are calling on OpenAI, Google, Meta, and other tech companies to pay for access to their content before it can be used to train LLMs.

In some cases, model makers have complied and signed agreements with companies, including Reddit and Stack Overflow, to get access to user data. In other cases, not so much.

According to The New York Times’ report, for instance, OpenAI’s alleged transcription of more than 1 million YouTube videos may violate Google’s own terms of service for YouTube, which prohibit third-party applications from using YouTube videos for “independent” purposes. Additionally, the companies’ alleged transcription of videos may run afoul of copyright law, since creators who upload videos to YouTube retain the copyright to the content they create.

To be clear, the Times report cannot be independently verified. Also, neither Google nor OpenAI acknowledged that they scraped data illegally. We do know, however, that the companies are running out of ways to access more content. What’s worse, a Times source said that it’s possible tech companies will run out of content to ingest into their models by 2026.


What then? It’s entirely possible, and perhaps likely, that tech companies will move to sign licensing agreements with content creators, media outlets, and even musical artists to access their creations. It’s also possible they will further change their terms of service or, worse, find ways to skirt privacy laws to access data they currently can’t.

It’s clear that the amount of data companies like Meta, Google, and OpenAI will need in the coming years will only increase. It’s critical that as they access that data, they do so in a way that doesn’t harm the people who created the content in the first place.
