OpenAI transcribed over a million hours of YouTube videos to train its LLMs, Google engaged in same practice

April 8, 2024

A hot potato: One of the many controversies surrounding generative AI is the potential for copyright infringement in the data used to train large language models (LLMs). The topic is under the spotlight once again following a report that OpenAI transcribed over a million hours of YouTube videos to train GPT-4. Why didn’t YouTube owner Google object? Because it did the same thing.

In 2021, in need of more reputable English-language text, OpenAI researchers created a speech recognition tool called Whisper, writes The New York Times. It was designed to transcribe the audio from YouTube videos, giving the company a trove of data to train its LLMs.

OpenAI reportedly knew that scraping YouTube data was legally questionable but did it anyway, believing the practice would qualify as fair use. The Times writes that OpenAI president Greg Brockman was personally involved in collecting the videos that were transcribed.

One would imagine Google being less than happy about OpenAI’s actions, but objecting would have been hypocritical given that Google also transcribed YouTube videos for its AI models, potentially infringing creators’ copyrights.

YouTube CEO Neal Mohan said during an interview with Bloomberg last week that the platform’s terms of service do not permit unauthorized transcripts or downloading of video content. When asked about OpenAI’s transcribing, he said, “I have seen reports that it may or may not have been used. I have no information myself.”

Google spokesperson Matt Bryant repeated the ToS rules, adding that the company takes “technical and legal measures” to prevent this sort of unauthorized practice “when we have a clear legal or technical basis to do so.” Google said that its AI models “are trained on some YouTube content” that is allowed under agreements with creators.

The NY Times states that Google has expanded its terms of service, giving it more rights to use consumer data, such as publicly available Google Docs and restaurant reviews on Google Maps, for the company’s AI models. The revised policy was released on July 1, reportedly in the hope that the Independence Day weekend would act as a distraction.

Meta was also said to be considering shady methods of obtaining more data for its LLM training. The NY Times writes that the Facebook parent considered collecting copyrighted data from the internet, even if that meant facing lawsuits, as negotiations with license holders would take too long.

Thousands of organizations and individuals are complaining and filing lawsuits against large AI companies over the use of their content without payment or acknowledgment. The New York Times is suing OpenAI and Microsoft for using its copyrighted news articles. In February, OpenAI accused the publication of paying someone to “hack” its famous chatbot and other products to generate misleading evidence supporting these claims.

Masthead: Souvik Banerjee
