Report: Apple, Nvidia Trained AI Models on YouTube Captions Without Permission

by Improve the News Foundation

Updated Jul 18, 2024

Facts

An investigation has claimed that a dataset used to help train artificial intelligence (AI) models from companies such as Apple, Anthropic, and Nvidia contains subtitles from over 100K YouTube videos that were included without the consent of the content creators.¹
YouTube Subtitles, which is part of a large dataset known as The Pile, contains captions from over 173K videos that span 48K channels. Taking data from the platform without prior approval would violate YouTube guidelines.²
The dataset was first released in 2020. A Google spokesperson said that the company has taken action against "abusive, unauthorized scraping." Channels represented in the dataset include Harvard, MrBeast, and the BBC.³
A spokesperson for Anthropic confirmed that The Pile was used to train its AI assistant Claude, while saying that YouTube's terms only cover "direct use" of the platform. Apple, Nvidia, and Salesforce have previously described using The Pile for model training.¹
The use of copyrighted data for AI model training is contentious, with authors whose work is present in the Books3 dataset filing a lawsuit against AI makers last year. OpenAI chief technology officer Mira Murati stated in March that she "wasn’t sure" if the company's Sora video generator was trained on YouTube videos.⁴
The dataset reportedly includes profanity and slurs, as well as captions from videos that have since been deleted from the platform.¹

Sources: ¹Proof, ²News Nation Now, ³Futurism and ⁴Verge.

Narratives

Narrative A, as provided by The Conversation. In their quest to gobble up content to train their models, AI companies have run roughshod over the rights of any creator whose work is present on the internet. In many instances, copyrighted data present in a training set can be reproduced almost exactly by end users. These lucrative AI tools are built on the backs of uncompensated creators.
Narrative B, as provided by TidBits. The hysteria over data scraping for AI training has reached a fever pitch; suing over it is akin to an author suing a child for learning to read using one of their books. AI models do not actually copy content verbatim, but use it to adjust probability values to make human-seeming output. AI-generated material will complement, not replace, the work of humans.