The Atlantic examined a dataset used to train AI models at companies including Apple, Anthropic, and Nvidia, and found that film industry workers' concerns about the new technology are well-founded.
The dataset included material from 53,000 movies and 85,000 TV shows: notably, all films nominated for "Best Picture" from 1950 to 2016, about 600 episodes of "The Simpsons," 170 episodes of "Seinfeld," 45 episodes of "Twin Peaks," as well as all episodes of "Breaking Bad" and "The Sopranos." The dataset also contained "live" dialogue from broadcasts of the Golden Globes and the Oscars.
The Atlantic notes that the texts included in the dataset are not original scripts but subtitles sourced from OpenSubtitles.org. Users typically extract these from DVDs, Blu-rays, and streaming platforms using optical character recognition software, and then upload them to the site (which currently hosts over 9 million subtitle files in more than 100 languages and dialects).
Moreover, some companies acknowledge their use of the subtitles in their research papers: for instance, Anthropic trained the chatbot Claude on them, Meta used them for a group of large language models called Open Pre-trained Transformer (OPT), Apple applied them to LLMs developed for iPhones, and Nvidia used them for the NeMo Megatron LLM. Subtitles from OpenSubtitles.org have also been used extensively by other AI developers, including Salesforce, Bloomberg, EleutherAI, Databricks, and Cerebras.
Apple said in a statement that its LLMs are intended “solely for research purposes,” while Salesforce claimed that the dataset “has never been used to inform or enhance any of the company’s product offerings.” The other companies mentioned in the article either declined to comment or did not respond to inquiries.
The legality of using such data to train artificial intelligence remains unresolved, especially since the boom in text chatbots that followed the launch of ChatGPT. Corporate transparency is still quite low, and compelling companies to disclose their training data may require legal action. Yet, as the case of OpenAI showed, even that information can suddenly disappear.
It seems that Vince Gilligan, the writer of the drama "Breaking Bad," had a point when he called generative artificial intelligence a “highly complex and energy-intensive form of plagiarism” last year. How would he react knowing that the technology is already using dialogue he wrote?