OpenAI has transcribed over a million hours of YouTube videos for GPT-4 training

According to The New York Times, OpenAI developed its Whisper audio transcription model and used it to transcribe over a million hours of YouTube videos, obtaining high-quality material for training GPT-4.

The company reportedly knew that these actions were legally questionable and fell into a "gray area" of copyright, but it considered them fair use of the material. OpenAI President Greg Brockman personally took part in collecting the videos that were used.

OpenAI exhausted its reserves of useful data in 2021 and, after reviewing other resources, discussed transcribing YouTube videos, podcasts, and audiobooks. By that time, the company was training its models on data that included computer code from GitHub, chess move databases, and school assignment content from Quizlet.

OpenAI spokesperson Lindsay Held stated that the company curates "unique" datasets for each of its models to "help them understand the world" and maintain competitiveness in global research. The company uses "numerous sources, including public data and partnerships for non-public data," and is exploring the possibility of generating its own synthetic data.

Google representative Matt Bryant stated that the company had "seen unconfirmed reports" about OpenAI's activities, adding that "both our robots.txt files and Terms of Service prohibit unauthorized copying or downloading of YouTube content."

YouTube CEO Neal Mohan recently stated that using the platform's data to train OpenAI's model violates its terms of use. Google says it takes "technical and legal measures" to prevent such unauthorized use, "if we have a clear legal or technical basis for it."

According to Times sources, Google also collected transcriptions from YouTube. Matt Bryant said that the company trained its models on "some YouTube content in accordance with our agreements with YouTube creators."

Meta also faced limits on access to good training data, and its AI team discussed using copyrighted works without permission to catch up with OpenAI. After going through almost every available English-language book, essay, poem, and news article on the internet, the company considered steps such as paying for book licenses or even buying a major publisher outright. It was also limited in how it could use user data because of privacy changes it made after the Cambridge Analytica scandal.

Source: The Verge