The ongoing legal debate surrounding the use of copyrighted materials to train AI systems has taken a new turn as OpenAI faces another lawsuit. This time, authors have sued the company, claiming that their novels’ copyrights have been violated by OpenAI’s AI chatbot, ChatGPT.
The proposed class action, filed in federal court in San Francisco, asserts that OpenAI collected copyrighted works without obtaining consent, providing credit, or offering compensation. The authors argue that OpenAI illegally downloaded copies of their novels to train ChatGPT, and they point to the system’s ability to generate summaries of their books on request as evidence of copyright infringement.
According to the lawsuit, the software programs, known as large language models, that power ChatGPT are considered derivative works, infringing upon the exclusive rights of the authors under the Copyright Act. The authors contend that OpenAI’s AI system relies heavily on the information extracted from their copyrighted material and cannot function without it.
The authors specifically object to OpenAI’s downloading of hundreds of thousands of books to train its AI system. OpenAI had previously disclosed that it trained GPT-1, its first large language model, on a collection of over 7,000 novels from BookCorpus, a dataset assembled by a team of AI researchers. The complaint alleges that these novels, which were hosted on the website Smashwords.com, were largely under copyright and were copied into the dataset without authorization, credit, or compensation to the authors.
The lawsuit further claims that later versions of OpenAI’s large language models, including GPT-3, were trained on larger quantities of copyright-protected works obtained from shadow library websites such as Library Genesis, Z-Library, Sci-Hub, and Bibliotik. These shadow libraries, considered illegal sources, have attracted the attention of the AI-training community because of their vast collections of books. The authors’ attorney, Joseph Saveri, who also represents programmers in a separate class action against OpenAI and Microsoft, cites an AI training dataset published by EleutherAI in December 2020 that included nearly 200,000 books derived from the Bibliotik collection.
OpenAI has stopped disclosing the sources of its training data, citing competitive and safety concerns surrounding large-scale models like GPT-4.
The lawsuit, which seeks to represent a nationwide class of hundreds of thousands of authors, was brought by Paul Tremblay and Mona Awad. Tremblay is known for his novel “The Cabin at the End of the World,” which M. Night Shyamalan adapted into the film “Knock at the Cabin.” The complaint alleges direct copyright infringement, vicarious copyright infringement, violations of the Digital Millennium Copyright Act, unjust enrichment, negligence, and other claims against OpenAI.