ChatGPT Violated Copyright Laws to 'Profit Richly' From Authors' Works: Complaint
Two authors’ copyrighted materials were ingested and used to train ChatGPT without their consent, alleged a class action Wednesday (4:23-cv-03223) against OpenAI in U.S. District Court for the Northern District of California in San Francisco.
Massachusetts residents Paul Tremblay and Mona Awad allege OpenAI and various financing funds benefit commercially “and profit richly” from the use of their and class members’ copyrighted materials without consent. When the ChatGPT chatbot is prompted, it generates summaries of plaintiffs’ copyrighted works, “something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works,” said the complaint.
ChatGPT’s large language models (LLMs), GPT-3.5 and GPT-4, are trained to emit natural language by copying “massive” amounts of text and extracting expressive information from it to create a “training dataset,” said the complaint. Once an LLM has copied and ingested the text in its training dataset, it's able to emit “convincingly naturalistic text outputs in response to user prompts.” A large language model’s output is “therefore entirely and uniquely reliant on the material in its training dataset,” so every time it assembles a text output, “the model relies on the information it extracted from its training dataset.”
ChatGPT’s output, like that of other LLMs, relies on the data on which the model is trained to generate new content, said the complaint. For example, if an LLM is prompted to generate writing in the style of a certain author, it would generate content based on patterns and connections it learned from analyzing that author’s work within its training data.
On information and belief, ChatGPT is able to accurately summarize a certain copyrighted book “because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data,” alleged the complaint. When the chatbot was prompted to summarize books written by plaintiffs, “it generated very accurate summaries.” It also got “some details wrong” because an LLM “mixes together expressive material derived from many sources,” it said. “The rest of the summaries are accurate,” indicating ChatGPT “retains knowledge of particular works” in the training dataset “and is able to output similar textual content,” said the complaint. “At no point did ChatGPT reproduce any of the copyright management information Plaintiffs included with their published works,” it said.
Plaintiffs assert for themselves and the class claims of direct and vicarious copyright infringement, violation of the Digital Millennium Copyright Act (DMCA), unfair competition, negligence and unjust enrichment. They seek permanent injunctive relief, including changes to ChatGPT to ensure compliance with the DMCA; statutory and other damages; pre- and post-judgment interest; attorneys’ fees and legal costs; and costs and expenses of a court-approved notice program for class notification.