'Hoovering Up Content'

Google Harvests Millions of Users' Personal Data to Train AI Products, Says Class Action

July 13, 2023 by Rebecca Day|Top News

Google’s revised privacy policy “purports to give it ‘permission’ to take anything shared online” to train and improve its artificial intelligence (AI) products, including personal and copyrighted information, said a class action Tuesday (docket 3:23-cv-03440) in U.S. District Court for Northern California in San Francisco.

The privacy lawsuit was brought by plaintiffs J.L., a New York Times author and investigative journalist residing in Texas; C.B., an actor and professor living in California; K.S., a 6-year-old resident of Florida with a Gmail account; Gmail and Google Bard user P.M., of California; N.G., a California resident and Google user; Florida resident and Google user R.F.; California resident J.D., a Google Hangouts user; and 13-year-old G.R., a Gmail and Google Hangouts user. The plaintiffs didn't consent to Google's use of their private information, said the complaint.

Google acquired DeepMind for over $500 million in 2014 and used its large language model (LLM) technology to develop its Bard conversational AI chat service, said the complaint. The company formed Google AI in 2017 and has released numerous AI products to the market for commercial and personal use, said the complaint. Its “Transformer” neural network underpins LLMs that are the basis for numerous AI chatbots, it said.

Google released Bard publicly in May in 180 countries. “What AI enthusiasts failed to grant equal attention to was the cost to humanity associated with the rapid, rampant, unregulated proliferation of the AI products,” said the complaint. Google’s AI products were all built using “private, personal, and/or copyrighted materials without proper consent or fair compensation,” it said; Bard is the most prominent and publicly accessible of them. Due to its “vast training data,” Bard can generate human-like answers and has coding capabilities and “advanced math and reasoning skills.” It can replicate “and mimic all the artists, authors, and creators on whose content it was trained in order to generate ‘art,’” the complaint said.

To launch its AI products, Google secretly harvested “millions” of consumers’ personal data -- “at the expense of privacy, security and ethics” -- and without their knowledge or consent, said the complaint. The LLMs powering Google’s AI products depend on consuming “huge amounts of data” for training, it said. Conversational data between humans is most valuable to help the product develop human-like communication capabilities; creative and expressive works are also valuable for AI products to learn how to “create” art, it said. “The only reason Defendants’ Products exist is because all this personal information was used to train the LLMs,” it said.

To get user data, Google used bots to scrape at least 1.56 trillion words of “public dialog data and other public web documents,” including personal information, without giving notice, seeking consent or paying for it, said the complaint. Google and DeepMind “did so in secret and without registering as a data broker as required under applicable law,” it said. “The breadth of Google’s data collection without permission impacts essentially every internet user ever, raising serious legal, moral, and ethical questions,” it said.

Regulators and courts worldwide are seeking to crack down on AI companies “hoovering up content without consent or notice,” but Google’s response has been to “keep their training datasets largely secret,” said the complaint: “Google has not permitted any regulatory or other audit access.” The company’s C-4 dataset is taken from the Common Crawl dataset, “petabytes” of data collected over 12 years and owned by a nonprofit organization, it said.

The Common Crawl “was never intended to be taken en masse, and turned into an AI product for commercial gain,” the complaint said. Upon information and belief, the 501(c)(3) overseeing the Common Crawl “did not consent to this mass misappropriation and data laundering of personal data,” nor did it obtain consent of users whose personal data was scraped, it said. Google continued to feed the AI products by “harnessing data gleaned from various … Google services,” including Gmail and Search, “a pervasive and unconscionable invasion of users’ personal spheres, exploiting the contents of private communications to feed their AI’s voracious appetite for personal information,” it said.

Plaintiffs charge Google and DeepMind with violating California’s Unfair Competition Law, plus negligence, invasion of privacy, intrusion upon seclusion, larceny/receipt of stolen property, conversion, unjust enrichment, direct and vicarious copyright infringement and violation of the Digital Millennium Copyright Act.

They seek a temporary freeze on Google’s development and use of AI products until establishment of an AI council responsible for approving use of products. They also seek transparency protocols, the ability for consumers to opt out of Google's data collection, technological safety measures “that will prevent the technology from surpassing human intelligence and harming others,” regular reviews, a fund to compensate class members for “past and ongoing misconduct,” and confirmation that defendants purged class members’ personal information. They seek actual, statutory and monetary damages, restitution and disgorgement.

Google has been “clear for years that we use data from public sources -- like information published to the open web and public datasets -- to train the AI models behind services like Google Translate, responsibly and in line with our AI Principles,” emailed General Counsel Halimah DeLaine Prado Wednesday. “American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims.”