This Is How Meta AI Staffers Deemed More Than 7 Million Books to Have No “Economic Value”


Last month, I prompted Meta AI with a question: “Is it okay to use pirated books to train AI?” The machine’s response was unequivocal. “Using pirated books to train AI is not okay,” it admonished. “In fact, it’s a serious violation of copyright laws and can have significant consequences for authors and publishers. Recently, it’s been reported that Meta used millions of pirated books to develop its AI programs, which has sparked widespread outrage and condemnation from the writing community.”

A few days after my original query, for reasons best known to itself, the algorithm lawyered up. In response to the same prompt, it hedged, “Using pirated books to train AI is a complex issue, and the answer isn’t straightforward. While some companies, like Meta, argue that their use of copyrighted material falls under fair use, others claim it infringes on copyright laws.” (In a 2024 statement, Meta describes such varying responses as “hallucinations,” a challenge for all generative AI systems.)

Lawyers for Meta are indeed invoking that very “fair use” defense in a copyright suit that’s been wending its way through the Northern District of California legal system for nearly two years. Richard Kadrey et al. v. Meta Platforms—and the reams of confidential Meta communications, newly in the public record as exhibits for the plaintiffs—offers an unprecedented look at the internal maneuverings behind the company’s decision to train its model on a database containing more than 7 million pirated books.

Just last week, plaintiffs—including Pulitzer Prize–winners Andrew Sean Greer and Junot Díaz, and comedian Sarah Silverman—filed a summary judgment motion that reads, “It is now undisputed that Meta torrented tens of millions of pirated books and other copyrighted works, including over 650 copies of Plaintiffs’ Books, for free and without consent from the rightsholders because it did not want to pay for them.” The plaintiffs, led by Richard Kadrey, the bestselling author of, among other books, the Sandman Slim series, claim that Meta’s “unlawful conduct,” used in service of training its large language model (LLM), infringed on their work. In its own motion, filed last month, Meta claims, as it has since its first motion to dismiss filed in September of 2023, that its Llama (large language model Meta AI) project is “highly transformative” and therefore fair use. (Asked for comment, a Meta spokesperson provided a statement that said, in part, that fair use of copyrighted materials is vital to the development of the company’s open-source AI models. “We disagree with Plaintiffs’ assertions, and the full record tells a different story.” An amicus brief filed last week by the Association of American Publishers on behalf of the plaintiffs argues against this claim: “There is nothing transformative about the systematic copying and encoding of textual works, word by word, into an LLM. It does not involve criticism or commentary, provision of a search or indexing utility, software interoperability, or any other purpose recognized as transformative under fair use precedents.”)

The lawsuit is one of more than 16 copyright cases concerning generative AI tools and the multibillion-dollar entities that create them currently rippling through the US court system, ranging from musicians suing Anthropic for using lyrics to train its AI, to visual artists suing Stability AI, to The New York Times suing Microsoft, to Authors Guild v. OpenAI, which is being heard in the Southern District of New York and is expected to go to summary judgment this fall. (Condé Nast, Vanity Fair’s parent company, is a plaintiff in one class action lawsuit against the enterprise AI platform Cohere.) The cases raise existential questions about art and literature—their inherent worth and what it means to commodify them—and arrive at a time when generative AI tools are making technical strides.

Kadrey et al. has attracted particular attention. One of Meta’s most prominent lawyers, Mark Lemley, quit the case earlier this year—not because he doesn’t believe in its merit, but because of what he described in a LinkedIn post as the company and its CEO Mark Zuckerberg’s “descent into toxic masculinity and Neo-Nazi madness.” Then, last month, Meta tried to block the promotion of a tell-all memoir by a former employee, which did not further endear the corporation to the literary community. Perhaps most importantly, the et al. plaintiffs are a cadre of big names—besides Greer, Silverman, and Díaz, they include the satirist Matthew Klam, and the National Book Award–winners Ta-Nehisi Coates and Jacqueline Woodson. (Coates and Woodson are VF contributors.)

A court case, like a work of literature, relies on a good story told persuasively. An interesting wrinkle in this case is that part of the story Meta needs to tell is how little individual books and authors meant in the creation of Llama. (“Is it—do you pronounce it ‘Llama?’” the judge wondered, early on.) Accordingly, a noteworthy argument from the defense was revealed in a court filing last week: “There is no allegation or evidence that the copies Meta made were used for reading Plaintiffs books by Meta employees or anyone else.”


