A federal judge ruled late Monday that Anthropic, an artificial intelligence company, did not break the law when it trained its chatbot Claude on copyrighted books.
But the company will still face a trial over how it first got millions of books — by downloading them from “shadow libraries” like Library Genesis.
Normally, you cannot use copyrighted works without the creator’s permission. The “fair use” doctrine is an exception for special uses of a work, like transforming it significantly, though it’s a notoriously fact-specific area of law. For instance, an artist won a fair use case in 2006 for using another artist’s photo in a collage because he created something entirely different with it.
Chatbots like Claude and ChatGPT are trained on massive quantities of written material. They use it to learn how human language works. When users ask them questions, they predict the most likely next word in their answers.
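The next-word prediction described above can be sketched with a toy example. This is illustrative only: real language models score enormous vocabularies with a neural network, while the probability table below is made up for demonstration.

```python
# Toy illustration of next-word prediction: given the last two words,
# pick the most likely continuation from a (made-up) probability table.
toy_probs = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.3, "meowed": 0.1},
    ("cat", "sat"): {"on": 0.8, "down": 0.2},
    ("sat", "on"): {"the": 0.9, "a": 0.1},
}

def predict_next(prev_two):
    """Return the highest-probability next word, or None if the context is unseen."""
    options = toy_probs.get(prev_two)
    if not options:
        return None
    return max(options, key=options.get)

# Generate word by word until the model has no prediction.
words = ["the", "cat"]
while True:
    nxt = predict_next((words[-2], words[-1]))
    if nxt is None:
        break
    words.append(nxt)

print(" ".join(words))  # the cat sat on the
```

A real model does the same loop, but computes the probabilities from patterns learned across its training text rather than looking them up in a fixed table.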
The use of books, which authors often spend years writing, to train AI has been a subject of significant controversy in the writing and publishing communities.
In the ruling, U.S. District Judge William Alsup said that using the books counted as fair use. He wrote that it was “quintessentially transformative.”
“Like any reader aspiring to be a writer, Anthropic’s [AI large language models] trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different,” Alsup wrote.
Importantly, his ruling applied only to the particular way Anthropic went about using the books after it was criticized for downloading them. The company originally pirated millions of books online. Later, it started buying physical copies of some books, ripping off the covers and scanning the pages to make digital copies, according to the ruling. This second method was the one ruled legal.
“Authors’ complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works,” Alsup wrote.
Anthropic, which was sued by a group of authors, will still go to trial for downloading the books illegally.
“That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages,” the ruling says.
In their lawsuit, the authors said Anthropic “seeks to profit from strip-mining the human expression and ingenuity behind each one of those works.”
The case sets an important precedent for other large language models, including popular ones like ChatGPT, and how they’re allowed to use centuries of human-created works.
According to The Associated Press, Anthropic said Tuesday it was pleased that the judge viewed AI training as transformative and consistent with “copyright’s purpose in enabling creativity and fostering scientific progress.” Its statement did not address the piracy claims.
The authors’ attorneys did not comment to the AP.