A landmark ruling in Bartz v. Anthropic: according to U.S. District Judge William Alsup, training AI systems on copyrighted works qualifies as fair use.
In August 2024, American authors Andrea Bartz, Charles Graeber and Kirk Wallace Johnson filed a lawsuit against Anthropic, the company behind the AI chatbot Claude. According to the authors, Anthropic infringed their copyrights by including copies of their works in a general data library. This library is the dataset from which the company draws the texts it uses to train its Large Language Models (LLMs).
Anthropic commissioned its then Head of Partnerships to collect "all the books in the world" to compile the dataset. The library consists of texts from both legally purchased and illegally downloaded copies of books. Anthropic purchased millions of physical, mostly secondhand, books, which it manually scanned and converted to PDF files. The company also downloaded millions of copies of books from the websites Books3, Library Genesis and Pirate Library Mirror.
Anthropic then selected which texts from this dataset were most appropriate for training specific LLMs. These texts were combined into subsets. The texts in these subsets were tokenized and then used to train its LLMs.
According to Anthropic, the copying of the books, including the illegal copying, was justified because it was necessary for training its LLMs.
Whether Anthropic was allowed to copy the books is assessed under the U.S. fair use exception. In applying it, the court looks at several factors, including the purpose and character of the use, the nature of the protected work, and the amount and substantiality of the material used.
Judge Alsup's ruling is clear. Training AI using texts from books is permissible under the U.S. fair use exception.
Alsup made a more nuanced judgment about the compilation of the general library. Anthropic acknowledged that it may have wanted to use the library for purposes other than training LLMs. However, for the legally purchased books, even this use falls under the fair use exception. The decisive factor here is that the physical books were discarded after scanning and the digital versions were not further distributed, so the digital files simply replaced the physical copies. This use thus qualifies as transformative use, that is, use that serves a new purpose or has a new meaning.
Alsup reached a different conclusion on the illegal copies. Anthropic downloaded more than seven million books without paying for them. According to Alsup, there is no justification for illegally downloading books that can also be purchased legally or obtained in other permissible ways. Moreover, Anthropic did not keep these illegal copies solely for training LLMs. As mentioned above, the copies were also kept in the library for other possible purposes.
The fair use exception has also been invoked as a defense against allegations of copyright infringement in other pending proceedings. It remains to be seen whether this ruling will guide the judges in those cases.