What Is the Anthropic Lawsuit About?
A federal judge in California has ruled that a class action lawsuit against Anthropic, the Amazon- and Alphabet-backed company behind Claude AI, can move forward. The lawsuit claims Anthropic downloaded millions of books from pirate websites and used them to train its AI.
The authors leading the lawsuit, Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, describe it as “Napster-style downloading.” They accuse Anthropic of copying as many as 7 million books from pirate databases like Library Genesis (LibGen) and Pirate Library Mirror (PiLiMi). According to court records, Anthropic did not just download the books: it stored them, catalogued them, and selected some of them for AI training, all without paying copyright holders or asking for permission.
The court’s ruling is very specific. Training on books that were legally purchased, the court found, can qualify as fair use. Downloading pirated copies does not. That’s the difference.
If the authors win this case, Anthropic could be on the hook for billions of dollars in damages, because each infringed book can qualify for its own statutory penalty.
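To put numbers on that: US statutory damages range from $750 to $30,000 per infringed work, rising to $150,000 per work for willful infringement. Even the $750 minimum, applied across 7 million books, comes to more than $5 billion.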
The Bigger Question: How Do We Know Which Books Were Actually Used?
While I think the ruling is fair, here’s my question: How will anyone know for sure which books Anthropic actually used to train Claude AI?
Think about it. Just because a book is listed on LibGen or PiLiMi does not mean it was downloaded. And just because Anthropic downloaded a file does not mean it was selected for model training. AI companies work with huge datasets, and at some point they filter them, keeping whatever they believe will make the model smarter.
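To make that concrete, here is a minimal sketch of what metadata-based filtering could look like. The field names and thresholds are my own assumptions for illustration; nothing here comes from the court record.

```python
# Hypothetical illustration only: the fields and cutoffs are invented,
# not taken from any actual training pipeline.

def select_for_training(catalog):
    """Filter a raw book catalog down to a training subset."""
    selected = []
    seen_isbns = set()
    for book in catalog:
        if book["isbn"] in seen_isbns:    # drop duplicate copies of one edition
            continue
        if book["language"] != "en":      # example quality gate: language
            continue
        if book["word_count"] < 10_000:   # example quality gate: length
            continue
        seen_isbns.add(book["isbn"])
        selected.append(book)
    return selected

catalog = [
    {"isbn": "111", "language": "en", "word_count": 95_000},
    {"isbn": "111", "language": "en", "word_count": 95_000},  # duplicate file
    {"isbn": "222", "language": "de", "word_count": 80_000},
]
print(len(select_for_training(catalog)))  # 1: duplicates and non-matches dropped
```

The point of the sketch is the gap it creates: every book that fails a filter was downloaded but never trained on, and only internal records show which side of the filter any given title landed on.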
What about the books they downloaded but didn’t use? Should that be part of the case too, or is that a different lawsuit? From a legal standpoint, this lawsuit focuses on books that were both obtained and used in training. There’s a gray area about storage, possession, and potential use. Copyright law may eventually need to address all of that.
Can Anthropic Be Trusted to Tell the Truth About Its Data?
Let’s say I gave you a list of every book on a pirate website. Then I handed you a list of every book Anthropic allegedly downloaded. Now compare both of those to the books Anthropic actually used to train Claude AI. How do you sort that out?
The irony is, AI might be the only way to handle it. The scale of the data is massive. Millions of books, billions of words. Anthropic allegedly created its own searchable digital library. That means they catalogued each book they acquired, with metadata like ISBNs, titles, and authors. Their engineers reviewed the metadata to decide which books were “best for training.” They even kept notes on the process.
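Mechanically, the comparison is simple set algebra over identifiers such as ISBNs; the hard part is trusting the inputs. A toy sketch, with invented data:

```python
# Toy data: the ISBNs are made up, only the set relationships matter.
site_catalog  = {"isbn-A", "isbn-B", "isbn-C", "isbn-D"}  # listed on the pirate site
downloaded    = {"isbn-A", "isbn-B", "isbn-C"}            # allegedly downloaded
used_to_train = {"isbn-A", "isbn-B"}                      # allegedly used in training

print(site_catalog - downloaded)   # listed but never downloaded: {'isbn-D'}
print(downloaded - used_to_train)  # downloaded but unused, the gray area: {'isbn-C'}
print(downloaded & used_to_train)  # obtained AND used, the core of the case
```

Every one of those sets is only as accurate as the records Anthropic produces, which is exactly the problem the next deadline runs into.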
The court ordered Anthropic to help produce a list of the affected books by August 1. The authors have to file their detailed list of infringed works by September 1. That list will depend on Anthropic disclosing internal records. It is not an independent audit. Without a cryptographic “bill of materials” for AI training data, which does not exist yet, outside parties have to trust the company’s records. That’s part of the problem.
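As the text notes, no such standard exists. But the core mechanism would not be exotic. Here is a minimal sketch of a hash-based training-data manifest; the directory name is hypothetical, and a real version would need signing, timestamps, and third-party escrow.

```python
import hashlib
import json
from pathlib import Path

def build_manifest(data_dir: str) -> list[dict]:
    """Record a SHA-256 fingerprint for every file in a training corpus."""
    manifest = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest.append({"file": str(path), "sha256": digest})
    return manifest

# Published before training, a manifest like this would let outside parties
# later check whether a specific book's bytes were part of the corpus.
print(json.dumps(build_manifest("training_corpus"), indent=2))
```

Until something like this is standard practice, "what was in the training set" remains a question only the defendant can answer.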
Could Anthropic remove a million books from the list to reduce the damages? In theory, yes. Who would know? This is where discovery, subpoenas, and forensic analysis come in. There is no perfect system. We are asking a company that allegedly pirated books to police itself during the lawsuit.
Why This Case Matters Beyond Books
This lawsuit is not just about books. It is about copyright law struggling to catch up with AI.
Think about website owners. AI systems scrape their pages and summarize the content for users, which directly interferes with how websites earn revenue: if the AI answers a user’s question, the user may never visit the site that provided the original information. That should matter legally.
Fair use analysis weighs, among other factors, whether the new use substitutes for the original in its market. AI summarizing an article might fall into a gray area. But AI crawling a website, copying the content, and repackaging it into an answer, without driving traffic back to the source, starts to look like direct competition. That is not fair use. That is displacement.
Let’s be honest: AI did not invent this behavior. Years ago, you could use software to download an entire website for offline viewing. The difference? That software still displayed the ads on the pages. AI removes the middle step. The content is repackaged without credit, compensation, or traffic.
Creative Commons, Free Content, and Consent
Another issue this lawsuit raises is consent. Some books are released under Creative Commons licenses; some are given away for free. Neither hands corporations a blanket right to use them for AI training: Creative Commons licenses carry conditions such as attribution, share-alike, or non-commercial use, and a free download is still a licensed work. “Free to read” is not the same as “free to train on.” Copyright law and licensing still apply.
The Future: AI Isn’t Going Anywhere, But Laws Need to Catch Up
AI is here to stay. That does not mean companies get to take whatever data they want and call it innovation.
This lawsuit will help set a precedent for how AI companies use copyrighted material. The court is sending a message: buying the book is one thing, pirating it is another. That principle should extend to the work of website owners, news outlets, forums, and anyone else creating content online.
Companies like Anthropic need to start thinking about licenses, not loopholes. AI can’t be allowed to disrupt the entire content ecosystem without any accountability.
This case matters because it asks the right question: where is the line between innovation and exploitation?
If AI is trained on pirated books, that is not innovation. It’s theft. The bigger challenge is proving exactly what was used. AI training data is complex, massive, and mostly private. The court is trying to create a system where authors can defend their rights anyway.
The outcome of this lawsuit will shape how AI companies handle content, not just for books but for everything online. Website owners, artists, writers, and anyone producing creative work should pay attention. The law is being written in real time, and this case is part of that process.