Saturday, March 22, 2025

Training LLaMA using LibGen: A Hack, a Theft, or Just Fair Use?

Imagine you're building a Large Language Model. You need data—lots of it. If you could find high-quality text data that is vetted, truthful, and useful, that would be... great! So, naturally, you head online and find a treasure trove of books: neatly indexed, conveniently downloadable, and completely free. The catch? You're looking at LibGen—one of the most infamous pirate libraries on the internet.

This isn't hypothetical. Recently, Meta made headlines for allegedly training their flagship LLM, LLaMA, on content from LibGen. But—can you even do that?

Let’s unpack the legal mess behind the scenes, step-by-step.

First: Is Using LibGen Even Legal?

Short answer: Absolutely not. Downloading copyrighted books from LibGen is textbook piracy. Think of it like grabbing a handful of snacks at the supermarket without paying—it's convenient but totally illegal.

Second: Does Training an AI Change the Equation?

Here’s where it gets fuzzy. In the U.S., you can claim "fair use"—the idea that some copying is permissible if you're transforming the original work into something new and valuable. (We covered this in an earlier blog post.)

Remember the Google Books case? Google scanned millions of books without permission. Authors sued, but the courts sided with Google, citing fair use. The logic was that indexing books for search purposes created something valuable without substituting for the originals.

Consider another example: the Authors Guild v. HathiTrust case. Libraries scanned books to help visually impaired readers and enable text search. Courts also ruled this fair use, emphasizing the transformative nature and public benefit. However, both these cases involved legally acquired copies—not pirated ones.

So, could Meta’s training of LLaMA fall under the same umbrella? Possibly, yes, under the same fair use theory. There is a subtle difference, though: Google used legally accessible copies (from libraries), while Meta reportedly took a different route. Legally speaking, when we talk about copyright and fair use in the U.S., the source of the copyrighted data does not directly affect the outcome. (Although it can affect the attitude of a jury or a judge if they believe that the defendant acted in bad faith.)

Third: What About the EU?

If you thought U.S. law was tricky, the EU adds another layer of complexity. There is no broad "fair use" doctrine there, but the EU has introduced exceptions specifically for Text and Data Mining (TDM). Good news for researchers and AI developers, right? Except there's a big "BUT": EU law explicitly requires lawful access. Pirate libraries like LibGen don't qualify.

In other words, in Europe, using LibGen isn't just risky—it's explicitly illegal.

Fourth: Is There a Legal Defense for Using LibGen?

There is a very reasonable argument that training an AI is transformative—after all, an LLM doesn’t copy books; it learns from them. Consider also the LAION case from Germany. LAION, a nonprofit, scraped images from stock photo sites to train AI models. The court allowed it, crucially because LAION had legitimate access and was a non-commercial entity. The outcome might differ sharply for a commercial giant sourcing pirated content.

There is also a counterargument from authors and publishers: LLMs themselves create a market for licensing content, as providers compete to secure exclusive, licensed material as a differentiating factor, much the way streaming companies compete for exclusive access to films, shows, and TV series. It is a bit of a circular argument (without free training, can LLMs get good enough to create a licensing market in the first place?), but we will have to wait for the courts to decide.

Fifth: What's the Risk Here?

For researchers at universities or small startups, casually using LibGen might seem harmless. The risks escalate quickly when you're a global company: training on "presumed free" copyrighted data is one thing; "willful infringement"—the legal term for knowingly breaking copyright law—is quite another.

The fact that LLaMA is open source is a significant factor, as it weakens the profit motive, but when the trainer is a trillion-dollar company, the courts may see things differently. We will see...

After all, while pirates make great movie characters, they're generally less popular in courtrooms.

Monday, February 24, 2025

Copyright, Fair Use, and AI Training

[We asked the o1-pro model to give us a detailed analysis of the legal landscape around copyright and the use of copyrighted materials to train LLMs. The full discussion is available here. Below is a quick attempt to summarize the (much) longer report by o1-pro.]

What is Copyright? (And Why Should You Care?)

Imagine you spend months writing a book, composing a song, or designing a killer app—wouldn't you want some protection to stop someone from copying it and making money off your hard work? That’s where copyright steps in! It grants the copyright holder exclusive rights to reproduce, distribute, and display their work. However, copyright isn’t an all-powerful lock—there are important exceptions, like fair use, that allow for some unlicensed use, especially when it benefits society.

Copyright laws are all about balance. Too much restriction, and we block innovation and education. Too little, and creators lose their incentive to make new things. Governments step in to help find that sweet spot—protecting creators' rights while making sure knowledge, art, and innovation stay accessible.

The Fair Use Doctrine: When Borrowing is (Sometimes) Okay

Fair use is like the ultimate legal “it depends” clause in copyright law. It allows limited use of copyrighted materials without permission—whether for education, commentary, parody, or research. But how do you know if something qualifies as fair use? Courts consider these four big factors:

  1. Purpose and Character of the Use – Is the use transformative? Does it add new meaning or context? And is it for commercial gain or educational purposes?
  2. Nature of the Copyrighted Work – Is the original work factual (easier to use under fair use) or highly creative (harder to justify copying)?
  3. Amount and Substantiality – How much of the original is used, and is it the “heart” of the work?
  4. Effect on the Market – Does this use harm the copyright holder’s ability to profit from their work?

What Do Past Cases Tell Us About Fair Use?

Google Books Case (Authors Guild v. Google, 2015): Google scanned millions of books to make them searchable, showing only small snippets of text. The Second Circuit ruled this was fair use because:
  • It was highly transformative—it helped people find books rather than replacing them.
  • The snippets were not a market substitute—nobody was reading full books this way.
  • Instead of harming book sales, it actually helped readers find books to purchase.

Google Search Indexing (Perfect 10 v. Google, 2007): Google’s image search displayed thumbnail previews linking to full-size images. The Ninth Circuit ruled this was fair use because:
  • It served a different function—helping users find images, not replacing the originals.
  • Any market harm was speculative—there was no proof Google’s thumbnails hurt sales.

LinkedIn Scraping Case (hiQ Labs v. LinkedIn, 2019): hiQ Labs scraped publicly available LinkedIn profiles to analyze workforce data. LinkedIn tried to shut the scraping down, citing its terms of service, but the Ninth Circuit ruled that scraping publicly accessible data wasn’t illegal under the Computer Fraud and Abuse Act (CFAA). The case still raised bigger questions about data ownership and fair use, and it matters for AI because it highlights the legal gray area of using publicly available content for AI training—does scraping data for machine learning function like search indexing (which courts favor), or does it unfairly compete with content creators?

When Courts Say “Nope” to Fair Use

When a copied work competes directly with the original, courts usually rule against fair use:

  • Texaco Case (American Geophysical Union v. Texaco, 1994) – Texaco photocopied journal articles for internal research. The court ruled this wasn’t fair use because Texaco could’ve just bought the licenses, and widespread copying threatened the scientific journal market.
  • Meltwater Case (Associated Press v. Meltwater) – Meltwater, a news aggregation service, copied AP excerpts. The court ruled this wasn’t fair use because it replaced a licensable market for news monitoring services.

How Does This Apply to AI Training?

AI models like ChatGPT train on huge datasets, including copyrighted text. Courts will likely analyze this under fair use principles by asking:

  • Is AI training transformative? AI companies argue that their models learn patterns rather than copying content. This mirrors Google Books, where scanning books for search indexing was deemed transformative.
  • Does AI-generated text replace the original? If AI can generate news summaries or books, it might compete with the markets for journalism, books, or educational content—similar to Meltwater replacing a paid service.
  • Is there a licensing market? If publishers and authors start licensing data for AI training, unlicensed use could be seen as market harm—like in Texaco, where academic publishers had a functioning licensing system.

The outcome of ongoing lawsuits will determine how courts see AI’s role in the content economy. If AI models start functioning as substitutes for original content, expect stricter copyright enforcement. If they’re seen as research tools, fair use might hold up.

Industry-Specific Market Harm Considerations

  1. News & Journalism – AI-generated summaries may reduce clicks on original articles, hurting ad revenue and subscriptions (New York Times v. OpenAI argues AI responses replace direct readership).
  2. Book Publishing – Authors claim AI-generated text could compete with traditional books and summaries (Authors Guild v. OpenAI argues AI models reduce demand for original works).
  3. Education & Academic Publishing – AI-generated study materials could cut into textbook sales, another market-substitution argument publishers can press.
  4. Creative Writing & Film – AI-generated scripts or novels could impact demand for human writers (plaintiffs in Authors Guild v. OpenAI, including George R.R. Martin, argue that models mimicking authors threaten their markets).

The Future of AI and Copyright Law

Current lawsuits (New York Times v. OpenAI, Authors Guild v. OpenAI) will set precedents for AI copyright law. Possible outcomes include:

  • AI training as fair use – If courts find AI models transformative and non-substitutive.
  • AI training as infringement – If courts rule that it undermines a viable licensing market.
  • New licensing systems – AI companies may have to pay creators, much as music royalty systems do.

Wrapping It Up

So, what’s the big takeaway? AI and copyright law are in a messy, ongoing battle. Will AI companies get a free pass under fair use, or will copyright holders demand licensing fees? We don’t know yet, but these decisions will shape the future of AI.

My bet? AI companies will create new markets where content creators can contribute and get paid—like YouTube does for video creators. Instead of just scraping data, AI firms will likely find ways to reward quality content, making it a win-win for tech and creatives alike.