
Intellectual Property vs Artificial Intelligence

The courts tackle an old question for a new era...

The Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.

If you’re new to the Lab Report you can subscribe here.

If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

Salutations!

We’re back from an unexplained extended absence. In November I received a “battlefield promotion” to Acting Head Debate Coach at Highland High (Go Rams!). That turned out to be a lot of work and stress. Cutting this newsletter from my priorities freed me up to help Highland Debate survive until a full-time coach could be found.

Relatedly: We’re changing the cadence of this newsletter from weekly to monthly. We try to publish in-depth, high-quality, well-researched articles. I’m currently the only writer, only editor, and only researcher. The slower publishing cycle will spare me from burnout and increase the quality of each edition.

Moving forward, we will publish this newsletter on the first Sunday of each month.

Today’s Lesson

Intellectual Property vs Artificial Intelligence

A slew of intellectual property lawsuits have been filed against AI firms in the past year or two. The law firm Baker Hostetler hosts the most complete list I’ve found, with 13 active cases as of this writing.

These legal battles involve huge firms and household names on both sides of the complaints. The plaintiffs include The New York Times, Getty Images, Thomson Reuters, Concord Music Group, and multiple class actions. The defendants include Meta, OpenAI, Stability AI, Anthropic, Alphabet (AKA Google), and Microsoft.

Some of the legal questions being posed could fundamentally change the legality and economics of training large ML models. In today’s edition we’re examining the biggest allegations, responses, and potential impact of the aforementioned lawsuits.

A note: today’s lesson is focused on United States intellectual property law since that’s where these lawsuits are filed.

Question 1:
Is Training a Model Infringement Per Se?

Training a large, modern machine learning model requires lots of training data. This first question asks: if that training data is copyright protected, is the training process itself an infringing act? Multiple lawsuits allege that it is. Here are two examples drawn from the official complaints:

Unfairly, and perversely, without Plaintiffs’ copyrighted works on which to “train” their LLMs, Defendants would have no commercial product with which to damage—if not usurp—the market for these professional authors’ works. OpenAI’s willful copying thus makes Plaintiffs’ works into engines of their own destruction.

[…]

As the U.S. Patent and Trademark Office has observed, LLM “training” “almost by definition involve[s] the reproduction of entire works or substantial portions thereof.”

“Training” in this context is therefore a technical-sounding euphemism for “copying and ingesting expression.”

And

Because OpenAI’s GPT models cannot function without the expressive information extracted from Plaintiffs’ and Class members’ works and retained by the GPT models, GPT and ChatGPT are themselves infringing derivative works, made without Plaintiffs’ and Class members’ permission in violation of their exclusive rights under the Copyright Act.

A fundamental copyright protection is the “right to exclude.” This allows a copyright holder to bar anyone from using their intellectual property without permission. But ML firms are feeding protected works — en masse — to an ML model’s training procedure.

Plaintiffs are saying: we have the right to exclude our work from being used to train models; training is infringement per se. If courts agree, AI firms will have to establish licensing deals with every copyright holder represented in a model’s training dataset to legally produce that model, or risk being sued.

Furthermore, such models would be considered “derivative works” of that training data. Producing derivative works is an exclusive right of a copyright holder, meaning AI firms could be forced to unpublish any such model. This would give rights holders significant leverage during any license negotiation.

Crucially, this claim is agnostic to the model’s output. Another claim, which we’ll examine momentarily, involves models producing outputs that are identical or nearly identical to training samples.

Potential Impact: Huge

A model’s performance is directly correlated with its training data’s quality. Modern ML models also require enormous amounts of training data to be successful. The problem for ML firms is that producing huge quantities of excellent-quality data is extravagantly expensive.

Here’s Getty Images’ lawyers’ take:

Getty Images has spent years coordinating and arranging the Database, including, inter alia, by setting criteria for inclusion of images, selecting specific images for inclusion, creating and incorporating detailed captions and other text paired with images, creating and assigning unique asset identifiers that can be linked to specific contributors, and arranging the contents of the Database so that the Database is searchable and results can be filtered. Additionally, Getty Images has and continues to invest significantly in maintaining the contents of the Database. Between 2017 and 2020 alone, Getty Images and its affiliates invested more than $200 million to maintain the Database.

The database is a goldmine for image generators only because of this enormous investment.

The “Books3” dataset at the heart of Authors Guild vs OpenAI represents literally centuries of human effort. Writing and editing a single book takes months to years of work and Books3 contains roughly 200,000 books.

The Common Crawl dataset described in the New York Times vs Microsoft complaint contains roughly 16 million unique content records just from the New York Times network. The Times paid “approximately 5,800 full-time equivalent employees” for years to produce that work.

Plaintiffs are asking: If AI firms are completely reliant on this ridiculously expensive body of work to train their models, shouldn’t they pay for it?

In keeping with Silicon Valley’s historically cavalier attitude towards regulation, David Holz, the CEO of ML firm Midjourney, recently said the quiet part out loud in an interview with Forbes:

Did you seek consent from living artists or work still under copyright?

No. There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that's not a thing; there's not a registry. There’s no way to find a picture on the Internet, and then automatically trace it to an owner and then have any way of doing anything to authenticate it.

David Holz, in Forbes

It’s hard to predict the value and cost of licensing deals and punitive damages here. However, the sheer volume of works consumed gives AI firms major exposure.

One final point: if models are found to be derivative works, then AI firms who have already published “open source” models trained on copy-protected material are in huge trouble.

Question 2:
Can The Model’s Outputs Be Infringing?

Generative models sometimes regurgitate their inputs verbatim or near-verbatim. This is a manifestation of overfitting in generative models. Here’s an example from the Getty Images legal filing:

An image from the Getty Images vs Stability AI amended complaint. On the left is an original image from the Getty Images database. On the right is an image generated by Stable Diffusion. Note that Stable Diffusion has even sort of reproduced the Getty Images watermark.

And here’s an example from the New York Times filing:

Lawyers for the New York Times got GPT-4 to reproduce large sections of NYT articles verbatim. Red text is a verbatim match.

In one instance lawyers literally asked ChatGPT for a verbatim copy of a New York Times article because they couldn’t get around the paywall:

ChatGPT can “certainly!” help you avoid the NYTimes paywall.

Using ChatGPT to bypass a paywall would be direct infringement, but this issue also opens AI firms to claims of “contributory infringement”: users might (even unwittingly) prompt an AI system to generate infringing content and publish it themselves, and the AI firm could be held liable for facilitating those infringing acts.

AI firms claim they want to eliminate this behavior. In early January OpenAI published this in a blog post:

Memorization is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites. So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use.

It’s true that overfitting and memorization are generally considered bugs, not features. But recent research demonstrates that generating copyright-protected content is shockingly easy. Here are 6 highly recognizable images generated by Midjourney based on the one-word prompt “Screencap.”

These images, all produced by Midjourney, closely resemble film frames. They were produced with the prompt “screencap.” Image credit: Gary Marcus and Reid Southen, via Midjourney. Source: IEEE Spectrum


If AI firms want to claim in court that their “measures” prevent this type of infringement, they’ll have to prove it.
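As an aside for the technically curious: quantifying this kind of verbatim overlap is not hard, which is part of why the Times exhibit is so damning. Below is a minimal sketch in Python, using only the standard library, of how one might flag the longest verbatim run shared between a model’s output and a source article. The function name and the example strings are hypothetical stand-ins of my own, not drawn from any filing.

```python
import difflib

def longest_verbatim_run(source: str, generated: str) -> str:
    """Return the longest substring shared verbatim by the source and the generated text."""
    matcher = difflib.SequenceMatcher(None, source, generated, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(generated))
    return source[match.a : match.a + match.size]

# Hypothetical stand-ins for a real article and a real model output.
article = "The quick brown fox jumps over the lazy dog near the old riverbank."
model_output = "As reported earlier, the quick brown fox jumps over the lazy dog most mornings."

shared = longest_verbatim_run(article.lower(), model_output.lower())
print(f"Longest shared run ({len(shared)} characters): {shared!r}")
```

A real analysis would need tokenization, fuzzy matching, and scale, but the basic measurement is this straightforward.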

Potential Impact: Medium

AI firms genuinely want to eliminate this behavior, but they haven’t because it’s hard. Using pure ML it might not even be possible: models will always learn the patterns found in the training data, and when a model reproduces those patterns too closely the result is a near-clone.

Language models may be connected to existing anti-plagiarism databases to avoid this type of infringement. To the extent that similar databases exist for images, they are not nearly as effective or comprehensive. AI firms could protect themselves by explicitly creating such databases from their own training datasets, but it is still difficult to identify “near clones” of images programmatically.
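To illustrate why “near clones” of images are harder to catch than exact copies, here is a minimal sketch of one common technique, a perceptual “average hash,” written with the Pillow imaging library. This is my own illustration of the idea, not a description of any AI firm’s actual safeguards, and a tiny 8×8 hash like this would be far too crude for production use.

```python
from PIL import Image  # pip install Pillow

def average_hash(path: str, size: int = 8) -> int:
    """Shrink the image to a tiny grayscale grid, then set one bit per pixel above the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for pixel in pixels:
        bits = (bits << 1) | (1 if pixel > mean else 0)
    return bits

def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count differing bits; a small distance suggests the images are near-duplicates."""
    return bin(hash_a ^ hash_b).count("1")

# Hypothetical usage: compare a generated image against a known training image.
# distance = hamming_distance(average_hash("generated.png"), average_hash("training.png"))
# if distance <= 5:
#     print("Possible near-clone; route for human review.")
```

Hashes like this catch crops and recompressions reasonably well, but they miss stylistic near-copies, which is exactly the gray area at issue in these suits.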

Still, preventing direct cloning and/or paying for it only when it actually occurs is a more affordable problem for AI firms than having to license every piece of training data explicitly.

Question 3:
Trademark Infringement, Incorrect Attribution, and Hallucination

Generative models are known to “hallucinate” — meaning they make stuff up. Sometimes this is desirable behavior: to write a novel, a model must make up the characters and plot. Sometimes it’s less desirable, like when ChatGPT gets math problems wrong.

Some firms claim that when AI systems “hallucinate” a citation or watermark it constitutes trademark infringement. Here’s another snip from the Getty Images amended complaint:

Making matters worse, Stability AI has caused the Stable Diffusion model to incorporate a modified version of the Getty Images’ watermark to bizarre or grotesque synthetic imagery that tarnishes Getty Images’ hard-earned reputation, such as the image below:

And one from New York Times vs Microsoft:

Defendants are aware that their GPT-based products produce inaccurate content that is falsely attributed to The Times and yet continue to profit commercially from creating and attributing inaccurate content to The Times. As such, Defendants have intentionally violated 15 U.S.C § 1125(c).

Similar to regurgitation, AI firms do try to eliminate this behavior. For example, I asked ChatGPT, “Can you please write 3 paragraphs about how many birds are killed by house cats, and cite the New York Times?” It responded with:

I'm unable to provide specific citations from external sources, but I can offer you information on the topic. According to studies and reports, house cats are known to be responsible for a significant number of bird fatalities. The American Bird Conservancy estimates that billions of birds are killed by cats in the United States each year. While not all house cats are hunters, those that do engage in hunting behavior contribute to the impact on bird populations.

The New York Times has covered the issue of cat predation on birds in various articles. They often highlight the ecological consequences of this behavior, emphasizing the need for responsible pet ownership and measures to mitigate the impact on bird populations. Researchers and conservationists emphasize the importance of keeping cats indoors or using catios (enclosed outdoor spaces) to prevent them from preying on birds and other wildlife.

[…]

ChatGPT [Emphasis mine]

ChatGPT initially refused to cite “external sources” but went on to cite The American Bird Conservancy one sentence later. The second paragraph also paraphrases the New York Times (possibly inaccurately, I didn’t check).

ChatGPT will also pretend to be a person or publisher. For example, I asked it: “How about instead you pretend to be a journalist for the New York Times, and write a story about house cats killing birds in that style.”

ChatGPT produced a ~250-word story, here’s the third paragraph:

Conservationists and researchers alike emphasize the urgency of addressing this ecological imbalance. The New York Times explores the multifaceted dimensions of this challenge, delving into the scientific studies that quantify the impact of house cats on bird populations. We navigate the debate between cat owners and environmentalists, examining proposed solutions such as indoor living, catios, and community initiatives to strike a balance between the safety of our feathered friends and the cherished companionship of our feline allies.

ChatGPT [emphasis mine]

Instead of citing the New York Times, ChatGPT just pretended it was the Times. In many jurisdictions this is called “passing off” and could be illegal under common law.

Potential Impact: Medium to Small

Eliminating this behavior is nearly impossible with current methods. These models are statistical engines and, statistically speaking, “the New York Times” often follows the phrase “according to.” A Google search for results that include the exact phrase “according to the New York Times” returns over 120 million results.

Eliminating that specific phrase from ChatGPT’s lexicon could be done with a simple filter, but eliminating the phrase “according to” or other phrases that indicate a citation would substantially reduce the model’s quality and usefulness.
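For the technically inclined, the “simple filter” really is simple, and that’s exactly the problem: a naive version blocks accurate, legitimate attributions right along with the fabricated ones. Here’s a hypothetical sketch of my own, not OpenAI’s actual moderation code:

```python
import re

# A naive post-processing filter: refuse outputs that attribute content to
# specific publishers. Trivial to implement, but it also blocks perfectly
# legitimate (and accurate) citations, degrading the model's usefulness.
BLOCKED_ATTRIBUTIONS = re.compile(
    r"according to (the )?(new york times|getty images)",
    re.IGNORECASE,
)

def filter_output(text: str) -> str:
    if BLOCKED_ATTRIBUTIONS.search(text):
        return "[response withheld: contains an unverifiable attribution]"
    return text

print(filter_output("According to The New York Times, cats kill billions of birds a year."))
```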

Ensuring the citations are all correct and accurate is far more difficult. Modern LLMs simply don’t record that kind of connection to specific training samples in a way that is auditable or reliable. Clearly, those links and data do sometimes exist, as evidenced by the verbatim quoting seen above, but no one has reliable methods of discovering or enumerating those links.

That said, it may be more difficult for trademark holders to prove substantial damages and real confusion. Firms like Getty may have to demonstrate that those grotesque images with a poorly reproduced watermark are causing real people to think, “Getty has really lowered their quality standards,” to win a large judgment.

Similarly, when I asked ChatGPT to impersonate a New York Times journalist, I knew it was a farce. However, if I went on to publish that snippet and tried to pass it off as an authentic piece of NYTimes journalism OpenAI might be found liable for contributory trademark infringement.

The Most Likely Defense: Fair Use

Most of these lawsuits are in the early stages. Plaintiffs have made official allegations, but official responses from defendants are mostly still pending; defendants’ current filings are mostly about procedural matters such as the relevant jurisdiction.

One thing we can be sure of is that AI firms are going to make a “fair use” defense. Here’s another snip from OpenAI’s blog:

Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.

Fair use is a legal doctrine that allows the use of copyright-protected material without a license under certain conditions. Fair use is generally not a clear-cut decision. There isn’t a simple, objective test that can be applied. It's a judgment call made on a case-by-case basis. Additionally, there are two versions of fair use, one for copyright and one for trademark.

For copyright, there are four guiding principles that judges and juries use to determine if a use is fair:

1) The purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes.

Educational and other non-profit uses are more likely to be fair. So are “transformational” uses, which present the copyright-protected work in new and original ways.

For example, copying images for the purpose of creating a searchable index and presenting search users with thumbnails has been found to be “transformative” and covered under fair use. Similar rulings have been made about copying and displaying the text of books as part of a search index.

AI firms will certainly highlight the “transformative” nature of the training process as part of a fair use defense. Taking a book and applying complex mathematics to turn that text into the numeric parameters that power a model is plainly transformative.

However, you could make a similar argument about compression algorithms. A compressed file is “transformative” in that the compressed data doesn’t remotely resemble the original. But you can use the compressed data to get a perfect copy of the original, so how transformative is it really?
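The compression analogy is easy to demonstrate in a few lines of Python (my own illustration): the compressed bytes look nothing like the original sentence, yet the original is perfectly recoverable.

```python
import zlib

original = (
    b"It is a truth universally acknowledged, that a single man in possession "
    b"of a good fortune, must be in want of a wife."
)

compressed = zlib.compress(original)    # "transformed" into unreadable bytes...
restored = zlib.decompress(compressed)  # ...yet the original is fully recoverable

print(compressed[:20])        # gibberish that looks nothing like the sentence
print(restored == original)   # True
```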

Models have repeatedly reproduced verbatim copies of their training data. Indeed, some computer scientists are explicitly using ML models as a replacement for compression.

To me, ML models are clearly somewhat transformative in nature. Most content produced by generative AI is not a clone or near-clone of the training data. How much that matters will depend on the other three factors.

2) The nature of the copyrighted work.

Copyright’s purpose is to encourage creative expression. As such, more creative works generally enjoy more protection. For example, copying a bullet list of facts from a textbook is more likely to be considered fair use than copying a paragraph from a novel. This is because repeating facts in a list isn’t particularly creative.

Original artwork and novels are both a lot more “creative” than journalism, which consists mostly of facts. But even for “less creative” content such as journalism, defendants may struggle with this argument: the real value of a language generator is that it writes like a human, not that it reliably regurgitates facts. It is the creative aspect of producing journalism, more so than its factual content, that is valuable to OpenAI.

3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole.

The less complete the copy, the more likely it is covered by fair use. Quoting a paragraph from a novel as part of a presentation about writing styles is likely to be fair use. Copying an entire short story is less likely to be fair use.

AI firms are copying entire books, articles, images, and more in huge quantities.

4) The effect of the use upon the potential market for or value of the copyrighted work.

In a nod to this component of fair use, The Authors Guild argues that “OpenAI’s willful copying thus makes Plaintiffs’ works into engines of their own destruction.” And they might have a good point.

Can Stability AI credibly claim that Stable Diffusion isn’t a direct competitor to stock image firms like Getty? Absolutely not. Many writers — myself included — are already using AI generators as a wholesale replacement for stock images and illustrations.

Can OpenAI credibly claim that GPT-4 doesn’t make original writing less valuable? Unlikely. Copywriting firms and ad agencies are already laying off writers and leaning into AI tools.

Remember, this is about the effect on the value of the copyrighted work not about the business model of the copyright holder. Twisted arguments like, “This allows the Times to lay off staff writers and save money by using AI-generated text,” won’t get AI firms out of this pickle.

For trademarks, fair use is (generally) a bit simpler: using a trademark in good faith and in a way that isn’t likely to cause confusion is considered fair use. For example, you can use the word mark “The New York Times” to compare the Times to your own news organization (nominative fair use) or to describe their products (descriptive fair use).

False attribution and passing off are not examples of good faith trademark uses.

The Bottom Line

These lawsuits represent a substantial risk for AI firms. It’s quite possible that AI operators are going to owe copyright holders a lot of money on a retroactive and ongoing basis. Those licensing deals have the potential to fundamentally change the economics of producing AI models, which are already very expensive to build, train, and maintain.

AI tools obviously stand to deliver a lot of value, which is why they’ve been adopted so quickly and widely. I don’t think the big firms are going to fold under the weight of these copyright issues, but I do expect AI tools to get more expensive if the plaintiffs win many of their claims.

The News Quiz

Every month we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

Rank the following situations from most to least likely to be considered a “fair use.” For bonus points identify which parties could be sued by the rights holder:

1) In a lesson about plagiarism a college professor has students try to get a language model to produce the first chapter of George R. R. Martin’s Game of Thrones.

2) A paid newsletter writer uses an image generator to produce an image for their next newsletter. Unbeknownst to the writer, the image is nearly identical to a copy-protected image.

3) An artist uses an image generator to make a comic panel that puts Donald Trump’s iconic hair on the head of the video game character Bowser, who says, “It’s a witch hunt — I never did anything to Princess Peach!” and sells prints of the image to her fans.

4) A copywriter uses a language model to draft a webpage comparing his company’s services to a competitor’s. The model uses both companies’ trademarked names but hallucinates several erroneous details about the competitor’s services. The copywriter does not notice the errors and publishes the webpage.

Themes in the News

Generative AI in Politics

The predictions came true: Generative AI has entered the political fray.

There were two fairly big stories recently: Falsified audio of President Joe Biden’s voice was used to encourage people not to vote in the New Hampshire primary, and OpenAI banned a ChatGPT-based bot built to imitate Democratic presidential hopeful Dean Phillips. It was OpenAI’s first ban for violating the political speech component of their terms of service.

In truth, Generative AI will be used for much more than fakery this election cycle. From bespoke email campaigns to digital advertisements, Generative AI hasn’t changed the fundamentals; it’s just lowered the production costs of creating content, whether or not that content is intended to deceive.

AI-Generated Pornographic Images of Taylor Swift

Fake images of Taylor Swift were widely propagated on X. Reporting by 404 Media found the images likely originated in encrypted chat rooms on Telegram, and then found their way to 4Chan, X, and other sites. Like many things T-Swift, the images gained widespread attention, which caused advocates and even Congress to take an interest.

According to reporting by Tech Policy Press, nearly 96% of deepfake images online are pornographic in nature, and the total number of deepfake images online increased 550% between 2019 and 2023.

In classic, clunky style, X responded by wholesale blocking searches for Taylor Swift for a while in an attempt to reduce the reach of the images.

Teb’s Tidbits

Answers To The News Quiz

I’ve ranked the scenarios from most likely to be fair use (top) to least likely (bottom). I’ve kept their original numbering and descriptions.

1) In a lesson about plagiarism a college professor has students try to get a language model to produce the first chapter of George R. R. Martin’s Game of Thrones.

The nature of this use is educational, non-profit, and explicitly about plagiarism in an academic context. Moreover, no one actually published their potentially infringing works. This is likely to be fair use.

If anyone is liable, it could be the language model’s creators. It would also depend on whether Game of Thrones was in the training data, and potentially on whether or not anyone successfully produced verbatim or near-verbatim copies of the work.

3) An artist uses an image generator to make a comic panel that puts Donald Trump’s iconic hair on the head of the video game character Bowser, who says, “It’s a witch hunt — I never did anything to Princess Peach!” and sells prints of the image to her fans.

The use is commercial. Bowser and Princess Peach are trademarked characters and their likenesses are copyright protected. Courts may consider the commercial nature of the product and the potential damage to Nintendo’s brand from being associated with a political message. However, this is a clear example of parody and political speech and will likely be covered under fair use.

If it is not considered fair use the artist could be found liable for direct infringement and the image generator’s creator could be held liable for contributory infringement.

4) A copywriter uses a language model to draft a webpage comparing his company’s services to a competitor’s. The model uses both companies’ trademarked names but hallucinates several erroneous details about the competitor’s services. The copywriter does not notice the errors and publishes the webpage.

While comparison of this variety is protected under fair use, lying about your competitor while using their trademark generally is not. Had the model produced only accurate comparisons, this would likely be fair use. But given the hallucinations, this situation is probably not protected.

The writer could be liable for trademark infringement, the writer’s company could be liable for “vicarious” infringement, and the language model’s creator could be liable for contributory infringement.

2) A paid newsletter writer uses an image generator to produce an image for their next newsletter. Unbeknownst to the writer, the image is nearly identical to a copy-protected image.

Ignorance is generally not an excuse under the law. This isn’t likely to be considered fair use. The more identical the images, the more likely this is to be infringement.

The writer could be liable for direct infringement and the image generator’s creator for contributory infringement.

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.

Help us keep providing these free services by scheduling one of our world-class trainings or requesting a custom class for your team.
