Culture War Roundup for the week of May 1, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

The Writers Guild of America (WGA) is on strike as of May 2nd, after negotiations with the Alliance of Motion Picture and Television Producers (AMPTP) broke down. While most of their demands deal with the way pay and compensation are structured in the streaming era, on the second page, towards the bottom, is:

ARTIFICIAL INTELLIGENCE

  • WGA PROPOSAL: Regulate use of artificial intelligence on MBA-covered projects: AI can’t write or rewrite literary material; can’t be used as source material; and MBA-covered material can’t be used to train AI.
  • AMPTP OFFER: Rejected our proposal. Countered by offering annual meetings to discuss advancements in technology.

I think this is an interesting first salvo in the fight over AI in creative professions. While this is just where both parties are starting for strike negotiations, and either could shift towards a compromise, I still can't help but see a hint that the AMPTP isn't super interested in forgoing the use of AI in the future.

In 2007, when the WGA went on strike for three months, it had a huge effect on television at the time. There was a shift to unscripted programming, like reality television, and some shows with completed scripts that had been on the back burner got fast-tracked to production. Part of me doubts that generative AI is really at the point where this could happen, but it would be fascinating if the AMPTP companies didn't just use traditional scabs during this strike, but supplemented them with generative AI in some way. Maybe instead of a shift to reality television, we'll look back on this as the first time AI became a significant factor in the production of scripted television and movies. Imagine seeing a "prompt engineer" credit at the end of every show you watch in the future.

It'll be interesting to see how this all plays out.

and MBA-covered material can’t be used to train AI

Is this even legal? AFAICT there’s no abstract ownership of concepts or ideas that copyright holders can claim, only claims against produced works. So a copyright holder can sue someone who uses AI to generate similar content to what is copyrighted, but not for using a work as training data per se. Sounds like the writers should be picketing Congress too.

I'd argue that a neural net is a derivative work of its training data, so its mere creation is a copyright violation.

But you could make a similar argument that a human brain is a derivative work of its training data. Obviously there are huge differences, but are those differences relevant to the core argument? A neural net takes a bunch of stuff it's seen before and then combines ideas and concepts from them in a new form. A human takes a bunch of stuff they've seen before and then combines ideas and concepts from them in a new form. Copyright laws typically allow for borrowing concepts and ideas from other things as long as the new work is transformative and different enough that it isn't just a blatant ripoff. Otherwise you couldn't even have such a thing as a "genre", which all share a bunch of features that they copy from each other.

So it seems to me that, if a neural net creates content which is substantially different from any of its inputs, then it isn't copying them in a legal sense or moral sense, beyond that which a normal human creator who had seen the same training data and been inspired by them would be copying them.

The dystopian take is obviously that the copyright lawyers will come for the brain next: experiencing copyrighted media without paying for it will be criminalized.

Who is the ‘human’ in this example?

That's an entirely different question. Obviously the LLM is not itself a human, but neither is a typewriter or a computer that a human uses as a tool to write something. So probably the copyright holder would be the person who prompts the LLM and then takes its output and tries to publish it, especially if they are responsible for editing its text and don't just copy-paste it unchanged. You could make an argument that the LLM creator is the copyright holder, or that the LLM is responsible for its own output, which is then uncopyrightable since it wasn't produced by a human.

But regardless of how you address the above question, it doesn't change my main point: the AI does not violate the copyrights of the humans whose work it takes as input any differently than a human doing the same things would. Copyright law is complicated, but there's a long history and a lot of precedent, and individual issues tend to get worked out. For this purpose, the LLM, or a human using an LLM as an assistant, should be subject to the same constraints that human creators already are. They're not "stealing" any more or less than humans already do by consuming each other's work. You don't need special laws or rules or restrictions that don't already exist.

You can't reason by analogy with what humans do, because LLMs are not human. They are devices, which contain data stored on media. If that data encodes copyrighted works, they are quite possibly copyright violations. If I memorize the "I have a dream" speech, the King estate can do nothing to me. They can bust me for reciting it in public, but I can recite it in private all I want (though I could get in trouble for writing it down). If I can ask an LLM for the "I have a dream" speech and it produces it, I have proven that the LLM contains a copy of the "I have a dream" speech and is therefore a copyright violation. And that's just the reproduction right; the derivative work right is even wider.

If I can ask an LLM for the "I have a dream" speech and it produces it, I have proven that the LLM contains a copy of the "I have a dream" speech and is therefore a copyright violation.

Except that LLMs don't explicitly memorize any text; they generate it. It's the difference between storing an explicit list of all the numbers 1 to 100, {1, 2, 3, ..., 100}, and storing a rule, {f(n) = n : n in [1, 100]}, that can be used to generate the list. The model has a complicated set of learned relationships between words, refined to the point that if it sees the words "Recite the 'I have a dream' speech verbatim", it has a very good probability of producing each word correctly. At least I think the better versions do; many of them would not actually get it word for word, because none of them have it actually memorized, they're generating it anew.
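
Roughly that distinction, as a toy sketch (everything here is my own illustration; nothing comes from an actual LLM):

```python
# Two ways to "have" the numbers 1 through 100.

# Explicit storage: the data itself is kept.
stored = list(range(1, 101))        # [1, 2, 3, ..., 100]

# Generative rule: no list is kept, only a procedure
# that can reproduce the list on demand.
def rule():
    for n in range(1, 101):
        yield n

assert stored == list(rule())       # identical output, different provenance
```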

Now granted, you can strongly argue, and I would tend to agree, that a word-for-word recitation by an LLM of a copyrighted work is a copyright violation, but this is analogous to being busted for reciting it in public. The LLM learning from copyrighted works is not a violation, because during training it doesn't copy them; it learns from them and changes its own internal structure in ways that improve its generating function, making it more capable of producing works similar to them without actually copying or remembering them directly. And it doesn't create an actual verbatim copy unless specifically asked to (and even then it is likely to fail, because it doesn't have a copy stored and has to generate it from its function).

Imagine I create some wacky compression format that encodes multiple copyrighted images into a single file, returning one of them when you hit it with a decryption key corresponding to the name of the image. The file "contains" nothing but line noise, but if you run it through the decompressor with the key "Mickey Mouse" it will return a copyright violation.

Is passing this file around on Napster (or whatever the kids are doing these days) a copyright violation?
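
For concreteness, a minimal sketch of such a format (the scheme and every name in it are hypothetical): XOR each work against a keystream derived from its name, so the stored bytes look like line noise until the right key is supplied.

```python
import hashlib

def keystream(key: str, length: int) -> bytes:
    """Expand a key into a pseudorandom pad (counter mode over SHA-256)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(f"{key}:{counter}".encode()).digest()
        counter += 1
    return out[:length]

def pack(works: dict[str, bytes]) -> dict[str, bytes]:
    """Store each work XORed with a pad derived from its name, indexed by
    a hash of the name. Without a name, every entry is just noise."""
    packed = {}
    for name, data in works.items():
        slot = hashlib.sha256(name.encode()).hexdigest()
        pad = keystream(name, len(data))
        packed[slot] = bytes(a ^ b for a, b in zip(data, pad))
    return packed

def unpack(packed: dict[str, bytes], name: str) -> bytes:
    """The right key turns the 'noise' back into the original work."""
    blob = packed[hashlib.sha256(name.encode()).hexdigest()]
    return bytes(a ^ b for a, b in zip(blob, keystream(name, len(blob))))

noise_file = pack({"Mickey Mouse": b"<image bytes would go here>"})
assert unpack(noise_file, "Mickey Mouse") == b"<image bytes would go here>"
```

The file itself looks statistically random, yet the work is recoverable on demand, which is exactly the tension the thought experiment is pointing at.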

From 17 USC § 101:

“Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.

Thus an LLM which can reproduce a copyrighted work is a copy of it, regardless of whether it is "generating" it or not.

Once again it is unclear to me how this is very different from a human who reads a bunch of scripts/novels/poems and then produces something similar to what he studied.

There’s a lot of different ways you could look at it, but I think I might just say that the principle of “if you use someone else’s work to build a machine that replaces their job, then you have a responsibility to compensate that person” just seems axiomatic to me. To say that the original writers/artists/etc are owed nothing, even though the AI literally could not exist without them, is just blatantly unfair.

Is it not different from the early factory laborers building the machines that would replace them? Or, maybe more aptly, the carriage companies that supplied the Ford factories? They were paid fairly enough for the production. That was the narrow agreement, not that no one else could be inspired by it or build an inspiration machine. To be replaced by people who were inspired by your works is the fate of every artist in history, at least those who didn't wallow in obscurity.

Is it not different from the early factory laborers building the machines that would replace them?

They consented and were paid. It's not analogous at all.

They produced the media, which is being consumed and paid for under the current payment model. They are being compensated for it regularly and under all the fair and proper agreements. The AI is trained off the general inspiration in the air, which is also where the artists pulled their own works from. It's a common resource emitted by everyone in the culture to some degree. The Disney corporation did not invent its stories from whole cloth; it took latent myths and old traditional tales from us, packaged them for us, and the ideas return to us. Now we're going to have a tool that can also tap this common vein, and more equitably? This is democratization. This is equality. This is progress.

Last week I not so popularly defended copyright, and I still believe it's the best compromise available to us. But it doesn't exist because of a fundamental right, it exists because it solves a problem we have with incentivizing the upfront cost of creating media. If these costs can be removed from the equation then the balance shifts.

Last week I not so popularly defended copyright, and I still believe it's the best compromise available to us. But it doesn't exist because of a fundamental right

How do you feel about software license agreements? Plenty of software/code is publicly visible on the internet and can be downloaded for free, but it's accompanied by complex licensing terms that state what you can and can't do with it, and you agree to those terms just by downloading the software. Do you think that license agreements are just nonsense? Once something is out on the internet then no one can tell you what you can and can't do with it?

If you think that once a sequence of bits is out there, it's out there, and anyone can do anything with it, then it would follow that you wouldn't see anything wrong with AI training as well.

Because the law doesn't consider your memory or the arrangement or functioning of your neurons to be a tangible medium of expression, but a computer memory (including RAM, flash, spinning rust, paper tape, whatever) is. If you see a work and produce something similar, what you produce might indeed be a copyright violation (and there are a lot of lawsuits about that), but your mind can't itself be a copyright violation. A neural network in a computer can be.

A neural network in a computer can be.

This doesn't seem to follow either. Maybe what the computer produces could be a violation, but the actual information contained within it doesn't resemble the copyrighted material in any way we can determine. At least that's based on my understanding of how the trained AI works.

“Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.

The fact that the information contained within the LLM doesn't, in some sense, resemble the copyrighted material isn't relevant, nor should it be; the fact that we can get it out demonstrates that it is in there.

Which is why I'm confused as to why this does not apply to human memory, other than the courts simply drawing a distinction between human hardware and electronic systems.

And then, even accepting your point:

the fact that we can get it out demonstrates that it is in there.

Then presumably putting sufficient controls on the system so that it WILL NOT produce copyrighted works on demand solves the objection.
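
Concretely, one such control might be an output-side filter that withholds anything sharing a long verbatim run with a corpus of protected works. A toy sketch (the function names and the 12-word threshold are illustrative, not any vendor's actual system):

```python
def shingles(text: str, n: int = 12) -> set[str]:
    """All n-word runs in a text, normalized for matching."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_blocklist(protected_works: list[str], n: int = 12) -> set[str]:
    """Index the long verbatim runs from every protected work."""
    blocked: set[str] = set()
    for work in protected_works:
        blocked |= shingles(work, n)
    return blocked

def filter_output(generated: str, blocked: set[str], n: int = 12) -> str:
    """Refuse any output that reproduces a long run word for word."""
    if shingles(generated, n) & blocked:
        return "[output withheld: overlaps a protected work]"
    return generated
```

Note that this only gates what comes out; the weights themselves are untouched, which is the point the reply below picks up on.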

We also run into the 'Library of Babel' issue. If you have a sufficiently large pile of randomized information, then it probably contains 'copies' of various copyrighted works that can be extracted.

So an AI that is trained on, and in some sense 'contains', the entire corpus of human-created text might be said to contain copies of various works, but the incredibly vast majority of what it contains is completely novel, unrelated information, which users can generate at will too.

Which is why I'm confused as to why this does not apply to human memory, other than the courts simply drawing a distinction between human hardware and electronic systems.

There may be no other distinction.

Then presumably putting sufficient controls on the system so that it WILL NOT produce copyrighted works on demand solves the objection.

Only if the controls are inseparable from the system. If I can take your weights and use them in an uncensored system to produce the copyrighted works, they were still in there, just as rigging a DVD player not to play a DVD of "Return of the Jedi" wouldn't mean "Return of the Jedi" wasn't on the disc.

We also run into the 'Library of Babel' issue. If you have a sufficiently large pile of randomized information, then it probably contains 'copies' of various copyrighted works that can be extracted.

If the randomized information was generated without reference to the copyrighted works, this doesn't matter. That's not the case with the AIs; they had the copyrighted works as training data.

I believe that's actually unclear; it's the crux of many ongoing lawsuits, including the one against GitHub.

I see convincing cases both that this is similar to human learning and that it's plagiarism.

Ultimately it is the legislature that's going to have to intervene to settle this; otherwise the courts will follow the law in whatever silly direction it goes when applied to an object it was never intended to deal with.

Personally I'd be glad to see copyright effectively abolished for essentially anarchist reasons. But I would rather we do this on purpose than by accident.

I think this is something that has to get figured out fast. I agree that right now they can't really regulate that. At most, the AMPTP companies could themselves agree not to train their own AI instances on it, but what's stopping anyone from using AIs already trained on that material?

I have stated elsewhere that I'm personally disposed toward supporting some kind of expansion of copyright-adjacent rights that includes training rights. But that would have to be a somewhat globally coordinated legal effort.

I have stated elsewhere that I'm personally disposed toward supporting some kind of expansion of copyright-adjacent rights that includes training rights.

Yes, and those training rights should be called universal basic income. Every voice in culture contributes to training these AIs, just like the voices given the largest audiences do. These systems funnel the culture we all produce into media, and they are supported by us. If the ultimate storytelling machine is going to be produced from our culture, it belongs to us all; the specific creatives who used to do this labor have already been fairly compensated.

Of course it is legal. Parties can always contract to provide greater protections than the minimum provided by copyright law. Of course, those protections only apply to the parties to the agreement.

Guess that would mean no WGA content on broadcast TV or YouTube clips?