
Culture War Roundup for the week of May 1, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Who is the ‘human’ in this example?

That's an entirely different question. Obviously the LLM is not itself a human, but neither is a typewriter or a computer which a human uses as a tool to write something. So probably the copyright author would be the person who prompts the LLM and then takes its output and tries to publish it, especially if they are responsible for editing its text and don't just copy-paste it unchanged. You could make an argument that the LLM's creator is the copyright holder, or that the LLM is responsible for its own output, which is then uncopyrightable since it wasn't produced by a human.

But regardless of how you address the above question, it doesn't change my main point that the AI does not violate copyrights of humans it uses input from in any way differently from a human doing the same things that it does. Copyright law is complicated, but there's a long history and a lot of precedents and individual issues tend to get worked out. For this purpose, the LLM, or a human using LLM as an assistant, should be subject to the same constraints that human creators already are. They're not "stealing" any more or less than humans already do by consuming each other's work. You don't need special laws or rules or restrictions on it that don't already exist.

You can't reason by analogy with what humans do, because LLMs are not human. They are devices, which contain data stored on media. If that data encodes copyrighted works, they are quite possibly copyright violations. If I memorize the "I have a dream" speech, the King estate can do nothing to me. They can bust me for reciting it in public, but I can recite it in private all I want (though I could get in trouble for writing it down). If I can ask an LLM for the "I have a dream" speech and it produces it, I have proven that the LLM contains a copy of the "I have a dream" speech and is therefore a copyright violation. And that's just the reproduction right; the derivative-work right is even wider.

If I can ask an LLM for the "I have a dream" speech and it produces it, I have proven that the LLM contains a copy of the "I have a dream" speech and is therefore a copyright violation.

Except that LLMs don't explicitly memorize any text; they generate it. It's the difference between storing an explicit list of all the numbers from 1 to 100, {1, 2, 3, ..., 100}, and storing a rule, {f(n) = n : n in [1, 100]}, that can be used to generate the list. The model has a complicated set of relationships between words, refined to the point that if it sees the prompt "Recite the 'I have a dream' speech verbatim", it has a good probability of producing each word correctly. At least I think the better versions do; many of them would not actually get it word for word, because none of them actually have it memorized, they're generating it anew.
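The stored-list versus generating-rule distinction above can be made concrete. This is a toy illustration only, not a claim about how any particular LLM is implemented: both objects below "produce" the same list, but one holds the data explicitly while the other holds only a rule.

```python
# Explicit storage: the data itself is kept in memory.
stored = list(range(1, 101))

# Generative: only a rule f(n) = n over [1, 100] is kept;
# the list is produced on demand, not retrieved.
def generate():
    return [n for n in range(1, 101)]

# Externally the two are indistinguishable by their output.
assert stored == generate()
```

The legal question in the thread is whether that external indistinguishability is what matters (as the 17 USC 101 argument holds) or whether the internal representation does.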

Now granted, you can strongly argue, and I would tend to agree, that a word-for-word recitation of a copyrighted work by an LLM is a copyright violation, but this is analogous to being busted for reciting it in public. The LLM learning from copyrighted works is not a violation, because during training it doesn't copy them; it learns from them, changing its own internal structure in ways that improve its generating function, so that it becomes more capable of producing works similar to them without actually copying or remembering them directly. And it doesn't create a verbatim copy unless specifically asked to (and even then it is likely to fail, because it doesn't have a copy stored and has to generate it from its function).

Imagine I create some wacky compression format that will encode multiple copyrighted images into a single file, returning one of them when you hit it with an encryption key corresponding to the name of the image -- the file "contains" nothing but line noise, but if you run it through the decompressor with the key "Mickey Mouse" it will return a copyright violation.

Is passing this file around on Napster (or whatever the kids are doing these days) a copyright violation?

I'm pretty sure it is, if that's the intended use case of that file and people other than you know about the decryption method. On the other hand, literally any data file of a certain length (call it A) can be turned into literally any other data file of the same length (B) if you hit it with exactly the right "decryption" (B - A) by just adding the bits together. So if you take this idea too far, every file is secretly an encrypted Mickey Mouse under the right key.

There's something nontrivial in here about information theory. If the copyrighted image has 500 kb of data, and your "encrypted file" is 500 kb, and the decryption key "Mickey Mouse" is 12 bytes, then clearly the file must contain the copyright violation. If you make an "encrypted file" with 12 kb and some wacky compression algorithm that requires 500 kb to encode and is specifically designed to transform the string "Mickey Mouse" into a copyrighted image, then yeah, that algorithm is a copyright violation.

On the other hand, if you use a random number generator to generate a random 500 kb number A, and then compute C = (B - A) where B is your copyrighted image, then in isolation both A and C are random numbers. If you just distribute A and nobody has any way of knowing or guessing C, then no copyright violation has occurred. If you just distribute C and nobody has any way of knowing or guessing A, then no copyright violation has occurred. But if you distribute them together, or if you distribute one and someone else distributes the other, or if one of them is a commonly known or guessable number, then you're probably violating copyright and trying to get away on a technicality.
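The split-key argument above is essentially a one-time pad. A minimal sketch, using XOR instead of the byte subtraction described above (the information-theoretic point is identical), with a short placeholder string standing in for the hypothetical copyrighted work B:

```python
import secrets

# Stand-in for the copyrighted work B (hypothetical example data).
B = b"I have a dream..."

# A: a truly random pad the same length as B.
A = secrets.token_bytes(len(B))

# C = B XOR A. Both A and C, taken alone, are statistically
# indistinguishable from random noise; neither determines B.
C = bytes(b ^ a for b, a in zip(B, A))

# Only someone holding BOTH halves can reconstruct the work.
recovered = bytes(c ^ a for c, a in zip(C, A))
assert recovered == B
```

This is why distributing A or C alone plausibly copies nothing, while distributing the pair does: the information content of B exists only in the combination, which mirrors the thread's point that intent and information-theoretic contribution, not mere mathematical possibility, should decide what counts as "containing" a work.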

But it's not enough for it to simply be possible to "decrypt" something into another thing. A string of pure 0s can be "decrypted" into any image or text. A word processor will generate any copyrighted text if the user presses the right keys in the right combination. I think there has to be some level of intent or ease or information theory value such that the file is doing the majority of the work.

So I'll concede that if you make an LLM that will easily reproduce copyrighted material from simple descriptions and passwords, then I can see there being issues there, similar to how an author who keeps spitting out blatant ripoffs of copyrighted works with a couple of words changed will get in trouble. But simply having used copyrighted works in the training material is not itself a copyright violation. A robust LLM that has trained on lots of copyrighted material but refuses to replicate it verbatim is not a copyright violation simply for having learned from it (which seems to be the primary objection artists have, rather than the actual reproduction of their work, which I would agree is bad).

Indeed, I claim that's closer to description than analogy. An LLM is a way of encoding (lossily) a whole lot of textual data in a very opaque form, in a way that you can get much of that data out by giving fairly intuitive prompts.

From 17 USC 101

“Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.

Thus an LLM which can reproduce a copyrighted work is a copy of it, regardless of whether it is "generating" it or not.

According to your interpretation a pencil and a blank notebook would be a copy of a copyrighted work - it is entirely possible to use that pencil to write down a copy of the script to a recent Hollywood movie, which means that any blank piece of media is effectively a copy of any copyrighted work which can be expressed upon it.

No, the blank notebook is clearly not a copy of a copyrighted work, until you actually write down the script; the work is not fixed in it. There's no way to get the blank notebook to reproduce the copyrighted work without putting it on the notebook yourself. If you could write "The Fabelmans" at the top of the blank notebook and the rest of the script would appear, THEN you could validly argue the (apparently) blank notebook is a copy of a copyrighted work.

I don't think there's actually a distinction here. From my experience working and experimenting with ChatGPT, actually getting it to write down the original script of something like that without errors or hallucinations is much more involved than just writing the name at the top of the page. You can make ChatGPT recreate a script with a bunch of wrangling but I just don't see as much of a distinction between doing that and just writing out the script with a pen and paper. I don't see a way to get from trained ChatGPT to Actual, Accurate Copy of a Script that doesn't also incriminate blank paper.

I don't think there's actually a distinction here.

That's utterly ridiculous. There's an enormous difference between a device that will display a copy of a work after someone writes the work onto it and one which will do so if you just ask it. For the purpose of copyright law, errors and hallucinations don't matter if the result is close enough; a slightly corrupted copy is still a copy.