
Culture War Roundup for the week of January 16, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Some more heating up in the AI image generation culture wars, with stock image company Getty Images suing Stability AI over alleged copyright violations. Here's Getty Images' full press release:

This week Getty Images commenced legal proceedings in the High Court of Justice in London against Stability AI claiming Stability AI infringed intellectual property rights including copyright in content owned or represented by Getty Images. It is Getty Images’ position that Stability AI unlawfully copied and processed millions of images protected by copyright and the associated metadata owned or represented by Getty Images absent a license to benefit Stability AI’s commercial interests and to the detriment of the content creators.

Getty Images believes artificial intelligence has the potential to stimulate creative endeavors. Accordingly, Getty Images provided licenses to leading technology innovators for purposes related to training artificial intelligence systems in a manner that respects personal and intellectual property rights. Stability AI did not seek any such license from Getty Images and instead, we believe, chose to ignore viable licensing options and long‑standing legal protections in pursuit of their stand‑alone commercial interests.

This follows a separate class action lawsuit filed in California by three artists against multiple image generation AI companies, including Stability AI, Midjourney, and DeviantArt (an art sharing site, but one that seems to be working on building its own image creation model). According to Artnews, "The plaintiffs claim that these companies have infringed on 17 U.S. Code § 106, exclusive rights in copyrighted works, the Digital Millennium Copyright Act, and are in violation of the Unfair Competition law." It seems to me that these two lawsuits are complaining about basically the same thing.

IANAL, and I have little idea of how the courts are likely to rule on this, especially English courts versus American ones. I know there's precedent for data scraping being legal, but those rulings are highly context-dependent; e.g. the Google Books case hinged on the product not being a meaningful competitor to the books that were being scanned, which is a harder argument to make about an AI image generator with respect to a stock image service. In my subjective opinion, anything published on the public internet is fair game for AI training, since others learning from viewing your work is one of the things you necessarily accept when you publish it for public view on the internet. That includes watermarked sample versions of proprietary images that one could buy. However, there's a strong argument to be made for the other side: that a human using an AI to offload the process of learning from viewing images is qualitatively different from a human learning directly by viewing them, such that the existing social contract of publishing for public consumption doesn't account for it and must be amended to carve out an exception for AI training.

Over the past half year or so, I'm guessing AI image generation is second only to ChatGPT in the mainstream attention directed towards AI-related stuff - maybe third, after self-driving cars - so it's unsurprising to me that a culture war has formed around it. But having paid attention to some AI image generation subreddits, I've noticed that the lines still don't really map onto existing culture war lines. There are signs of the left coalescing against AI image generation, with much of the pushback coming from illustrators who are on the left, such as the comic artist Sarah C Andersen, one of the three artists in that class action lawsuit, along with a sort of leftist desire to protect the jobs of lowly paid illustrators by preventing competition. But that's muddled by the fact that, on Reddit, most people are on the left to begin with; the folks who are fine with AI image generation tools (by which I mean the current models trained on publicly available but sometimes copyrighted images) are also heavily on the left, and there are leftist arguments in favor of the tech for opening up high quality image generation to people with disabilities like aphantasia, among others. Gun to my head, I would guess that this trend will continue until, within two years, using "unethically trained AI" to create images is basically considered Nazism, but my confidence in that guess is close to nil.

From a practical perspective, there's no legislation that can stop people from continuing to use the models that are already built, but depending on the results of these lawsuits, we could see further development in this field slow down quite a bit. I imagine that it can and will be worked around, and that restrictions on training data will only delay the technology by a few years. That would mean what I see as the true complaint from stock image websites and illustrators - unfair competition - goes unaddressed, so I would expect this culture war to remain fairly heated for the foreseeable future.

Note how they aren't suing OpenAI (and by extension, Microsoft). It's not just a matter of training data but also of tactical savviness; of course Dall-E 2 (and the next model that's allegedly called Flow and will be integrated with GPT-4) can be artistic enough to compete with human artists, even if it won't be mimicking their particular styles. Indeed, it'd be more interesting to push those models to develop novel styles of their own. I also suspect some LW inspiration in this attack on minor players: Stability, for example, doesn't plan to restrict itself to the pretty picture generator business, and is therefore a problem in the paranoid world model where the fewer actors there are, the less risk.

The question is how hard it'll be to train new foundation models if these people succeed and legitimate, centralized business entities like Stability go under.

The coolest new text to image model, seemingly far superior to Stable Diffusion, Midjourney V4 and even Imagen/Parti in comprehension, is Google's Muse, a pure Transformer (with T5-XXL text encoder, but that's par for the course). It's much faster at inference than its predecessors (Imagen, Parti, even non-distilled SD if you have enough VRAM) and more naturally lends itself to image modification.

We train a number of base Transformer models at different parameter sizes, ranging from 600M to 3B parameters. Each of these models is fed in the output embeddings from a T5-XXL model, which is pre-trained and frozen and consists of 4.6B parameters.

We train on the Imagen dataset consisting of 460M text-image pairs (Saharia et al., 2022). Training is performed for 1M steps, with a batch size of 512 on 512-core TPU-v4 chips (Jouppi et al., 2020). This takes about 1 week of training time.

So that's ~86,000 TPU-v4-hours, which, given the track record of optimizations for SD, I'll take as the budget for getting a good-enough image generator we'd be able to use in perpetuity. Naively, that's something like $120K in compute using A100s, maybe less depending on the provider's terms. Of course, if hostility towards independent AI becomes prevalent and supported by law, it'll be hard to rent a node of 512 A100s (and harder to buy: that's something like $10M in hardware if you're lucky) for such a purpose. On the other hand, if you're working in the shadows, training time and efficiency don't matter as much, and you can try to shard the workload and use cheaper hardware (as in: 3090s from bankrupt Ethereum miners, connected by Infiniband cards from the same eBay, or rather darknet, vendor)...
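Spelling out that back-of-envelope (the 1:1 TPU-v4-hour-to-A100-hour equivalence and the ~$1.4/hr A100 rate are my assumptions, not quoted figures):

```python
# Back-of-envelope for the Muse training budget, using the paper's quoted figures
tpu_cores = 512                 # "512-core TPU-v4 chips"
wall_clock_hours = 7 * 24       # "about 1 week of training time"
tpu_v4_hours = tpu_cores * wall_clock_hours
print(tpu_v4_hours)             # 86016 -> the "~86,000 TPU-v4-hours" above

# Assumptions (mine): one TPU-v4-hour roughly equals one A100-hour, rented at ~$1.4/hr
a100_rate_usd = 1.4
print(tpu_v4_hours * a100_rate_usd)   # ~120k USD, the "$120K in compute" figure
```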

But before the censors even get to enforce such regulations, we'll get DeepFloyd IF, which already seems to be on par with Google's top models. It's a model developed by Alex Shonenkov, ex-head of image generation AI at Sberbank (yes, Sberbank was in the business of AI art); they are backed by Stability. It isn't trained on contemporary digital art, but, in the tradition of StableDiffusion, will trivially allow finetuning. In all likelihood, you'd need something like 8x3090, but that's about as hard to trace as a stealth weed growbox in a basement. Inference, I expect, also won't be feasible on normal consumer machines, so it'll incentivize some stealthy cloud computing, maybe very small-scale. Would be nice if the powers that be didn't succeed in crashing crypto, so we could build a p2p on-demand opensource AI inference and training economy.

All of the arithmetic above is only relevant to static images, and is therefore of limited use. What is more interesting is video+audio. What is vital is text, especially code; the best opensourced LLMs are gimmicks in comparison to GPT 3.5, and GPT 3.5 is still only barely useful in a professional context; we need something better to achieve escape velocity. There are lawsuits brewing in this domain too; the noose is tightening, the surveillance escape velocity approaches as well. I do hope that enough competent people realize the end result of the ongoing multipronged attack on ML-oriented hardware availability, unsupervised internet connectivity, gratuitous electricity expenditure, AI legality, and distributed secure ledgers before the timeline's stable state is determined. For now, it feels like pretty much everyone tech-savvy is still deriving status points from mocking cryptobros and pooh-poohing ChatGPT over its inability to write decent poetry or count to 10.

In all likelihood, you'd need something like 8x3090, but that's about as hard to trace as a stealth weed growbox in a basement. Inference, I expect, also won't be feasible on normal consumer machines, so it'll incentivize some stealthy cloud computing, maybe very small-scale.

I'll bet against that. It's supposed to be an Imagen-like model leveraging T5-XXL's encoder with a small series of 3 unets. Given that each unet is <1B, this is no worse than trying to run Muse-3B locally.

Well, I think Muse-3B won't run locally either.

How do you suppose T5-XXL's encoder is to be used, in practice? It's 5.5B, so 11GB in bf16. And StableDiffusion is 860M, but in practice it takes multiple GBs.

TLDR: it should be possible for any chump with 12GB of ordinary RAM, or some combination of offloaded RAM+vRAM that sums to 9GB, because running encoder-only is fast enough. Tests and stats mostly extrapolated from T5-3B because of personal hardware constraints (converting models costs much more memory than loading them)



To start, T5-XXL's encoder is actually 4.6B, not 5.5. I do not know why the parameters aren't evenly split between the encoder & decoder, but they aren't.

Additionally, it's likely that int8 quantisation will perform well enough for most users. load_in_8bit was recently patched to work with T5-like models, so that brings the memory requirements for loading the model down to ∼5GB.

What about vram spikes during inference? Well, unlike SD, the memory use of T5 is not going to blow significantly beyond what its parameter count would imply, assuming the prompts remain short. Running T5-3B from huggingface [0], I get small jumps of:

| dtype | vram to load | .encode(11 tokens) | .encode(75 tokens) |
|-|-|-|-|
| 3B-int8 | 3.6GB | 4.00GB | 4.35GB |
| 3B-bf16 | 6.78GB | | 7.16GB |

Note that the bump in memory for bf16 is smaller than for int8 because int8 does on-the-fly type promotion shenanigans.

Extrapolating these values to T5-XXL, we can expect bumps of (0.4∼0.8) * 11/3 = 1.5∼3GB of memory use for an int8 T5-XXL encoder, or <1.5GB for a bf16 encoder. We should also expect the model to take 10∼20% more vram to load than its parameter count would imply.

So, an ideal int8 T5-XXL encoder would take up to (4.6*1.15+3)GB, or slightly more than 8GB of vram during runtime. That still locks out a substantial number of SD users -- not to mention the 10xx series users who lack int8 tensor cores to begin with. Are they fucked, then?
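For clarity, the arithmetic behind that figure, using the numbers from the extrapolation above:

```python
# Estimated runtime vram for an int8 T5-XXL encoder, per the extrapolation above
encoder_params_b = 4.6     # encoder parameters in billions; ~1 byte each in int8
load_overhead = 1.15       # "10~20% more vram to load", taking roughly the midpoint
inference_bump_gb = 3.0    # upper end of the extrapolated .encode() spike
vram_gb = encoder_params_b * load_overhead + inference_bump_gb
print(vram_gb)             # ~8.3GB -> "slightly more than 8GB of vram during runtime"
```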


Short answer: no, we can get away with CPU inference via ONNX.

I first came across the idea below a Gwern comment: given that prompts are limited to 77 tokens, would it be possible to run the encoder on CPU in a reasonable amount of wall time? Say, <60s.

Huggingface's default settings are atrociously slow, so I installed the ONNX runtime for HF Optimum and built ONNX models for T5-3B [1]. Results:

| quantized? | model size on disk | python RAM after loading (encoder+decoder) | model.encoder(**input) duration | full seq2seq pass |
|-|-|-|-|-|
| no | 4.7+6.3GB | 17.5GB | 0.27s | 42s |
| yes | 1.3+1.7GB | 8.6GB | 0.37s | 28s |

I'm not sure whether I failed to use the encoder correctly here, considering how blazing fast the numbers I got were. Even if they're wrong, an encoder pass on T5-XXL is still likely to fall below 60s.

But regardless, the tougher problem here is RAM use. Assuming it is possible to load the text encoder standalone in 8bit (I have not done so here due to incompetency, but the model filesizes are indicative), the T5-XXL text encoder would still be too large for users with merely 8GB of RAM to use. An offloading scheme with DeepSpeed would probably only marginally help there.
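For what it's worth, here's a minimal sketch of what loading the encoder standalone could look like - transformers has a T5EncoderModel class that skips the decoder weights entirely, though whether load_in_8bit cooperates with it is exactly the part I haven't tested:

```python
from transformers import T5Tokenizer, T5EncoderModel

# Untested sketch: load only the encoder half, so decoder weights never hit memory.
# Whether load_in_8bit plays nicely with T5EncoderModel is an open question here.
tokenizer = T5Tokenizer.from_pretrained("t5-3b")
encoder = T5EncoderModel.from_pretrained("t5-3b", device_map="auto", load_in_8bit=True)

inputs = tokenizer("a watercolor painting of a fox", return_tensors="pt").to(encoder.device)
embeddings = encoder(**inputs).last_hidden_state  # the conditioning an Imagen-like model would consume
```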


[0] - example code to reproduce:


from transformers import T5Tokenizer, T5ForConditionalGeneration

PROMPT = "..."
model_name = "t5-3b"

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map='auto', low_cpu_mem_usage=True)  # add torch_dtype=torch.bfloat16 OR load_in_8bit=True here
inputs = tokenizer(PROMPT, return_tensors='pt')
output = model.encoder(**inputs)

[1] - example code for ONNX model creation:


from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_name = "t5-3b"
model_name_local = "./t5-3b-ort"
model_name_quantized = "./t5-3b-ort-quantized"

def create_ORT_base():
    # export the vanilla transformers checkpoint to ONNX and save it locally
    model = ORTModelForSeq2SeqLM.from_pretrained(model_name, from_transformers=True)
    model.save_pretrained(model_name_local)

def create_ORT_quantized():
    model = ORTModelForSeq2SeqLM.from_pretrained(model_name_local)
    model_dir = model.model_save_dir

    # the seq2seq export produces three ONNX graphs; each needs its own quantizer
    encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")
    decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")
    decoder_wp_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_with_past_model.onnx")
    quantizers = [encoder_quantizer, decoder_quantizer, decoder_wp_quantizer]

    # dynamic int8 quantization using AVX512-VNNI kernels
    dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
    for q in quantizers:
        q.quantize(save_dir=model_name_quantized, quantization_config=dqconfig)

create_ORT_base()
create_ORT_quantized()
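And, to close the loop on the table above, a sketch (my reconstruction, not necessarily how those timings were collected) of running just the encoder from the unquantized local export produced by create_ORT_base():

```python
from transformers import T5Tokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load the local ONNX export and run an encoder-only pass on CPU
tokenizer = T5Tokenizer.from_pretrained("t5-3b")
ort_model = ORTModelForSeq2SeqLM.from_pretrained("./t5-3b-ort")

inputs = tokenizer("a watercolor painting of a fox", return_tensors="pt")
encoder_out = ort_model.encoder(**inputs)     # the encoder-only pass timed in the table
embeddings = encoder_out.last_hidden_state
```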

I didn't have any good place to add this in my post, but it's worth noting that caching of text embeddings will help a lot with using T5-XXL. Workflows that involve large batch sizes/counts or repeated inpaintings on the same prompt do not need to keep the text encoder loaded permanently. Similar to the --lowvram mechanism implemented now, the text encoder can be loaded on demand, only when the prompt changes, saving memory costs.
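A minimal sketch of that caching idea (the function names and the load_encoder/encode callables are hypothetical stand-ins, not anything that exists in current UIs):

```python
# Hypothetical prompt-embedding cache: the text encoder only has to exist on a cache miss
_embedding_cache = {}

def get_text_embedding(prompt, load_encoder, encode):
    # load_encoder() -> encoder model, encode(encoder, prompt) -> embedding; both are stand-ins
    if prompt not in _embedding_cache:
        encoder = load_encoder()           # pay the load cost only when the prompt changes
        _embedding_cache[prompt] = encode(encoder, prompt)
        del encoder                        # drop the encoder before diffusion needs the memory
    return _embedding_cache[prompt]
```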