site banner

Culture War Roundup for the week of January 16, 2023

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

13
Jump in the discussion.

No email address required.

TLDR: it should be possible for any chump with 12GB of ordinary RAM, or some combination of offloaded RAM+vRAM that sums to 9GB, because running encoder-only is fast enough. Tests and stats mostly extrapolated from T5-3B because of personal hardware constraints (converting models costs much more memory than loading them)

There are markdown tables in this comment that do not display correctly on the site, despite appearing correctly on the comment preview. You may wish to paste the source for this comment into a markdown preview site.


To start, T5-XXL's encoder is actually 4.6B, not 5.5. I do not know why the parameters aren't evenly split between the encoder & decoder, but they aren't.

Additionally, it's likely that int8 quantisation will perform well enough for most users. load_in_8bit was recently patched to work with T5-like models, so that brings the memory requirements for loading the model down to ∼5GB.

What about vram spikes during inference? Well, unlike SD, the memory use of T5 is not going to blow significantly beyond what its parameter count would imply, assuming the prompts remain short. Running T5-3B from huggingface [0], I get small jumps of:

| dtype | vram to load | .encode(11 tokens) | .encode(75 tokens) |

|-|-|-|-|

| 3B-int8 | 3.6GB | 4.00GB | 4.35GB |

| 3B-bf16 | 6.78GB | | 7.16GB |

Note that the bump in memory for bf16 is smaller than int8 because int8 does on-the-fly type promotion shenangians.

Extrapolating these values to T5-XXL, we can expect bumps of (0.4∼0.8) * 11/3 = 1.5∼3GB of memory use for an int8 T5-XXL encoder, or <1.5GB for a bf16 encoder. We should also expect the model to take 10∼20% extra vram to load than what its parameters should imply.

So, an ideal int8 T5-XXL encoder would take up to (4.6*1.15+3)GB, or slightly more than 8GB of vram during runtime. That still locks out a substantial number of SD users -- not to mention the 10xx series users who lack int8 tensor cores to begin with. Are they fucked, then?


Short answer: no, we can get away with CPU inference via ONNX.

I first came across the idea below a Gwern comment. Given that prompts are limited to 77 tokens, would it be possible to run the encoder in a reasonable amount of wall time? Say, <60s.

Huggingface's default settings are atrociously slow, so I installed the ONNX runtime for HF Optimum and built ONNX models for T5-3B [1]. Results:

| quantized? | model size on disk | python RAM after loading (encoder+decoder) | model.encoder(**input) duration | full seq2seq pass |

|-|-|-|-|-|

| no | 4.7+6.3GB | 17.5GB | 0.27s | 42s |

| yes | 1.3+1.7GB | 8.6GB | 0.37s | 28s |

I'm not sure whether I failed to use the encoder correctly here, considering how blazing fast the numbers I got were. Even if they're wrong, an encoder pass on T5-XXL is still likely to fall below 60s.

But regardless, the tougher problem here is RAM use. Assuming it is possible to load the text encoder standalone in 8bit (I have not done so here due to incompetency, but the model filesizes are indicative), the T5-XXL text encoder would still be too large for users with merely 8GB of RAM to use. An offloading scheme with DeepSpeed would probably only marginally help there.


[0] - example code to reproduce:


PROMPT = "..."

model = T5ForConditionalGeneration.from_pretrained(model_name, device_map='auto', low_cpu_mem_usage=True, ...)#add torch_dtype=torch.bfloat16 OR load_in_8bit=True here

inputs = tokenizer(PROMPT, return_tensors='pt')

output = model.encoder(**inputs)

[1] - example code for ONNX model creation:


model_name = "t5-3b"

model_name_local = "./t5-3b-ort"

model_name_quantized = "./t5-3b-ort-quantized"


def create_ORT_base():

    model = ORTModelForSeq2SeqLM.from_pretrained(model_name, from_transformers=True)

    model.save_pretrained(model_name_local)


def create_ORT_quantized():

    model = ORTModelForSeq2SeqLM.from_pretrained(model_name_local)

    model_dir = model.model_save_dir

    #

    encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")

    decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")

    decoder_wp_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_with_past_model.onnx")

    quantizer = [encoder_quantizer, decoder_quantizer, decoder_wp_quantizer]

    #

    dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

    for q in quantizer:

        q.quantize(save_dir=model_name_quantized,quantization_config=dqconfig)

I didn't have any good place to add this in my post, but it's worth noting that caching of text embeddings will help a lot with using T5-XXL. Workflows that involve large batch sizes/counts || repeated inpaintings on the same prompt do not need to keep the text encoder loaded permanently. Similar to the --lowvram mechanism implemented now, the text encoder can be loaded on demand, only when the prompt changes, saving memory costs.