Culture War Roundup for the week of March 10, 2025

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

Moderately interesting news in AI image gen:

It's been a good while since we've had AI chat assistants able to generate images on user request. Unfortunately, for about as long, we've had people being peeved at the disconnect between what they asked for and what they actually got. Particularly annoying was the assistants' tendency to claim they had generated what you wanted, or had edited an image as requested, without actually having done so.

This was an unfortunate consequence of the LLM (the assistant persona you speak to) and the actual image generator (the thing that spits out images from prompts) being two entirely separate entities. The LLM doesn't have any more control over the image model than you do when running something like Midjourney or Stable Diffusion. It sends a prompt through a function call, gets an image in response, and then tries to modify the prompt to meet the user's needs. Depending on how lazy the devs are, it might not even be 'looking' at the final output at all.
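To make the plumbing concrete, here's a toy sketch of that two-model setup. Everything in it is made up for illustration (`image_model`, `assistant_turn`, and the `Image` class are hypothetical stand-ins, not any vendor's actual API); the point is only that the assistant's reply never depends on the pixels that came back.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Image:
    # Our stand-in "image" only remembers the prompt that produced it.
    prompt: str


def image_model(prompt: str) -> Image:
    """Hypothetical stand-in for a separate text-to-image model."""
    return Image(prompt=prompt)


def assistant_turn(user_request: str,
                   previous_prompt: Optional[str] = None) -> Tuple[str, Image]:
    """One chat turn: the LLM only writes a text prompt and calls out."""
    if previous_prompt is None:
        prompt = user_request
    else:
        # An "edit" is just a rewritten prompt and a fresh roll of the dice.
        prompt = f"{previous_prompt}, {user_request}"
    image = image_model(prompt)
    # The reply is written without ever inspecting `image`.
    reply = "Here is the image you asked for."
    return reply, image


reply, first = assistant_turn("a red cat")
reply, second = assistant_turn("make it blue", previous_prompt=first.prompt)
```

Note that `second` is a brand-new image built from a longer prompt, not a modified copy of `first`, which is exactly why edits used to come back looking like a different picture.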

The image models, on the other hand, have a fundamentally different architecture. They are usually diffusion-based (Google a better explanation, but the gist is that they iteratively refine a sample of random noise until it resembles the desired image), whereas LLMs use the Transformer architecture. The image models do have some understanding of semantics, but they're far stupider than LLMs when it comes to understanding finer meaning in prompts.
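If the "start from noise and refine" gist is too hand-wavy, here's a deliberately cartoonish sketch of the loop. It is not a real diffusion model: a real one uses a trained network, conditioned on the prompt, to predict what to remove at each step, whereas this toy (the `toy_denoise` name and the three-number "image" are assumptions for illustration) cheats by already knowing the target.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Walk a list of random numbers toward `target` in small steps."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap each step, so the last step lands
        # on the target. A real model has to *predict* this direction.
        x = [xi + (ti - xi) / remaining for xi, ti in zip(x, target)]
    return x


picture = toy_denoise([0.2, 0.8, 0.5])
```

The only thing worth taking from this is the shape of the process: many small refinement passes over the whole image at once, rather than emitting it piece by piece the way an LLM emits tokens.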

This has now changed.

Almost half a year back, OpenAI teased the ability of their then unreleased GPT-4o to generate images natively. It was the LLM (more of a misnomer now than ever) actually making the image, in the same manner it could output text or audio.

The LLM doesn’t just “talk” to the image generator - it is the image generator, processing everything as tokens, much like it handles text or audio.

Unfortunately, we've had nothing but radio silence since then, barring a few leaks of front-end code suggesting OAI would finally switch from DALL-E 3 to GPT-4o for image generation, as well as Altman's assurances that they hadn't canned the project on the grounds of safety.

Unfortunately for him, Google has beaten them to the punch. Gemini 2.0 Flash Experimental (don't ask) has now been blessed with the ability to directly generate images. I'm not sure if this has rolled out to the consumer Gemini app, but it's readily accessible on their developer preview.

First impressions: It's good.

You can generate an image, and then ask it to edit a feature. It will then edit the original image and present the version modified to your taste, unlike all other competitors, who would basically just re-prompt and hope for better luck on the second roll.

Image generation just got way better, at least in the realm of semantic understanding. Most of the usual giveaways of AI-generated imagery, such as butchered text, are largely solved. It isn't perfect, but you're looking at a failure rate of 5-10% as opposed to >80% when using DALL-E or Flux. It doesn't beat Midjourney on aesthetics, but we'll get there.

You can imagine the scope for chicanery, especially if you're looking to generate images with large amounts of verbiage or numbers involved. I'd expect the usual censoring in consumer applications, especially since the LLM has finer control over things. But it certainly massively expands the mundane utility of image generation, and is something I've been looking forward to ever since I saw the capabilities demoed.

Flash 2.0 Experimental is also a model that's dirt cheap on the API, and while image gen definitely burns more tokens, it's a trivial expense. I'd strongly expect Google to make this free just to steal OAI's thunder.

unlike all other competitors, who would basically just re-prompt and hope for better luck on the second roll.

That claim is just flat out false. Inpainting only specific areas of the original image (even from text description) has been in use for multiple years now (there's even an extension for that for the open source AUTOMATIC1111 Stable Diffusion webui). Only complete novices rely on rerolling.
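For anyone who hasn't used it, the core idea of inpainting is that a mask decides which pixels are allowed to change. A minimal sketch of just that masking step, with tiny grids of numbers standing in for images (the `composite` helper is hypothetical, and this is not the AUTOMATIC1111 API; real tools also run the diffusion model itself under the mask rather than only pasting at the end):

```python
def composite(original, generated, mask):
    """Take generated pixels where the mask is True, original elsewhere."""
    return [
        [g if m else o for o, g, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]


original = [[1, 1], [1, 1]]
generated = [[9, 9], [9, 9]]
mask = [[False, True], [False, False]]  # only the top-right pixel may change

result = composite(original, generated, mask)
```

Everything outside the mask is carried over from the original image untouched, which is the difference between this and rerolling the whole picture.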

I think you're being blinded by your single-minded enthusiasm for LLMs, massively overestimating their capabilities, and ignoring the wider state of the field.

The LLM doesn’t just “talk” to the image generator - it is the image generator, processing everything as tokens, much like it handles text or audio.

The LLM is still talking to the image generator. It just does so using native tokens and vectors instead of going through a text encoder layer in-between.

The LLM is still talking to the image generator. It just does so using native tokens and vectors instead of going through a text encoder layer in-between.

This is very confusingly stated. The second sentence is correct, but the first one is confusing: it doesn't make sense to say that the LLM is talking to the image generator, because the LLM and the image generator are literally the same thing.

The claim that has no proof (beyond marketing speak) is that they are the same thing. I don't believe the claim, and the evidence doesn't show anything that supports it over the alternative: the LLM just skipping the text encoder and talking directly to the actual image generator in its native format.