This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

Notes -
I was unaware of this article.
My attitude towards AI tools for the last 18 months had been "yeah, they're useful, but if you try to get too ambitious with them they waste more time than they save," and my reaction to AI 2027 was: ha, try AI 2035.
But something fundamentally changed with the models in the last month. I'm low-key freaking out at how goddamn useful they are now through Claude Code and OpenAI Codex.
Forget METR evals. My personal real-world evals are that they're 6/6 on doing 2-4-week-long tasks in 1-2 hours.
For what it's worth, working in a non-technical, non-coding-related field, my experience has been that some higher-ups are interested in the idea of AI and occasionally push a half-baked idea, which lower-level employees dutifully try for about two hours, conclude that it's useless, and then keep on doing things the old-fashioned way. I have yet to find any actual use-case for AI and continue to see it as a solution in search of a problem.
Maybe it's useful in some very specific, very narrow fields. Maybe coding is one of them. I'm not a coder so I don't know. But what my professional experience thus far tells me is that LLMs are good for producing large amounts of grammatically correct but turgid and unreadable bilge, and pretty much nothing else. If what you want is to mass-produce mediocre writing, well, that's what AI can do for you. If you want pretty much anything else, you're out of luck.
In a sense I think it's the ultimate 'wordcel' technology. It does symbol manipulation. It's good at translating one language into another, and apparently that includes translating natural-language instructions into computer code. But I remain skeptical as to its utility for much beyond that. It might be nice one day for someone to sit down and run through an explanation of how the heck this is supposed to get from language production and manipulation to, well, anything else.
Try using Claude Cowork, or the Codex app if you're on a Mac. Those programs are the bridge between "this is wordcel technology" and "this technology is going to change how we interact with computers forever".
Asking a bot would defeat the whole point of the exercise, for several reasons.
Calling LLMs “wordcel technology” is backwards in 2026.
You can paste in a screenshot of a math problem that 99%+ of adults would fail (calculus, linear algebra, probability, geometry) and it will solve it step by step, showing its work.
Not just arithmetic. Structured reasoning over formal systems. The same goes for logic puzzles, physics derivations, statistics problems.
They'll even teach it to you.
If your definition of wordcel now includes 'solves multistep math from an image and explains it', then we're just not going to agree on the term.
I disagree; LLMs remain pretty terrible at any task requiring strict precision, accuracy, and rigor, and from what I understand of the underlying mechanisms, this is unlikely to be resolved anytime soon.
Imagine the full range of legal opinions that exist on the internet, intelligent, retarded, and everything in between. Now imagine what the average of that mass of opinions would look like. That's effectively what you're getting when you ask an LLM for legal advice. Now for some traditionally wordcel-oriented tasks like "summarize this text" or "write an essay about ____" this is more than adequate, perhaps even excellent. But for an application requiring a clear and correct answer that isn't necessarily the average/default (i.e. the kind of thing a "shape-rotator" might be hired to calculate), they are worse than useless, because they give you something that looks plausible but may very well be completely wrong, and as such you will still have to take the time to work out the correct answer yourself, if only to verify it.
This just isn't a good model of how LLMs work. If it were doing some naive averaging of all the text it was trained on for a subject, shouldn't it randomly insert words in Spanish or Chinese? But it doesn't. If you ask an LLM whether it's a man or a woman (one without "as an AI language model" post-training), it doesn't present itself as the hermaphroditic average of the people described in its training set, it chooses one and at least tries to stick to its answer. Now, either way it's incorrect, obviously, but it's clearly not an average; a mode, perhaps. But it doesn't just naively take the mode either: If you ask it whether Harry Potter is a real person it will correctly tell you he's fictional, despite the overwhelming majority of the text concerning Harry Potter -- How many billions of words of Harry Potter fanfiction are there? -- treating him as real.
A lot of people argue that LLMs are incapable of understanding context or judging the quality of sources, but that's just... obviously untrue? Ask Gemini whether magic is real, and it'll tell you about sleight of hand and historical beliefs about witchcraft, but conclude the answer is very likely 'no.' Ask it what the spell Create or Destroy Water does and it'll quote the 5th edition rulebook. It understands what was meant by each question perfectly. And it does understand: respond to the second with 'But magic isn't real, right?' and it'll explain the implied category error as well as you could wish.
It's not that it doesn't learn the incorrect ideas in its training set -- tell it to emulate a Yahoo Answers poster and it can do so -- it just also learns contextual information about those ideas (such as that they're false) much as we do. Tell it you want a good answer (which is largely what post-training does) and it'll know to discount those sources. It doesn't do so perfectly, but the notion they lack the capacity altogether is not credible.
Regarding @dr_analog's point:
This is true so far as I know; did you actually try it? LLMs are bad at tasks requiring strict precision, accuracy, and rigor that can't be objectively and automatically judged. There's a huge disconnect between performance on math/coding, where it's trivial to generate good/bad responses for DPO and similar post-training, and subjects like law, where it isn't. @dr_analog is right: LLMs are currently much better at math/coding specifically than they are at essay writing, purely due to the ease of generating high-quality synthetic data.
I actually see a fair bit of Chinese in longer conversations - not enough to make it unreadable, but enough for me to notice.
Take a look at the attached image. That's about a week old. Once you've looked at it, go look up that ticker. (Thanks to @ToaKraka for pointing out the image feature, BTW). That one was a pretty big shock to me from Gemini 3 fast. It doesn't do it every time, but it's done it more than once for that exact ticker.
/images/17711967195902364.webp
Huh, are you giving it any Chinese characters in the prompt? Which model(s)? I think I've seen this from a commercial model exactly once (Gemini 2 Pro), when I was asking some pretty in-the-weeds questions about Shinto and Japanese Buddhism and it gave me quotes in Japanese without translating them, and even there, its own words were in English. The Deepseek R1 paper mentions language confusion in reasoning blocks was a problem before post-training, but I never encountered it with the final model. I have seen it from some small open weights models, but they're kind of dumb all around.
Yeah, that doesn't shock me. Not quite the case I meant. The reason code specifically is special is that they can use this process: generate candidate solutions, check them automatically (run the tests, run the linter), and post-train on the pass/fail signal.
Which works very well. The reason normal prose hasn't seen nearly as much improvement is that judging prose takes skilled human labor to do well, and these huge models are so data-hungry it's just not feasible to get enough of it. (I also suspect a lot of these companies like their models bland and obsequious -- customer support scripts have the same qualities, and those at least were written by real people.) So you only really see these big gains for code and math (for which a similar process can be developed).
This specific example is kind of borderline. It's a dynamic table, right? Something the model made up to answer your prompt? While it got things objectively wrong in a manner that's in principle possible to automatically check, setting up automatic checking for any claim of fact is not as easy as running pylint, which really will catch any syntax error. I imagine they do try to DPO for cases like this, but it's a lot harder.
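A minimal sketch of the kind of automatic-verification loop being described, with all names hypothetical: sample several completions, run a cheap objective check (here just "does the Python source compile?" -- a real pipeline would run full unit tests), and split them into accepted/rejected pools for preference-style post-training.

```python
def passes_check(code: str) -> bool:
    """'Verify' a candidate completion automatically. Here that just means
    the Python source compiles; a real pipeline would run unit tests."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def build_preference_pairs(candidates):
    """Split sampled completions into accepted/rejected pools -- the raw
    material for DPO-style preference post-training."""
    accepted = [c for c in candidates if passes_check(c)]
    rejected = [c for c in candidates if not passes_check(c)]
    return accepted, rejected

# Two sampled completions for the same prompt; the second is missing a colon.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b)\n    return a + b\n",
]
accepted, rejected = build_preference_pairs(samples)
```

The point of the sketch is that the judge is free and objective for code, which is exactly what makes the disconnect with prose or law so large.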
Models are prone to just making stupid errors occasionally on even the most basic tasks, and I don't know if we're going to be able to find a real solution to that. Something that does help (and is often used on benchmarks) is taking the consensus result of several runs, but that massively inflates inference costs for a relatively small reduction in error rate. It does seem to be a hard problem, in that it's only gotten a bit better over the past year or so. (There was more improvement in 2024, which I take as a bad sign; they've already tried the easy stuff.)
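The consensus trick (often called self-consistency or majority voting) can be sketched like this; `ask_model` is a hypothetical stand-in for whatever model call you're using, and the inference cost multiplies by the number of runs:

```python
from collections import Counter

def consensus_answer(ask_model, prompt: str, runs: int = 5) -> str:
    """Query the model several times and keep the most common answer.
    Error rate drops, but inference cost scales linearly with `runs`."""
    answers = [ask_model(prompt) for _ in range(runs)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a model that answers correctly 3 times out of 5.
replies = iter(["42", "41", "42", "42", "40"])
result = consensus_answer(lambda prompt: next(replies), "What is 6*7?")
```

This only helps when errors are uncorrelated across runs; if the model makes the same mistake every time, voting just confirms it.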
I'm not giving Chinese characters in the prompts. I don't speak a lick of Chinese. I've seen it in Gemini 3 fast, thinking, and pro. Usually it's for questions about electronics, though it's come up for questions about music theory as well.
That still fits my experience with them. I have spent some time mucking about with them, and whenever I ask an LLM about something I know, it is frequently confidently, even hilariously, wrong. It is not aware of any difference between truth and falsehood and will freely mix them together. I want to avoid some kind of AI Gell-Mann amnesia: when I ask it questions I know the answer to, it consistently prioritises producing something that looks like a confident, helpful, well-written answer, in total agnosticism as to whether or not that answer is true. It surely does the same thing with questions I don't know the answer to. The only sensible course of action is to assign zero credence to anything an LLM says. What it says might be true. Or it might not be. The LLM's word is worth nothing.
Same here. My major use of AI, if I used it, would be writing emails, but I can write my own emails. My boss does seem to use it for that, but I don't get any of the AI emails from them, so I don't know where they're sending them.
If I knew enough about AI to use it in other work, I might be more impressed. But right now, what I'm seeing are the chatbots used to replace customer service agents on business websites, which are absolutely useless when I try to ask them to solve my queries. So I remain unimpressed.
The coding stuff sounds like where all the progress is happening, but like you I'm nowhere near writing software or using it. Maybe in a little while I'll see a use for it in clerical work, but right now I don't trust the answers provided by the AI (admittedly only free online models such as Copilot) to be accurate or reliable.
Not an AI simp (but to be clear of my bias, I find AI both fun to use and quite useful at a handful of scoped tasks both in my work and personal lives, but it has many limitations).
Copilot is beyond trash. Copilot is the worst AI product I've ever interacted with. No conclusions on AI performance should ever be drawn from Copilot, as it is abjectly awful.
I hate it. Windows keeps trying to shove it in my face at work and I resolutely refuse to use it. I know you get what you pay for.
At the moment, I don't see a use for AI, but that's mostly because what I see about it is (1) it's helping/replacing writing software (2) it's helping/going to replace lawyers and doctors and neither of those are what I work at. I'd love a version I could turn loose on answering stupid emails, but right now I have to answer those myself since it's dealing with funding bodies/government departments and unless I can be sure the AI wouldn't hallucinate some answer to get us all in trouble, it's just not a tool that is useful for me.
In my personal life, again, I don't see a use for it. "it'll write recipes for you! plan your grocery shopping! research holidays!" Well, I don't go on holidays, I cook very basic things, and I prefer doing my own shopping. While I may badly need psychotherapy of some kind, I'll be hanged before I try talking to a bot about my personal life.
I'm skeptical, but you may be right. Look at the market over the last week.
But even the market isn’t really where you are.