
Culture War Roundup for the week of February 9, 2026

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


In the latest update on AI slop, Ars Technica, a once reputable publication of over 25 years, has accidentally published a fake AI-written article, complete with fake quotes. Unlike the fake story shared by Nate Silver earlier, which was published on a grifter's glorified blog and somehow syndicated into Yahoo News, this story was actually published by a "real" media company under its own label. To be fair, the Ars article bears few of the obvious hallmarks of AI writing, and it also gets a passing score from most AI detectors. I suspect the authors may have lazily asked AI to create a point-by-point skeleton for the article, then written the words that appeared on the page themselves (excluding the hallucinated fake quotes, of course). Fortunately, the article was taken down quickly, but the editors have so far refused to disclaim the use of AI, and instead are hiding behind the misquotes as a reason to take the article down. It remains to be seen whether the use of AI slop was actually a rogue writer violating the policy, or someone using AI as directed by management but skimping on the part where you check its answers.

In other news, Malwarebytes has joined the ranks of Cloudflare and Lenovo as multi-billion dollar multinational corporations that each decided it was necessary to publish a library of absolutely worthless AI slop, masquerading without disclosure as legitimate content. These zero effort AI takes are ... well ... zero effort, and provide zero added value to society by being published. I have no idea if Malwarebytes is a good company, but it's certainly a real company, with offices around the globe and enterprise contracts with many Fortune 500 companies. These are all companies with sales and marketing teams in the dozens or hundreds of people, and likely multiple layers of approval to do anything new, yet they decided that zero effort AI slop takes are perfectly in line with their brand and reputation. There are clearly some incentives (likely mostly SEO) for real companies to publish loads and loads of fake content on their websites, tangentially or not at all related to their actual business, which is extremely unfortunate because it's a waste of time for anyone who happens across this fake content, and even a waste of time for the slopmeister who has to click the button to generate 10 million words of fake content.

I'm going to piggy back on this with two things I've seen in the last week.

The first is highly personal. My employer does annual security training, with a focus on phishing attacks. The training this year used AI-generated video that was really off-putting. The actors were "realistic", but there was an uncanny wax-like quality to their skin, and their movements weren't quite correct for a human baseline. Almost everyone on my team noticed it, and it casually came up in a meeting that my boss's boss was attending. The first words out of his mouth after that were "wait, there was AI?". We all sat there silently for a few seconds. It was clear that he absolutely did not perceive that the content was AI-generated. Despite the odd, inhuman quality, he didn't even peg it as animated. It made me wonder if there's some fundamental disconnect between my brain and the brains of upper management that makes the technology entirely different for them. As a model-train American, I can't discount it, but goddamn was it weird to see in action.


The second is Something Big Is Happening, the viral post that has been storming through the pro and anti AI ranks for a few days now.

The piece itself is a tour de force demonstration of how to stoke fear and uncertainty. It essentially outlines a maximal view of the AI Jobpocalypse that many fear, written with the flat certainty of a native LinkedIn citizen.

This is different from every previous wave of automation, and I need you to understand why. AI isn't replacing one specific skill. It's a general substitute for cognitive work. It gets better at everything simultaneously. When factories automated, a displaced worker could retrain as an office worker. When the internet disrupted retail, workers moved into logistics or services. But AI doesn't leave a convenient gap to move into. Whatever you retrain for, it's improving at that too.

I think the honest answer is that nothing that can be done on a computer is safe in the medium term. If your job happens on a screen (if the core of what you do is reading, writing, analyzing, deciding, communicating through a keyboard) then AI is coming for significant parts of it. The timeline isn't "someday." It's already started.

Clearly, the only solution to being obsoleted by AI is to use as much AI as possible in the meantime, as curated by the author.

Start using AI seriously, not just as a search engine. Sign up for the paid version of Claude or ChatGPT. It's $20 a month. But two things matter right away. First: make sure you're using the best model available, not just the default. These apps often default to a faster, dumber model. Dig into the settings or the model picker and select the most capable option. Right now that's GPT-5.2 on ChatGPT or Claude Opus 4.6 on Claude, but it changes every couple of months. If you want to stay current on which model is best at any given time, you can follow me on X (@mattshumer_)

This is interesting to me for a couple of reasons. For one, it's gone pretty viral - 80 million views is a lot, and I don't know if this guy caught the zeitgeist in the way he intended. It seems like he was trying to stoke fear, but especially among my younger acquaintances, it seems like more than anything he's managed to stoke anger - in a "wood and nails are cheap, AI can't build crucifixes and you don't have functioning murder drones yet" kind of way.

The second reason that it caught my attention is because the name tickled something in the back of my mind, and I didn't want to post about it until I could figure out what it was. I found the answer this morning.

I thought that name looked familiar

Based on independent tests run by Artificial Analysis, the model fails to deliver on the promises made by Matt Shumer, CEO of OthersideAI and HypeWrite, the company behind Reflection 70B. Shumer, who initially attributed the discrepancies to an issue with the model’s upload process, has since admitted that he may have gotten ahead of himself in the claims he had made.

But critics in the AI research community have gone as far as accusing Shumer of fraud, stating that the model is just a thin wrapper based on Anthropic’s Claude, rather than a tuned-up version of Meta Llama.

I'm pretty conflicted on all of this. It sure seems like the technology has real potential and real applications, but by God does it feel like every single person involved is a sociopathic narcissist who gets off on conning the rubes.

I was unaware of this article.

My attitude towards AI tools for the last 18 months had been "yeah they're useful but if you try to get too ambitious with them they waste more time than they save" and I was like AI 2027? Ha, try AI 2035.

But something fundamentally changed with the models in the last month. I'm low key freaking out at how goddamn useful they are now through Claude Code and OpenAI Codex.

Forget METR evals. My personal real world evals are that they're 6/6 on doing 2-4 week long tasks in 1-2 hours.

For what it's worth, working in a non-technical, non-coding-related field, my experience has been that some higher-ups are interested in the idea of AI and occasionally push a half-baked idea, which lower-level employees dutifully try for about two hours, conclude that it's useless, and then keep on doing things the old-fashioned way. I have yet to find any actual use-case for AI and continue to see it as a solution in search of a problem.

Maybe it's useful in some very specific, very narrow fields. Maybe coding is one of them. I'm not a coder so I don't know. But what my professional experience thus far tells me is that LLMs are good for producing large amounts of grammatically correct but turgid and unreadable bilge, and pretty much nothing else. If what you want is to mass-produce mediocre writing, well, that's what AI can do for you. If you want pretty much anything else, you're out of luck.

In a sense I think it's the ultimate 'wordcel' technology. It does symbol manipulation. It's good at translating one language into another, and apparently that includes translating natural language instructions into computer code. But I remain skeptical as to its utility for much beyond that. It might be nice one day for someone to sit down and run through an explanation of how the heck this is supposed to get from language production and manipulation to, well, anything else.

Calling LLMs “wordcel technology” is backwards in 2026.

You can paste in a screenshot of a math problem that 99%+ of adults would fail, calculus, linear algebra, probability, geometry and it will solve it step by step, showing its work.

Not just arithmetic. Structured reasoning over formal systems. The same goes for logic puzzles, physics derivations, statistics problems.

They'll even teach it to you.

If your definition of wordcel now includes ‘solves multistep math from an image and explains it', then we're just not going to agree on the term.

Calling LLMs “wordcel technology” is backwards in 2026.

I disagree, LLMs remain pretty terrible at any task requiring strict precision, accuracy, and rigor. And from what I understand of the underlying mechanisms this is unlikely to be resolved anytime soon.

Imagine the full range of legal opinions that exist on the internet, intelligent, retarded, and everything in between. Now imagine what the average of that mass of opinions would look like. That's effectively what you're getting when you ask an LLM for legal advice. Now, for some traditionally wordcel-oriented tasks like "summarize this text" or "write an essay about ____", this is more than adequate, perhaps even excellent. But for an application requiring a clear and correct answer that isn't necessarily the average/default (i.e. the kind of thing a "shape-rotator" might be hired to calculate), they are worse than useless, because they give you something that looks plausible but may very well be completely wrong, and as such you will still have to take the time to work out the correct answer yourself, if only to verify it.

Imagine the full range of legal opinions that exist on the internet, intelligent, retarded, and everything in between. Now imagine what the average of that mass of opinions would look like. That's effectively what you're getting when you ask an LLM for legal advice.

This just isn't a good model of how LLMs work. If it were doing some naive averaging of all the text it was trained on for a subject, shouldn't it randomly insert words in Spanish or Chinese? But it doesn't. If you ask an LLM whether it's a man or a woman (one without "as an AI language model" post-training), it doesn't present itself as the hermaphroditic average of the people described in its training set, it chooses one and at least tries to stick to its answer. Now, either way it's incorrect, obviously, but it's clearly not an average; a mode, perhaps. But it doesn't just naively take the mode either: If you ask it whether Harry Potter is a real person it will correctly tell you he's fictional, despite the overwhelming majority of the text concerning Harry Potter -- How many billions of words of Harry Potter fanfiction are there? -- treating him as real.

A lot of people argue that LLMs are incapable of understanding context or judging the quality of sources, but that's just... obviously untrue? Ask Gemini whether magic is real, and it'll tell you about sleight of hand and historical beliefs about witchcraft, but conclude the answer is very likely 'no.' Ask it what the spell Create or Destroy Water does and it'll quote the 5th edition rulebook. It understands what was meant by each question perfectly. And it does understand: respond to the second with 'But magic isn't real, right?' and it'll explain the implied category error as well as you could wish.

It's not that it doesn't learn the incorrect ideas in its training set -- tell it to emulate a Yahoo Answers poster and it can do so -- it just also learns contextual information about those ideas (such as that they're false) much as we do. Tell it you want a good answer (which is largely what post-training does) and it'll know to discount those sources. It doesn't do so perfectly, but the notion they lack the capacity altogether is not credible.

Regarding @dr_analog's point:

You can paste in a screenshot of a math problem that 99%+ of adults would fail, calculus, linear algebra, probability, geometry and it will solve it step by step, showing its work.

This is true so far as I know; did you actually try it? LLMs are bad at tasks requiring strict precision, accuracy and rigor that can't be objectively and automatically judged. There's a huge disconnect between performance on math/coding, where it's trivial to generate good/bad responses for DPO etc. post-training, and subjects like law, where it isn't. @dr_analog is right: LLMs are currently much better at exactly math/coding than they are at essay writing, purely due to the ease of generating high-quality synthetic data.

shouldn't it randomly insert words in Spanish or Chinese?

I actually see a fair bit of Chinese in longer conversations - not enough to make it unreadable, but enough for me to notice.

LLMs are bad at tasks requiring strict precision, accuracy and rigor that can't be objectively and automatically judged.

Take a look at the attached image. That's about a week old. Once you've looked at it, go look up that ticker. (Thanks to @ToaKraka for pointing out the image feature, BTW). That one was a pretty big shock to me from Gemini 3 fast. It doesn't do it every time, but it's done it more than once for that exact ticker.

/images/17711967195902364.webp

I actually see a fair bit of Chinese in longer conversations - not enough to make it unreadable, but enough for me to notice.

Huh, are you giving it any Chinese characters in the prompt? Which model(s)? I think I've seen this from a commercial model exactly once (Gemini 2 Pro), when I was asking some pretty in-the-weeds questions about Shinto and Japanese Buddhism and it gave me quotes in Japanese without translating them, and even there, its own words were in English. The Deepseek R1 paper mentions language confusion in reasoning blocks was a problem before post-training, but I never encountered it with the final model. I have seen it from some small open weights models, but they're kind of dumb all around.

Take a look at the attached image. That's about a week old. Once you've looked at it, go look up that ticker. (Thanks to @ToaKraka for pointing out the image feature, BTW). That one was a pretty big shock to me from Gemini 3 fast. It doesn't do it every time, but it's done it more than once for that exact ticker.

Yeah, that doesn't shock me. Not quite the case I meant. The reason code specifically is special is that they can use this process:

  1. Get a bunch of function docstrings and testing code for those functions. This sounds like a lot of work, but if you're Google, I imagine you already have a lot of well-documented, well-tested code. (If you're not Google, you can try scraping Github, though pruning low quality data would be a pain.) Not a lot of it is self-contained, but you can just include documentation or source for everything called by your existing implementation in the context.
  2. Give the model the docstring for the target function and the other documentation/source but not the original function or its testing code, then have it try to write the target function from scratch some huge number of times.
  3. For each attempt, if the code it provides compiles, meets your style guidelines, and passes all tests, mark it as 'good,' and otherwise as 'bad.'
  4. Give it the same input, but ask it to write the tests. If the tests it gives you compile and meet the style guidelines, confirm that exactly the same implementations pass all the tests as for the known-good set of tests. If so, mark this generation as 'good' and otherwise as 'bad.'
  5. Now that you have a large set of good and bad responses for both code and tests for that code, you can use that for DPO (or GRPO or whatever), which trains the model to be more likely to produce good responses and less likely to produce bad ones.

Which works very well. The reason normal prose hasn't seen nearly as much improvement is that judging prose takes skilled human labor to do well, and these huge models are so data-hungry it's just not feasible to get enough of it. (I also suspect a lot of these companies like their models bland and obsequious -- customer support scripts have the same qualities, and those at least were written by real people.) So you only really see these big gains for code and math (for which a similar process can be developed).

This specific example is kind of borderline. It's a dynamic table, right? Something the model made up to answer your prompt? While it got things objectively wrong in a manner that's in principle possible to automatically check, setting up automatic checking for any claim of fact is not as easy as running pylint, which really will catch any syntax error. I imagine they do try to DPO for cases like this, but it's a lot harder.

Models are prone to just making stupid errors occasionally on even the most basic tasks, and I don't know if we're going to be able to find a real solution to that. Something that does help (and is often used on benchmarks) is taking the consensus result of several runs, but that massively inflates inference costs for a relatively small reduction in error rate. It does seem to be a hard problem, in that it's only gotten a bit better over the past year or so. (There was more improvement in 2024, which I take as a bad sign; they've already tried the easy stuff.)
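The "consensus of several runs" trick is just majority voting over final answers. A minimal sketch, assuming a hypothetical `sample` callable that stands in for one full model run and returns its final answer as a string (here it's replaced by canned answers so the example is deterministic):

```python
from collections import Counter


def consensus_answer(sample, n_runs):
    """Call `sample` n_runs times and return (most common answer, agreement rate)."""
    answers = [sample() for _ in range(n_runs)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_runs


# Canned answers standing in for five independent model runs.
canned = iter(["42", "42", "41", "42", "24"])
answer, agreement = consensus_answer(lambda: next(canned), n_runs=5)
```

Note that `sample` gets called `n_runs` times, so inference cost scales linearly with the number of votes. That's the cost inflation I mean.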