Culture War Roundup for the week of February 23, 2026

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


See For Yourself: A Live Demo of LLM capabilities

As someone concerned with AI Safety or the implications of cognitive automation for human employability since well before it was cool, I must admit to a sense of vindication from seeing AI dominate online discourse, including on the Motte.

We have a wide range of views on LLM capabilities (at present) as well as on their future trajectory. Opinions are heterogeneous enough that any attempt at taxonomy will fail to capture individual nuance, but as I see it:

  • LLM Bulls: Current SOTA LLMs are capable of replacing a sizeable portion of human knowledge work. Near-future models or future architectures promise AGI, then ASI in short order. The world won't know what hit it.

  • LLM moderates: Current SOTA models are useful, but incapable of replacing even mid-level devs without negative repercussions on work quality or code performance/viability. They do not fully substitute for the labor of the average professional programmer in the West. This may or may not be achieved in the near future. AGI is uncertain, ASI is less likely.

  • LLM skeptics: Current SOTA models are grossly overhyped. They are grossly incompetent at the majority of programming tasks and shouldn't be used for anything more than boilerplate, if that. AGI is unlikely in the near-term, ASI is a pipedream.

  • Gary Marcus, the dearly departed Hlynka. Opinions not worth discussing.

Then there's the question of whether LLMs or recognizable derivatives are capable of becoming AGI/ASI, or if we need to make significant discoveries in terms of new architectures and/or training pipelines (new paradigms). Fortunately, that isn't relevant right now.


Alternatively, according to Claude:

The Displacement Imminent camp thinks current models already threaten mid-level knowledge work, and the curve is steep enough that AGI is a near-term planning assumption, not a thought experiment.

The Instrumental Optimist thinks current models are genuinely useful in a supervised workflow, trajectory is positive but uncertain, AGI is possible but not imminent. This is probably the modal position among working engineers who actually use these tools.

The Tool Not Agent camp thinks current models are genuinely useful as sophisticated autocomplete or search, but the "agent" framing is mostly hype — they fail badly without tight human scaffolding, and trajectory is uncertain enough that AGI is not worth pricing in.

The Stochastic Parrot camp (your skeptics, minus the pejorative) thinks the capabilities are brittle, benchmark gaming is rampant, and real-world coding performance is far below reported evals. They're often specifically focused on the unsupervised case and the question of whether the outputs are actually understood vs. pattern-matched.

The dimension you might also want to add explicitly is who bears the cost of the failure modes — because a lot of the disagreement between practitioners isn't about raw capability but about whether the errors are cheap (easily caught, low stakes) or expensive (subtle, compounding, hard to audit). Someone who works on safety-critical systems has a very different prior than someone shipping web apps.


Coding ability is more of a vector than it is a scalar. Using a breakdown helpfully provided by ChatGPT 5.2 Thinking:


Most arguments are really about which of these capabilities you think models have:

  1. Local code generation (Boilerplate, idioms, small functions, straightforward CRUD, framework glue.)

  2. Code understanding in situ (Reading unfamiliar code, tracing control flow, handling large repos, respecting existing patterns.)

  3. Debugging and diagnosis (Finding root cause, interpreting logs, stepping through runtime behavior, reproducing bugs.)

  4. Refactoring and maintenance (Changing code without breaking invariants, reducing complexity, untangling legacy.)

  5. System design and requirements translation (Turning vague specs into robust design, choosing tradeoffs, anticipating failure modes.)

  6. Operational competence (Tests, CI, tooling, dependency management, security posture, deploy and rollback, observability.)

Two people can both say “LLMs are great at coding” and mean (1) only vs (1)+(2)+(6) vs “end-to-end ticket closure.”


With terminology hopefully clarified, I come to the actual proposal:

@strappingfrequent (one of the many Mottizens I am reasonably well-acquainted with off-platform), has very generously offered:

  1. A sizeable amount of tokens from his very expensive Claude Max plan ($200 a month!) and access to the latest Claude Opus.

  2. His experience using agent frameworks and orchestration. I can personally attest that he was doing this well before it was cool, I recall seeing detailed experimentation as early as GPT-4.

  3. His time in personally setting up experiments/tests, as well as overseeing their progress, while potentially interacting with an audience over a livestream.

He works as a professional programmer, and has told me that he has been consistently impressed by the capabilities of AI coding agents. They've served his needs well.

Here's his description of his skills and experience:

in my professional capacity, I've been working with Python for back-end (computer vision algorithms, FastAPI, Django) & Java (Spring). For Front-end; React. 95 percent of what I do is boilerplate, although Sonnet 3.5 did help me solve a novel problem last year but it did take quite a bit of back & forth -- the key was discussing what additional metrics I could capture to help nail down ~30+ parameters influencing a complicated computer vision pipeline.

tldr; the more represented your use case is in the training corpus, better results (probably) -- but I am absolutely confident that Opus 4.6 can help with novel problems, too. And, y'know -- Terence Tao thinks that as well.

To what end?

He and I share a dissatisfaction with AI discourse that substitutes confident assertion for empirical investigation, and we think the most useful contribution we can make is to show the tools actually working on tasks that skeptics consider beyond their reach.

What do we want from you?

If you self-identify as someone who is either on the fence about LLMs, or strongly skeptical that they're useful for anything: share a coding challenge that you think they're presently incapable of doing, or doing well.

An ideal candidate is a proposal that you think is beyond the abilities of any LLM, while not being so difficult that we think it'd be entirely intractable. Neither of us claims that we can solve Fermat's Last Problem (or that Claude can solve it for us).

Other requirements:

  • A clear problem specification, or a willingness to submit a vaguer one and then approve a tighter version as created by us/Claude.

  • Nothing so easy/trivial that a quick Google shows that someone's already done it. If you want a C++ compiler written by an LLM, well, there's one out there (though that is the opposite of trivial).

  • Nothing too hard. He provides an example of "coding a Netflix clone in 4 hours".

  • An agreement on the degree of human intervention allowed. Can we prompt the model if it gets stuck? Help it in other ways? Do you want to add something to the scope later? (Strongly inadvisable). Note that if you expect literally zero human intervention, SF isn't game. He says: "I don't think I'd care to demonstrate any sort of zero-shot capacity... that's a silly expectation. If my prime orchestrator spends 30 minutes building a full-stack webapp that doesn't work I'll say It doesn't work; troubleshoot, please. I trust your judgement."

  • A time-horizon. Even a Max plan has its limits, we can't be expected to start a task that'll take days to complete.

  • Some kind of semi-objective rubric for grading the outcome, if it isn't immediately obvious. Is it enough to succeed at all? Or do you want code that even Torvalds can't critique? And no, "I know it when I see it" isn't really good enough, for obvious reasons. Ideally, give us an idea of the tests everything needs to pass.

  • If your task requires the model to review/extend proprietary code, that's not off the table entirely, but it's up to you to make sure we can access it. Either send us a copy or point us at a repo.

  • Nothing illegal.

But to sum up, we want a task that we agree is probably feasible for an LLM, and where success will change your mind significantly. By which I mean: "If it succeeds at X, I will revise my estimate of LLM capability from Y to Z." Otherwise I can only imagine a lot of post-hoc goalpost movement, "okay but it still needed 3 prompts" or "the code works but a good programmer would have done it differently."

We reserve the right to choose which proposals we attempt, partly because some will be more interesting than others, and partly because we have finite tokens and finite time.
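For anyone who wants a template, the requirements above can be sketched as a small Python record. This is only an illustration, not a required submission format: the `Proposal` class, its field names, and the `missing` helper are hypothetical names made up for the sketch, and the example values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    """Hypothetical submission template; every field name is illustrative."""
    title: str
    spec: str                       # clear problem specification
    max_hours: float                # agreed time horizon for the attempt
    allowed_interventions: list[str] = field(default_factory=list)
    acceptance_tests: list[str] = field(default_factory=list)  # grading rubric
    belief_before: str = ""         # "Y": your current estimate of capability
    belief_after_success: str = ""  # "Z": your estimate if the task succeeds

    def missing(self) -> list[str]:
        """Return the names of requirements that are still unfilled.

        allowed_interventions is deliberately not checked: how much human
        help is acceptable is something to negotiate, not a yes/no field.
        """
        gaps = []
        if not self.spec.strip():
            gaps.append("spec")
        if self.max_hours <= 0:
            gaps.append("max_hours")
        if not self.acceptance_tests:
            gaps.append("acceptance_tests")
        if not (self.belief_before.strip() and self.belief_after_success.strip()):
            gaps.append("belief_update")
        return gaps


# Placeholder values only; a real proposal would fill in every field.
example = Proposal(
    title="Placeholder task",
    spec="Placeholder specification",
    max_hours=4.0,
    allowed_interventions=["tell the model when something is broken"],
)
print(example.missing())  # ['acceptance_tests', 'belief_update']
```

A proposal with an empty `missing()` list covers the spec, the time horizon, the rubric, and the pre-committed belief update, which are the points most likely to cause arguments after the fact.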


Miscellaneous concerns:

Why Claude Opus 4.6?

Well, the most honest answer is that @strappingfrequent already has a Claude Max plan, and is familiar with its capabilities. The other SOTA competitors include Gemini 3.1 Pro and GPT 5.3 Codex, which are nominally superior on a few benchmarks, but a very large fraction of programmers insist that Claude is still the best for general programming use cases. We don't think this choice matters that much, and the models are in fact roughly interchangeable while being noticeably superior to anything released before very late 2025.

Why bother at all?

We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points. We will share logs and a final repo either way. An interactive livestream is possible if there is sufficient interest.

Anything else?

You know the model. We'll be using Claude Code. The specifics of the demo are TBD, it could be a livestream with user engagement if there's sufficient interest, otherwise we can dump logs and share a final repo.

The floor is open. What do you think Claude cannot do?


Edit:

I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread himself. I am clearly not the real expert here, plus he's doing all of the heavy lifting. Expect at least a day or two of back and forth before we come to a consensus (but if he agrees to something, then assume I won't object, but not vice versa unless specifically noted), and that includes conversations within this thread to narrow down the scope and make things as rigorous as possible while adhering to the restrictions we've mentioned. At some point, we'll announce winners and get to scheduling.

I don't think you're quite clear in the post as to what camp you're actually in. Are you a straight bull? As in, do you think it can currently replace a sizeable portion of human knowledge work?

Moreover, it is not clear how knowledge work that is not coding qua coding fits into your schema. For example, I have in mind a flight dynamics simulation/control task. I'm not settled on it yet. My plan was to include a little twist that I had thought would likely not be in the published literature, but which I'm sure I could manage without too much difficulty, just pulling one book off of my shelf, confirming where exactly I need to make the modification and how (it's been a long time, but it's something I'm confident I could do without extreme effort), and then coding it. Unfortunately, I looked, and some darned student already published it (only minimal code published AFAICT, but they wrote out all the analysis in detail, so I can't really purely test its ability to do this aspect of the knowledge work on its own), so I'm trying to think of another good variant.

There are other little twists I had in mind, hoping to prevent it from being able to purely just pull code directly from others. These twists are things I've personally coded in the past, so I know they're doable. But the point is that they require sufficient knowledge to make choices along the way (for one example, choose this algorithm for this part, because I know it has certain characteristics) and I think they prevent it from being able to just use someone else's work for the core simulation components.

I guess, where does this fit within your schema, and where are you with respect to your own opinions? There is a lot of room between, "I personally know how to architect this code, what algorithms/assumptions to use, how to modify the analysis for the instant case, and then I use Claude to help with building the components", "I do the analysis, give it to it, tell it to code up the whole thing, then I go in and tell it to change things to make better choices that fit my knowledge-work-educated beliefs on how it should be done," and, "I tell it to code up the whole thing, maybe tell it that something's broken, but part of the test is whether it made the right analysis and knowledge-work-educated choices on its own along the way."

In other words, what I'm interested in is not so much about what it can do in terms of coding qua coding. It could be utterly magical at that, and that would be great. But how much of my own knowledge work do I need to input to get it to code the "right" thing, versus how much it's able to make the correct choices on its own about what the "right" thing is.

I don't think you're quite clear in the post as to what camp you're actually in. Are you a straight bull? As in, do you think it can currently replace a sizeable portion of human knowledge work?

I've shared my thoughts on LLMs consistently here, for years. It wasn't central to this particular demo.

But if you want to know:

  • I do not hold very high confidence claims on their capabilities in coding, because I'm not a professional programmer. I get the impression that they're very useful, on the basis of statements made by people like Karpathy, and by observing specific advances.
  • I think they are already capable of replacing a large chunk of existing knowledge work. The market hasn't caught up to this. If LLM progress were arrested right here, we'd still see seismic shocks for years into the future as industries adjusted.
  • Specifically in the field of medicine, they can do most of the raw cognitive labor doctors do - as well or better than the average doctor. I could automate 90% of my job today, leaving aside the physical tasks. The primary thing holding me back is archaic NHS IT. LLMs give solid medical advice.
  • I have a median timeline for AGI that's ~2030. 70% CI by 2035. I put a very non-negligible chance on it arriving by start of 2028 or even 2027.
  • I do not make strong claims on whether the current Transformer architecture/LLMs are capable of scaling into AGI, or if we need new paradigms. Even if we do, I think the ludicrous amounts of monetary investment and the attention of thousands of the smartest humans alive will likely find them.

I think this would probably make me an LLM bull, even if I'm not maximally bullish. Definitely "displacement imminent".

I would call you a moderate under my schema, and probably an "instrumental optimist".

Either way, I don't think you're our target audience for this demo, since you personally and professionally use SOTA LLMs with regularity and are familiar with their pitfalls.

Specifically in the field of medicine, they can do most of the raw cognitive labor doctors do - as well or better than the average doctor.

As an outsider, I am unsure of how impressive this is. I know that "most of the raw cognitive labor programmers do while writing code" is fairly rote, but I don't know how true that is for doctors.

How much of the raw cognitive labor doctors do could be done by a bright undergrad with access to uptodate and a bunch of case histories, both with semantic search?

70% of medicine is minimizing unknown unknowns by knowing as much as you can, and knowing the boundaries of what is unknown to you. I believe a more concise way of expressing that is "knowledge". Regrettably, the books are fat and intimidating for good reason: there are a lot of things to know.

The remaining 30% is reasoning from knowledge, clinical experience (yet another form of knowledge, just the stuff the textbooks don't tell you) and pattern recognition.* This is more dependent on your wits, or your fluid intelligence, if I'm being precise.

The best doctors both know a lot, and are bright enough to apply that information well. The former is indispensable, you simply cannot figure out medicine by sitting in a cave and thinking very hard. I don't know if some superintelligence can look at a single human without the aid of tools, ponder very hard, and figure out everything worth knowing. All I can say is that it's beyond any actual human.

(IQ/g also correlates strongly with memory, so the relative importance of both is very hard to tease out. Especially when there's a high-pass filter with most of the idiots and amnesiacs strained out by the end of med school)

How much of the raw cognitive labor doctors do could be done by a bright undergrad with access to uptodate and a bunch of case histories, both with semantic search?

Let me put it this way: I was a bright kid, and felt like I knew a lot of medicine before entering med school, both due to cultural osmosis and because I took an interest in it. You would not have wanted me as your actual doctor. I did not know nearly as much as I thought I did.

Later, I was a med student, a year or two in and confident that I knew the gist of it. I felt ready to make my own medical decisions, at least about myself. I thought I was smart and that I did my due diligence (reading things online, including research papers). It was insufficient, I did potentially permanent damage to my own health (I'm not going to go into details). I would not want that me as my doctor either.

Now, I am a lot older and a little more knowledgeable, if not necessarily wiser. You could do worse as your doctor, at least if we're sticking to psychiatry. You could probably do better too, but I have a place on the free market. I'm cheap, I give away my advice for free on the internet to anyone who asks nicely, and many who don't.

Along the way, I almost killed people through ignorance. Thankfully, nobody died, my colleagues caught it, or the pharmacist did, or I had a sudden sinking feeling in my gut and ran back to double check. Medicine recognizes that any human is fallible, and there are plenty of safeguards in place. Every junior doctor has their story of close calls, and hopefully nothing more than close calls. All senior doctors start as junior doctors, I hope.

Consider something else: most doctors will seek out a different doctor when they suffer a condition that isn't covered by their own specialty. Sometimes even then.

If a cardiologist feels funny in the head, he'll seek a neurologist. If a neurologist feels heart palpitations, he'll go talk to a cardiologist.

Why is that? Could they both not just open the relevant textbooks and figure out what the issue is? Can a cardiologist not take his med school knowledge of neurology and then skim something Elsevier put out?

These are people with complete medical training, genuine intelligence, and full access to literature, and they still defer to each other. That's not false modesty or liability management, it's that they've learned, through experience, exactly where their pattern recognition breaks down. They know the limits of their own competence.

Maybe. It might work out fine 90% of the time. But most doctors can handle ~90% of conditions, because most conditions are common and usually simple to manage. I apologize for the tautology, I can't see my way around it.

The other 10% are where the specialists come in. You cannot take a psychiatrist (even a smart one) and give him access to UpToDate and expect him to be as good a cardiologist as an actual trained cardiologist. He might do okay, but he's going to kill people along the way.

And that is a fully qualified doctor dabbling in another branch of medicine. A "bright undergrad with access to uptodate and a bunch of case histories, both with semantic search" will crash and burn. I'd bet good money on it, it'll happen sooner rather than later.

If they set up shop and started seeing patients, bumbling their way through things and furiously looking things up as soon as they could, they might successfully treat the colds, stomach upsets, sore throats and so on. That's the bulk of undifferentiated medicine, as you'd expect. They might catch some of the rarer stuff. They will also be very poorly calibrated and commit significant iatrogenic harm. But rest assured they will kill people eventually (at a rate massively higher than a doctor normally does).

That's not even getting into time pressure, or physical findings and techniques that are impossible to adequately convey over just video and text.

LLMs? They narrow the gap significantly, but do not have thumbs. The bright undergrad would benefit immensely from ChatGPT, but rest assured that most of the performance would come from ChatGPT itself, and they would add little. Handcuffing a child to a man does not make their combination superior.

The combination of factors that makes a good human clinician is rare. And when you do find it, you're investing a great deal in training to get them up to scratch. Most of this is the bottleneck of information transfer/learning, which LLMs neatly sidestep. GPT-4 did well, and it was dumb as bricks compared to current models. Turns out an encyclopedic knowledge of medicine will get you very far, even if you're not very bright. But it was also able to access and process this information faster than your thought experiment of a human with a computer.

But if you want a final answer: 60-70%. Best estimate I have.

*Sufficiently advanced pattern recognition is indistinguishable from intelligence. It might well be intelligence. You know LLMs, you know this.