@Rov_Scam comments on "Culture War Roundup for the week of February 23, 2026"

Culture War Roundup for the week of February 23, 2026

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

Shaming.
Attempting to 'build consensus' or enforce ideological conformity.
Making sweeping generalizations to vilify a group you dislike.
Recruiting for a cause.
Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
Be as precise and charitable as you can. Don't paraphrase unflatteringly.
Don't imply that someone said something they did not say, even if you think it follows from what they said.
Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

self_made_human amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi 4mo ago · Edited 4mo ago

See For Yourself: A Live Demo of LLM capabilities

As someone who has been concerned with AI Safety and the implications of cognitive automation for human employability since well before it was cool, I must admit a sense of vindication from seeing AI dominate online discourse, including on the Motte.

We have a wide range of views on LLM capabilities (at present) as well as on their future trajectory. Opinions are heterogeneous enough that any attempt at taxonomy will fail to capture individual nuance, but as I see it:

LLM Bulls: Current SOTA LLMs are capable of replacing a sizeable portion of human knowledge work. Near-future models or future architectures promise AGI, then ASI in short order. The world won't know what hit it.
LLM moderates: Current SOTA models are useful, but incapable of replacing even mid-level devs without negative repercussions on work quality or code performance/viability. They do not fully substitute for the labor of the average professional programmer in the West. This may or may not be achieved in the near future. AGI is uncertain, ASI is less likely.
LLM skeptics: Current SOTA models are grossly overhyped. They are grossly incompetent at the majority of programming tasks and shouldn't be used for anything more than boilerplate, if that. AGI is unlikely in the near-term, ASI is a pipedream.
Gary Marcus, the dearly departed Hlynka. Opinions not worth discussing.

Then there's the question of whether LLMs or recognizable derivatives are capable of becoming AGI/ASI, or if we need to make significant discoveries in terms of new architectures and/or training pipelines (new paradigms). Fortunately, that isn't relevant right now.

Alternatively, according to Claude:

The Displacement Imminent camp thinks current models already threaten mid-level knowledge work, and the curve is steep enough that AGI is a near-term planning assumption, not a thought experiment.

The Instrumental Optimist thinks current models are genuinely useful in a supervised workflow, trajectory is positive but uncertain, AGI is possible but not imminent. This is probably the modal position among working engineers who actually use these tools.

The Tool Not Agent camp thinks current models are genuinely useful as sophisticated autocomplete or search, but the "agent" framing is mostly hype — they fail badly without tight human scaffolding, and trajectory is uncertain enough that AGI is not worth pricing in.

The Stochastic Parrot camp (your skeptics, minus the pejorative) thinks the capabilities are brittle, benchmark gaming is rampant, and real-world coding performance is far below reported evals. They're often specifically focused on the unsupervised case and the question of whether the outputs are actually understood vs. pattern-matched.

The dimension you might also want to add explicitly is who bears the cost of the failure modes — because a lot of the disagreement between practitioners isn't about raw capability but about whether the errors are cheap (easily caught, low stakes) or expensive (subtle, compounding, hard to audit). Someone who works on safety-critical systems has a very different prior than someone shipping web apps.

Coding ability is more of a vector than it is a scalar. Using a breakdown helpfully provided by ChatGPT 5.2 Thinking:

Most arguments are really about which of these capabilities you think models have:

1. Local code generation (Boilerplate, idioms, small functions, straightforward CRUD, framework glue.)
2. Code understanding in situ (Reading unfamiliar code, tracing control flow, handling large repos, respecting existing patterns.)
3. Debugging and diagnosis (Finding root cause, interpreting logs, stepping through runtime behavior, reproducing bugs.)
4. Refactoring and maintenance (Changing code without breaking invariants, reducing complexity, untangling legacy.)
5. System design and requirements translation (Turning vague specs into robust design, choosing tradeoffs, anticipating failure modes.)
6. Operational competence (Tests, CI, tooling, dependency management, security posture, deploy and rollback, observability.)

Two people can both say “LLMs are great at coding” and mean (1) only vs (1)+(2)+(6) vs “end-to-end ticket closure.”

With terminology hopefully clarified, I come to the actual proposal:

@strappingfrequent (one of the many Mottizens I am reasonably well-acquainted with off-platform) has very generously offered:

A sizeable amount of tokens from his very expensive Claude Max plan ($200 a month!) and access to the latest Claude Opus.
His experience using agent frameworks and orchestration. I can personally attest that he was doing this well before it was cool; I recall seeing detailed experimentation as early as GPT-4.
His time in personally setting up experiments/tests, as well as overseeing their progress, while potentially interacting with an audience over a livestream.

He works as a professional programmer, and has told me that he has been consistently impressed by the capabilities of AI coding agents. They've served his needs well.

Here's his description of his skills and experience:

in my professional capacity, I've been working with Python for back-end (computer vision algorithms, FastAPI, Django) & Java (Spring). For Front-end; React. 95 percent of what I do is boilerplate, although Sonnet 3.5 did help me solve a novel problem last year but it did take quite a bit of back & forth -- the key was discussing what additional metrics I could capture to help nail down ~30+ parameters influencing a complicated computer vision pipeline.

tldr; the more represented your use case is in the training corpus, better results (probably) -- but I am absolutely confident that Opus 4.6 can help with novel problems, too. And, y'know -- Terence Tao thinks that as well.

To what end?

He and I share a dissatisfaction with AI discourse that substitutes confident assertion for empirical investigation, and we think the most useful contribution we can make is to show the tools actually working on tasks that skeptics consider beyond their reach.

What do we want from you?

If you self-identify as someone who is either on the fence about LLMs, or strongly skeptical that they're useful for anything: share a coding challenge that you think they're presently incapable of doing, or doing well.

An ideal candidate is a proposal that you think is beyond the abilities of any LLM, while not being so difficult that we think it'd be entirely intractable. Neither of us claims that we can solve Fermat's Last Theorem (or that Claude can solve it for us).

Other requirements:

A clear problem specification, or a willingness to submit a vaguer one and then approve a tighter version as created by us/Claude.
Nothing so easy/trivial that a quick Google shows that someone's already done it. If you want a C++ compiler written by an LLM, well, there's one out there (though that is the opposite of trivial).
Nothing too hard. He provides an example of "coding a Netflix clone in 4 hours".
An agreement on the degree of human intervention allowed. Can we prompt the model if it gets stuck? Help it in other ways? Do you want to add something to the scope later? (Strongly inadvisable). Note that if you expect literally zero human intervention, SF isn't game. He says: "I don't think I'd care to demonstrate any sort of zero-shot capacity... that's a silly expectation. If my prime orchestrator spends 30 minutes building a full-stack webapp that doesn't work I'll say It doesn't work; troubleshoot, please. I trust your judgement."
A time-horizon. Even a Max plan has its limits, we can't be expected to start a task that'll take days to complete.
Some kind of semi-objective rubric for grading the outcome, if it isn't immediately obvious. Is it enough to succeed at all? Or do you want code that even Torvalds can't critique? And no, "I know it when I see it" isn't really good enough, for obvious reasons. Ideally, give us an idea of the tests everything needs to pass.
If your task requires the model to review/extend proprietary code, that's not off the table entirely, but it's up to you to make sure we can access it. Either send us a copy or point us at a repo.
Nothing illegal.

But to sum up, we want a task that we agree is probably feasible for an LLM, and where success will change your mind significantly. By which I mean: "If it succeeds at X, I will revise my estimate of LLM capability from Y to Z." Otherwise I can only imagine a lot of post-hoc goalpost movement: "okay but it still needed 3 prompts" or "the code works but a good programmer would have done it differently."

We reserve the right to choose which proposals we attempt, partly because some will be more interesting than others, and partly because we have finite tokens and finite time.

Miscellaneous concerns:

Why Claude Opus 4.6?

Well, the most honest answer is that @strappingfrequent already has a Claude Max plan, and is familiar with its capabilities. The other SOTA competitors include Gemini 3.1 Pro and GPT 5.3 Codex, which are nominally superior on a few benchmarks, but a very large fraction of programmers insist that Claude is still the best for general programming use cases. We don't think this choice matters that much, and the models are in fact roughly interchangeable while being noticeably superior to anything released before very late 2025.

Why bother at all?

We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points. We will share logs and a final repo either way. An interactive livestream is possible if there is sufficient interest.

Anything else?

You know the model. We'll be using Claude Code. The specifics of the demo are TBD, it could be a livestream with user engagement if there's sufficient interest, otherwise we can dump logs and share a final repo.

The floor is open. What do you think Claude cannot do?

Edit:

I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread himself. I am clearly not the real expert here, plus he's doing all of the heavy lifting. Expect at least a day or two of back and forth before we come to a consensus (but if he agrees to something, then assume I won't object, but not vice versa unless specifically noted), and that includes conversations within this thread to narrow down the scope and make things as rigorous as possible while adhering to the restrictions we've mentioned. At some point, we'll announce winners and get to scheduling.

Rov_Scam self_made_human 4mo ago · Edited 4mo ago

Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry-picking for something you think the LLM can do. The problem is that most people who aren't computer programmers aren't going to be able to think of anything other than a piece of software that they wish existed but doesn't, and then ask you to write it from scratch, which is going to be cost-prohibitive beyond the kind of textbook examples that were constructed for teaching purposes and don't address problems anyone is actually trying to solve. This seems like it should be marketing 101, but if you're trying to convince people that your product is worthwhile (and that's your stated goal), you have to show them that it will actually help them do something they want to do. If you tell me it can write code to fetch data from a REST API using asynchronous requests then I'll smile and nod, but that's complete gibberish to me, and I won't know whether I should be impressed by it or not, or how that's supposed to improve my life.

So instead, I propose that we re-run the test I gave you last summer, because that is something I actually would use it for, and it obviously isn't too complicated. I'm an LLM skeptic, if you haven't noticed yet, but this is one of the things I think LLMs should be good at. For those who aren't going to click the link, the test was for the LLM to determine the release dates for various singles/albums based on a set of rules. I am extremely particular about my music collection and find the need to catalog everything down to the date of release, and that includes estimating dates when an exact one isn't available. I'm asking the LLM to automate what I already do myself. And I don't think this should be very complicated; in essence, what I'm asking it to do is query a series of databases, select a date based upon preference-ranked criteria, and potentially apply a mathematical calculation to that date. The hard part is that the databases are scattered across the internet, and some of them aren't formal databases but OCR scans of publications.

I had already tried this when OP asked for a challenge, and none of the models gave satisfactory results. I was assured that the new "reasoning" models that you had to pay for would do better. They did not. The first problem was that they were apparently unable to query some databases. The more concerning problem is that sometimes they queried the right databases but picked the wrong values. Sometimes they applied the rule incorrectly. The sample size wasn't large, but the models went 0/2. It's been several months since then, so maybe Round 2 will go better than Round 1? We can use the same releases as a preliminary test, but I recognize that the thread might have made it into training data or something since then, so if it passes I'd prefer to run a more comprehensive test. There would also be a possible coding application here, because if this were to work and I did use it, I wouldn't want to query each release individually but would do batches (say, all the releases from a given artist) and export the data to an XML file or something that I could just refer to.
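For anyone who thinks better in code, here is a minimal sketch of the selection-and-export step described above, under loud assumptions: the source names, their ranking, and the 14-day adjustment are hypothetical placeholders (the actual rules are in the linked test), and the hard part, actually fetching dates from scattered databases and OCR scans, is not shown at all. It only illustrates "pick the date from the most preferred source that has one, optionally adjust it, and write a batch to XML".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional
import xml.etree.ElementTree as ET

# Hypothetical source names, most preferred first (placeholders, not the real rules).
SOURCE_PRIORITY = ["label_announcement", "trade_magazine_review", "chart_debut"]

# Hypothetical per-source adjustment in days (placeholder value).
SOURCE_OFFSET_DAYS = {"chart_debut": -14}


@dataclass
class Candidate:
    source: str   # which source the date was found in
    found: date   # the date that source reports


def pick_release_date(candidates: List[Candidate]) -> Optional[date]:
    """Return the adjusted date from the most preferred source, or None."""
    by_source = {c.source: c for c in candidates}
    for source in SOURCE_PRIORITY:
        hit = by_source.get(source)
        if hit is not None:
            offset = SOURCE_OFFSET_DAYS.get(source, 0)
            return hit.found + timedelta(days=offset)
    return None  # no usable source; leave for manual lookup


def export_batch(releases: Dict[str, List[Candidate]]) -> str:
    """Serialize the chosen date for every release in a batch as XML."""
    root = ET.Element("releases")
    for title, candidates in releases.items():
        chosen = pick_release_date(candidates)
        node = ET.SubElement(root, "release", title=title)
        node.set("date", chosen.isoformat() if chosen else "unknown")
    return ET.tostring(root, encoding="unicode")
```

Keeping the ranking and the adjustments in two small tables means the rules can be checked by eye against the written version, which matters here because the failure described above was the models picking the wrong value or misapplying a rule, not an inability to do arithmetic.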

Another idea I had along similar lines would be for me to arbitrarily select a parcel of land in Westmoreland County, PA (selected because all of the recorded documents are available for free online) and see if it could download every deed in the chain of title going back 100 years. This particular task isn't hard to do but would be a proof of concept that it could possibly do more sophisticated work. I recognize that there are a number of scenarios that could arise that would completely flummox the LLM here. Given that, I could run a few parcels in advance and preselect an easy one as a proof of concept, though since LLM boosters like to brag about how powerful their models are I'm inclined to arbitrarily pick one without looking first and see how it does, especially since it cuts way down on the work I would need to do to verify the answer.

As a final option, if you're going to insist on a coding challenge, there's a feature in Photoshop that I've been hoping for for a long time but since it's for a niche application I doubt I'm ever going to get it. Part of being particular about my music collection means having cover art for everything, and a lot of the cover art just pulled straight from the internet is terrible, so I do a lot of cleaning it up. When all I have to use is images of 45 labels, I use a system to ensure that everything is consistent. I've automated most of this system with macros, but I still have to do the most time-consuming part manually. A 45 label is donut-shaped. Ideally, the inside hole and outside edge of the label should be clean circles, though certain printing imperfections make ellipses a better option. Scans available online are photographed and have fuzzy edges, and the outside and inside have information that needs to be deleted to create a perfect white background. What I have to do to achieve this is size the hole manually and hit delete. Photoshop has an area selection tool that can recognize the color change and select a large part of the area designated for deletion, but due to irregularities the edge is almost always irregular.

The tool I'm looking for would take these selections and normalize them to the nearest ellipse. The way I envision it working is that it would take a y-axis measurement, increase it by a few pixels to create a buffer, then take an x-axis measurement with a similar buffer increase, then create an ellipse based on those measurements (that's for the inside hole; the outside edge would be the same idea but would subtract from the axes to create the buffer). I wouldn't expect this to give perfect results 100% of the time, but it could work considerably less often than that and still speed things up significantly. The only reason I hesitate to propose this is that Photoshop isn't open source and I don't know how feasible it is to create plug-ins (they have some kind of system but I don't know enough about computers to know if what I'm asking for would work with it). I would be willing to settle for a GIMP plug-in as a proof of concept, but I absolutely despise GIMP so if it proves to work I'll have some serious soul-searching to do, and will probably request a lot more plug-ins to make it as much like Photoshop as possible.
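The geometry described above can be sketched independently of any editor. This is only an illustration of the measurement-plus-buffer idea, not a Photoshop or GIMP plug-in: the selection is assumed to arrive as a rectangular grid of booleans, the names fit_ellipse, ellipse_mask and buffer_px are made up for the example, and the buffer is assumed to be added to each semi-axis (a positive value grows the ellipse for the inside hole, a negative value shrinks it for the outside edge).

```python
from typing import List, Tuple

Mask = List[List[bool]]  # mask[y][x] is True where the pixel is selected


def fit_ellipse(mask: Mask, buffer_px: int = 2) -> Tuple[float, float, float, float]:
    """Return (cx, cy, rx, ry) of an axis-aligned ellipse for a ragged selection.

    The x and y extents of the selection are measured, then buffer_px is
    added to each semi-axis (use a negative value to shrink instead).
    """
    width = len(mask[0]) if mask else 0
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows or not cols:
        raise ValueError("selection is empty")

    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]

    cx = (left + right) / 2            # center in pixel-index coordinates
    cy = (top + bottom) / 2
    rx = (right - left + 1) / 2 + buffer_px
    ry = (bottom - top + 1) / 2 + buffer_px
    if rx <= 0 or ry <= 0:
        raise ValueError("buffer shrinks the ellipse to nothing")
    return cx, cy, rx, ry


def ellipse_mask(width: int, height: int,
                 cx: float, cy: float, rx: float, ry: float) -> Mask:
    """Rasterize the ellipse: True for every pixel whose center lies inside it."""
    return [
        [((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0 for x in range(width)]
        for y in range(height)
    ]
```

Feeding the ragged selection through fit_ellipse and then ellipse_mask gives a clean replacement selection to delete. Whether either editor's plug-in system exposes selections in a form like this is exactly the open question raised above.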

birb_cromble Rov_Scam 4mo ago

For those who aren't going to click the link

Did you forget the link?

Rov_Scam birb_cromble 4mo ago

Lol, by the time I finished writing that I forgot that I was supposed to link something. Fixed.
