This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

See For Yourself: A Live Demo of LLM capabilities
As someone concerned with AI Safety or the implications of cognitive automation for human employability since well before it was cool, I must admit a sense of vindication from seeing AI dominate online discourse, including on the Motte.
We have a wide range of views on LLM capabilities (at present) as well as on their future trajectory. Opinions are heterogeneous enough that any attempt at taxonomy will fail to capture individual nuance, but as I see it:
LLM Bulls: Current SOTA LLMs are capable of replacing a sizeable portion of human knowledge work. Near-future models or future architectures promise AGI, then ASI in short order. The world won't know what hit it.
LLM moderates: Current SOTA models are useful, but incapable of replacing even mid-level devs without negative repercussions on work quality or code performance/viability. They do not fully substitute for the labor of the average professional programmer in the West. That may or may not be achieved in the near future. AGI is uncertain, ASI is less likely.
LLM skeptics: Current SOTA models are grossly overhyped. They are grossly incompetent at the majority of programming tasks and shouldn't be used for anything more than boilerplate, if that. AGI is unlikely in the near-term, ASI is a pipedream.
Gary Marcus, the dearly departed Hlynka. Opinions not worth discussing.
Then there's the question of whether LLMs or recognizable derivatives are capable of becoming AGI/ASI, or if we need to make significant discoveries in terms of new architectures and/or training pipelines (new paradigms). Fortunately, that isn't relevant right now.
Alternatively, according to Claude:
The Displacement Imminent camp thinks current models already threaten mid-level knowledge work, and the curve is steep enough that AGI is a near-term planning assumption, not a thought experiment.
The Instrumental Optimist thinks current models are genuinely useful in a supervised workflow, trajectory is positive but uncertain, AGI is possible but not imminent. This is probably the modal position among working engineers who actually use these tools.
The Tool Not Agent camp thinks current models are genuinely useful as sophisticated autocomplete or search, but the "agent" framing is mostly hype — they fail badly without tight human scaffolding, and trajectory is uncertain enough that AGI is not worth pricing in.
The Stochastic Parrot camp (your skeptics, minus the pejorative) thinks the capabilities are brittle, benchmark gaming is rampant, and real-world coding performance is far below reported evals. They're often specifically focused on the unsupervised case and the question of whether the outputs are actually understood vs. pattern-matched.
The dimension you might also want to add explicitly is who bears the cost of the failure modes — because a lot of the disagreement between practitioners isn't about raw capability but about whether the errors are cheap (easily caught, low stakes) or expensive (subtle, compounding, hard to audit). Someone who works on safety-critical systems has a very different prior than someone shipping web apps.
Coding ability is more of a vector than it is a scalar. Using a breakdown helpfully provided by ChatGPT 5.2 Thinking:
Most arguments are really about which of these capabilities you think models have:
1. Local code generation (Boilerplate, idioms, small functions, straightforward CRUD, framework glue.)
2. Code understanding in situ (Reading unfamiliar code, tracing control flow, handling large repos, respecting existing patterns.)
3. Debugging and diagnosis (Finding root cause, interpreting logs, stepping through runtime behavior, reproducing bugs.)
4. Refactoring and maintenance (Changing code without breaking invariants, reducing complexity, untangling legacy.)
5. System design and requirements translation (Turning vague specs into robust design, choosing tradeoffs, anticipating failure modes.)
6. Operational competence (Tests, CI, tooling, dependency management, security posture, deploy and rollback, observability.)
Two people can both say “LLMs are great at coding” and mean (1) only vs (1)+(2)+(6) vs “end-to-end ticket closure.”
With terminology hopefully clarified, I come to the actual proposal:
@strappingfrequent (one of the many Mottizens I am reasonably well-acquainted with off-platform), has very generously offered:
A sizeable amount of tokens from his very expensive Claude Max plan ($200 a month!) and access to the latest Claude Opus.
His experience using agent frameworks and orchestration. I can personally attest that he was doing this well before it was cool, I recall seeing detailed experimentation as early as GPT-4.
His time in personally setting up experiments/tests, as well as overseeing their progress, while potentially interacting with an audience over a livestream.
He works as a professional programmer, and has told me that he has been consistently impressed by the capabilities of AI coding agents. They've served his needs well.
Here's his description of his skills and experience:
To what end?
He and I share a dissatisfaction with AI discourse that substitutes confident assertion for empirical investigation, and we think the most useful contribution we can make is to show the tools actually working on tasks that skeptics consider beyond their reach.
What do we want from you?
If you self-identify as someone who is either on the fence about LLMs, or strongly skeptical that they're useful for anything: share a coding challenge that you think they're presently incapable of doing, or doing well.
An ideal candidate is a proposal that you think is beyond the abilities of any LLM, while not being so difficult that we think it would be entirely intractable. Neither of us claims that we can solve Fermat's Last Problem (or that Claude can solve it for us).
Other requirements:
A clear problem specification, or a willingness to submit a vaguer one and then approve a tighter version as created by us/Claude.
Nothing so easy/trivial that a quick Google shows that someone's already done it. If you want a C++ compiler written by an LLM, well, there's one out there (though that is the opposite of trivial).
Nothing too hard. He provides an example of "coding a Netflix clone in 4 hours".
An agreement on the degree of human intervention allowed. Can we prompt the model if it gets stuck? Help it in other ways? Do you want to add something to the scope later? (Strongly inadvisable). Note that if you expect literally zero human intervention, SF isn't game. He says: "I don't think I'd care to demonstrate any sort of zero-shot capacity... that's a silly expectation. If my prime orchestrator spends 30 minutes building a full-stack webapp that doesn't work I'll say 'It doesn't work; troubleshoot, please. I trust your judgement.'"
A time-horizon. Even a Max plan has its limits, we can't be expected to start a task that'll take days to complete.
Some kind of semi-objective rubric for grading the outcome, if it isn't immediately obvious. Is it enough to succeed at all? Or do you want code that even Torvalds can't critique? And no, "I know it when I see it" isn't really good enough, for obvious reasons. Ideally, give us an idea of the tests everything needs to pass.
If your task requires the model to review/extend proprietary code, that's not off the table entirely, but it's up to you to make sure we can access it. Either send us a copy or point us at a repo.
Nothing illegal.
But to sum up, we want a task that we agree is probably feasible for an LLM, and where success will change your mind significantly. By which I mean: "If it succeeds at X, I will revise my estimate of LLM capability from Y to Z." Otherwise I can only imagine a lot of post-hoc goalpost movement, "okay but it still needed 3 prompts" or "the code works but a good programmer would have done it differently."
We reserve the right to choose which proposals we attempt, partly because some will be more interesting than others, and partly because we have finite tokens and finite time.
Miscellaneous concerns:
Why Claude Opus 4.6?
Well, the most honest answer is that @strappingfrequent already has a Claude Max plan, and is familiar with its capabilities. The other SOTA competitors include Gemini 3.1 Pro and GPT 5.3 Codex, which are nominally superior on a few benchmarks, but a very large fraction of programmers insist that Claude is still the best for general programming use cases. We don't think this choice matters that much, and the models are in fact roughly interchangeable while being noticeably superior to anything released before very late 2025.
Why bother at all?
We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points. We will share logs and a final repo either way. An interactive livestream is possible if there is sufficient interest.
Anything else?
You know the model. We'll be using Claude Code. The specifics of the demo are TBD, it could be a livestream with user engagement if there's sufficient interest, otherwise we can dump logs and share a final repo.
The floor is open. What do you think Claude cannot do?
Edit:
I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread himself. I am clearly not the real expert here, plus he's doing all of the heavy lifting. Expect at least a day or two of back and forth before we come to a consensus (but if he agrees to something, then assume I won't object, but not vice versa unless specifically noted), and that includes conversations within this thread to narrow down the scope and make things as rigorous as possible while adhering to the restrictions we've mentioned. At some point, we'll announce winners and get to scheduling.
Write a coherent thousand word post in your voice about a topic of your choosing sufficiently well to fool standard stylometry techniques, and pass the sniff test as sounding like you to others here, even given
By "you", do you mean me (self_made_human)? Oh boy, you'd be surprised. I've actively experimented with this, and it's one of my vibes benchmarks for any new model release.
Of course, my standards were that it should pass my own sniff test. That roughly amounts to "does that sound like me?" and "are those the kinds of arguments I'd make?". They do a decent job, and have for a while. Not perfect, but good enough to fool most of the people, most of the time.
I didn't/don't have access to "standard stylometry tools", though I will admit I didn't go looking very hard.
Also, I see several issues with this proposal:
Note that these aren't insurmountable challenges: if the final essay written by Claude falls within the same distribution as what I've been writing (with minor LLM involvement), then that's... fine? If nobody without access to the ground truth knows which essay is which, that's a victory as far as I'm concerned.
(We'd need to decide if I'm allowed to use LLMs myself, in the same manner I already do. Claude wouldn't get the benefit of me giving it feedback or editing its output in any capacity)
This would be an absolute pain to collate, both for me and for Claude. Then again, it might be easier for it, right until it ran into the fact that even the largest context windows would choke. I write a lot, and have for many years. Unfortunately, I'm not famous enough to be preferentially scraped, LLMs do not know who I am without looking it up. That's a luxury reserved for Gwern or Scott.
Would you be willing to pay for that or provide access? I wouldn't want to financially burden poor SF more than he already is; the Claude Max plan is a sunk cost, while these clearly represent additional investment. Genuine question, I'm sure that Claude could usefully interface with everything if provided access.
Just after I started writing this, the Metal Gear ! sound effect played in my head. Hang on a minute, I don't think I'd describe you as an LLM skeptic according to my (incomplete and rudimentary) taxonomy. I'd struggle to pin you as a moderate either, though I can't recall you expressing strong opinions that aren't more research/alignment oriented. If my hunch is true, then you aren't the target audience for this! Plus I'll eat my hat if you don't already have access to the best models out there. Do you really need us for the test, beyond whatever blinding or scraping of my writing is necessary?
Also, this is slightly off-topic. We're not evaluating the agents on their ability to write code, we're testing their ability to mimic me convincingly. Sure, coding capabilities might come into the picture, especially if they're in charge of other models, but it's not as central.
This doesn't strike me as insurmountable. I'd be open to trying once you get back to me on the issues I raised.
Ok, how about a simplified test. Write 500 words without AI on a topic of your choice, or pick any unpublished writing you have saved up. It's quite short so I don't think this is a major burden.
The challenge will be to have the AI create a 500 word passage on any topic, it doesn't have to be the same, where when placed side by side, it will not be obvious which passage is AI. Any means and methods including agents are permitted as long as all output tokens came from the AI model. Any verbatim copying of human written text outside of quotations is not permitted.
Verification will be done by comments on this forum, where anyone with an established account can vote for one being AI.
The result will be determined by a 1-sided Z-test with p=0.05. If voters on this forum overall can determine which one is AI with statistical significance, the AI has failed the vibe check.
Voters can use any means and mechanisms to detect AI.
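For concreteness, here is a minimal sketch of how the vote tally could be scored. It assumes each vote is an independent guess at which passage is the AI one, that blind guessing is right 50% of the time, and that "p=0.05" means a significance threshold of 0.05. The function name and the example tallies are made up for illustration, and the test is the usual normal approximation to a binomial proportion, which gets shaky with only a handful of votes.

```python
import math


def one_sided_z_test(correct_votes: int, total_votes: int,
                     p0: float = 0.5, alpha: float = 0.05):
    """Test H0: voters identify the AI passage at chance (p = p0)
    against H1: they do better than chance (p > p0).

    Returns (z, p_value, detected), where detected=True means voters
    beat chance at significance level alpha.
    """
    if total_votes <= 0:
        raise ValueError("need at least one vote")
    if not 0 <= correct_votes <= total_votes:
        raise ValueError("correct_votes must be between 0 and total_votes")

    p_hat = correct_votes / total_votes
    # Standard error of the proportion under the null hypothesis.
    se = math.sqrt(p0 * (1 - p0) / total_votes)
    z = (p_hat - p0) / se
    # Upper-tail probability of a standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical tallies, not real results.
    for correct, total in [(20, 30), (18, 30)]:
        z, p, detected = one_sided_z_test(correct, total)
        print(f"{correct}/{total}: z={z:.2f}, p={p:.3f}, AI detected={detected}")
```

With 30 voters, 20 correct identifications gives z of about 1.83 and clears the 0.05 bar, while 18 correct does not. In other words, at that turnout the AI "passes" unless roughly two thirds of voters pick it out.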
Sounds interesting enough. I will note that using LLMs to write 500 words using my own work as a style reference and then just using that verbatim as a comment/post is not how I actually use them.
But as a general experiment? Sure, I'd be interested to see the results.
Does this preclude all human intervention after hitting go? Am I forbidden from telling the model that it has failed to capture my style or my opinions correctly, then either suggesting specific corrections or more broad advice?
You can guide and criticize the model as much as you want throughout the process, but none of your queries can be reproduced word for word in the final text.
Ok: the second sentence sounds bad, rewrite it.
Not ok: Try starting the second sentence with "However, this is not..." - This approach would result in words you wrote getting into the final output.
Hmm. I think that would be acceptable. Stand by for results, though it might take a while for us to hash it all out on our end.
Huh. It's been my sniff test for new models as well, and so far I have not seen much success. It should be easy! This is literally the most LLM-flavored task to ever task! And yet. I've sunk probably 50 hours into it.
My most recent attempt, which I sunk about 10 hours and $100 into, and which got a lot closer than any previous attempts, involved giving Claude a corpus of all my past writing and having it try multiple different ways of producing text on arbitrary topics in my voice. The things tried were
On the one hand, I was very impressed by how good Claude was at running a whole bunch of these experiments very quickly. On the other hand, it did not work for me, not even at the level of "passes the sniff test", much less at the level of "standard stylometry techniques say it sounds like me".
I think you'll find that this is one of the tasks that is now much much easier. It's actually been within the capabilities of frontier models since Sonnet 4.0 (which is when I went ahead and gathered said corpus, on the theory that it'd be pretty useful to have). The prompt you're looking for is something like "Here's a chrome instance running with --remote-debugging-port and logged in on most of the sites I post on with a tab open for each. Go generate a corpus of all my publicly available writing".
Yeah. An H100 for 24h would run in the ballpark of $40, well worth it for me to provide. Vast allows transferring credits from one account to another, so I'd happily just transfer $50 of credits over if someone actually wants to do this. Does seem like rather a lot of work though.
Yeah, that's entirely reasonable. Your voice is very different from Claude's voice.
Yeah, I'm hoping you can prove me wrong here. I've been trying to do this since back in late 2019 when nostalgebraist-autoresponder was shiny and new. I want a good simulacrum of myself! I want to have that simulacrum, and I want to loom it. I want to build an exobrain, and merge with it, and fork off a copy running in the cloud.
BTW I expect there's a substantial market for anyone who manages to build this in a repeatable way. I've looked, and there are as of now no commercial offerings for this (though there are a few commercial offerings that pretend to be this).
I only have access to the models you can obtain access to with money - I expect I'm 3-6 months behind the best of what insiders at Anthropic or OAI have access to.
An LLM skeptic is an LLM idealist who's been disappointed :)
I expect looking like you stylometrically while also exhibiting the same patterns of thought you exhibit on a specific topic will involve writing code. But code in the service of trying to mimic you convincingly, rather than in the service of producing some specific durable software artifact.
For the record, I do expect this to be within the capability window within the next 18 months, but I would be pretty surprised if you managed to get Opus 4.6 specifically to do it.
I think we're on the same page here, I'll talk to SF about this. I'm willing to put in the effort on my end, which, as I see it, is to write a 1000 word essay as I normally would. Not particularly onerous.
Let me give you an idea of how I normally approach this. I simply copy-paste pages of my profile after sorting by top, usually at least two or three pages (45k tokens). I might also share a few "normal" pages in chronological order, for the sake of diversity if nothing else.
I did just this, using Gemini 3.1 Pro on AI Studio (GPT 5.2 Thinking, which I pay for, can't write in arbitrary styles nearly as well no matter how hard you try, and I've tried a lot, I don't pay for Claude so I'm stuck with Sonnet):
I copied and pasted the first two profile pages, sorting by top of all time. Instructions were:
https://rentry.co/23dc63vs by Gemini https://rentry.co/p5yh68zu by Claude 4.6 Sonnet (same setup)
Results? I'd grade Gemini a 7/10, Claude a 5/10.
Looking at Gemini:
Looking closer:
I don't live or work near Bromley. That's where an uncle of mine resides. It's clear from the context I shared that I'm up in Scotland.
I could see myself saying this. Maybe not those exact figures, perhaps 10%:90%, but directionally correct.
Very good. I would use that verbatim in a real essay.
I wouldn't say that at all dawg. Why would I randomly reference my user flair in an essay?
Claude's version is shit. It's staggeringly content free, and while it's closer to "raw" me, it also uses em-dashes and uses many words to say few things. Maybe it's bad luck, I've had better results in the past, especially since I usually share a specific topic instead of letting it decide on its own.
Here is the whole prompt, profile dump included, if you want to try with a different model. I'll see about using Opus, I know 5.2 Thinking will shit the bed in a stylistic sense.
Rentry won't let me paste the whole thing. But I think I've been clear enough to reproduce independently. I'll happily take a look.
Gemini's sample is impressive! Color me impressed, especially that a straight-up prompt produced that (though I suppose if any technique would get it with current models, it'd be "one shotting through a prompt" rather than "iterative refinement towards a target").
It doesn't sound quite the same as the version of you that lives in my head, but it's awfully close. E.g. I can't imagine you saying
since you don't tend to drop spurious technical details into your walls of text unless they serve a purpose (and also because I half suspect you're not a fan of the amyloid theory of Alzheimer's). More generally, the Gemini piece has a higher density of eyeball kicks than I model your writing as having. And I model your writing as having a lot of those, for a human.
It also seems to drift away from your voice in the second half. And it fails the stylometry vibe check - Pangram detects AI with medium confidence - but maybe in a way that's reparable. It fails actual stylometry too (Cohen's d of +17 on dashes, +2 on words >9 letters, +1.5 on mean word length in general, -2 on 3-4 letter words, -1.2 on punctuation in general - i.e. you use more and more varied punctuation and shorter words, by a notable margin, and Gemini uses way, way, way more dashes). Still, it's much much better than I expected! (and yeah, the Claude one is not even worth discussing)
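For anyone who wants to run a comparable check, here is a rough sketch of one way to get Cohen's d on surface features like these. It is an assumption about the method, not a reconstruction of the exact pipeline behind the numbers above: the feature set, the function names, and the choice to treat each document as one observation are all hypothetical. Positive d means the AI samples score higher than the human ones.

```python
import math
import re
import statistics

EN_EM_DASHES = ("\u2013", "\u2014")  # en dash, em dash


def style_features(text: str) -> dict[str, float]:
    """Per-document rates for a few surface features (illustrative set)."""
    # Words are runs of letters and apostrophes, so "don't" counts as 5.
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)  # avoid division by zero on empty input
    lengths = [len(w) for w in words] or [0]
    return {
        "dashes_per_100_words": 100 * sum(text.count(d) for d in EN_EM_DASHES) / n,
        "long_word_share": sum(l > 9 for l in lengths) / n,
        "short_word_share": sum(3 <= l <= 4 for l in lengths) / n,
        "mean_word_length": statistics.fmean(lengths),
    }


def cohens_d(sample: list[float], reference: list[float]) -> float:
    """Standardized mean difference using the pooled standard deviation.

    Positive values mean `sample` scores higher than `reference`.
    """
    n1, n2 = len(sample), len(reference)
    if n1 < 2 or n2 < 2:
        raise ValueError("need at least two documents per group")
    pooled_var = ((n1 - 1) * statistics.variance(sample)
                  + (n2 - 1) * statistics.variance(reference)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("no variance in either group; d is undefined")
    return (statistics.fmean(sample) - statistics.fmean(reference)) / math.sqrt(pooled_var)


def compare(ai_docs: list[str], human_docs: list[str]) -> dict[str, float]:
    """Cohen's d per feature, AI documents versus human documents."""
    ai = [style_features(t) for t in ai_docs]
    human = [style_features(t) for t in human_docs]
    return {name: cohens_d([f[name] for f in ai], [f[name] for f in human])
            for name in ai[0]}
```

Calling `compare(ai_docs, human_docs)` with at least two texts per side returns one effect size per feature. With small samples the estimates are noisy, so a handful of essays will mostly tell you about the loudest differences, like dash usage.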
Interestingly, your results look much, much better to me than the ones I get myself. I ran the same test as you did against Gemini, and got these not-very-good attempts: 1 2 3. Gemini took distinctive phrases (e.g. "85% agree") and ideas (e.g. "claude code as supply chain risk") I have used once in the corpus, fixated on them, and stitched them together into a skinsuit which superficially resembles my writing but doesn't hold up under scrutiny. Interestingly, that's a very base model flavored failure mode. I have grown unused to seeing base-model-flavored failure modes, and as such Gemini is much more interesting to me now.