
Culture War Roundup for the week of January 19, 2026

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

I commented recently about my personal experience using LLMs for work-related math stuff. I found that it wasn't great at giving me a whole proof (or really, much of a part of a proof) without error, but it helped me with some idea generation and pointing me to tools that I wasn't familiar with. To be fair, I haven't yet gotten access to any of the ones that are supposed to be hooked up to automated theorem provers, so maybe they'll work better (I've signed up for one, but their system wasn't working at the time; starting this post prompted me to try again, and I was able to get in; maybe I'll find time to really test it soon).

I guess I'd just like to report some experience with LLMs for other computer stuff. I had an extremely minor issue with one of my PCs. I wondered if LLMs could help. Through the course of this, I tried using multiple different LLMs.

The good is that it did have some good ideas for how to get started, and possible causes of the issue. I may have caused a bit of a false start off the bat, because rather than really consider the multiple ideas that it gave me, I thought, "Yeah, I could totally see X being the problem; maybe I should just do that." It was easy for me to think that I could just do the likely fix; it's normally an easy thing to do, and there's zero harm if it wasn't actually the cause of the problem. However, it turned out that my specific system has a surprisingly stupid design, and it was going to be a much greater pain to do it. So I resigned myself to hoping that it was one of the other root causes suggested by the LLM in the meantime, and I'd come back to the first idea later if I could confirm that it really was that.

The extra good is that, in hindsight, I am very sure that it was, indeed, one of the other root causes. So thankfully, I didn't waste too much time on the false start. However, once I began to implement my preferred fix, something strange was going wrong.

This is where we get into the bad. In diagnosing what was going wrong with the attempted fix, it got allllll into mess that was actually pretty low probability. Suggested permissions issues, suggested problems with registry entries. A couple of them were low risk, and at the time, seemed like they could be plausibly related, and I did mess with a couple things. Others were the ugly. No, Mr. Bot, I am not going to just delete that registry value (especially after I did a little non-LLM side research on what that registry value actually does).[1]

In the end, when I told it that I was balking on doing what it wanted me to do, it suggested that I could, in the meantime, do one of the standard procedures in a different way. Of course, it thought that doing this would just be a step toward me ultimately having to delete that registry value. But I figured trying this alternate procedure at the very least couldn't hurt, and indeed, it helped by giving me an actual error code!

The LLM thankfully helped me decode it (likely faster than a google search), which allowed me to adjust my fix. This was actually the key step, after which, I was able to understand what I think was going on and manage later hiccups. Unfortunately, the LLM didn't grasp this. It still was set on, "Great! Now you're ready to delete registry values!" Sigh.

After I adjusted my fix, I was able to get another (unrelated) error code from another step in the process. This time, I actually tried a google search for the error code first, and it came up empty, but the LLM told me exactly what it was (and it made sense), which was very nice and convenient. One final adjustment, and I think I have it working just fine.

The only remaining bad point is that the LLM still didn't realize that we'd fixed the problem! It still was all, "...and now you're ready to delete stuff in the registry!!!" I told it multiple times that the thing that was broken which was motivating it to think of deleting the registry value was no longer broken. Didn't matter; it really wanted to nuke that thing.

It all still leaves me quite conflicted. It was great in doing some idea generation and decoding error messages. But man, does it leave me scared to think about all the people who are just giving LLMs free rein to take actual actions in their computer. I focused here on the registry key issue, but there were more things along the way that it came up with that left me thinking, "...no, I'm pretty sure I don't want to mess with that unless I've got a lot more information and confidence about what's going on." If I had just said, "Go fix this, Ralph Wiggums[2]," who knows what sort of bollocks it would have done to my system. This worries me, because I hear all these people talking about how great it is that they can just tell their LLM to go change whatever it thinks is necessary to go fix whatever problem on their computer... and they really think they're rapidly approaching a world (if they're not already there) where they'll be happy to give it full access to just do anything to it.

It also dovetails with the worries about vibe coding. Forget about changing some OS settings; they're actually choosing to run arbitrary code on their system that is generated by an LLM. Yes, some folks do rock solid sandboxing, but let's be honest, if you're making anything that you or anyone else is going to actually use, it's not going to stay in a sandbox for long. I listened to a podcast this week, where one of the hosts, midshow, was like, "Yeah, I had this LLM make this program. I'm gonna have it add email functionality." And he just did it, live on air. Sandbox? Schmandbox. It now sends emails. What's it actually doing along the way? Who knows? He didn't check any of the code; of course he didn't. He wanted to see it send an email while he was still live.

"Technical debt" is the phrase that went through my mind in thinking about these experiences together. Yes, I was poking around at permissions/registry; sometimes, those things genuinely just get messed up. I've had experiences where my permissions have just gotten borked for completely unknown reasons; sometimes, I've been able to fix them; sometimes, stuff like that happens and you get to the point of, "This thing has been running a long time, and who knows what the long history of stuff has been, when this or that may have gotten corrupted; better to just wipe the OS and install clean." The term is more traditionally used with coding, when stuff has just gotten glommed on, piece by piece, and at some point, it's better to just throw it all away and invest in a clean slate rather than continuing to maintain the old mess. You can glom on email functionality to your vibe code in a few sentences and about twenty minutes. You don't need to think about whether that may be accruing technical debt.

Maybe the LLMs will keep getting better, and it'll be even easier to clean slate stuff in the future, so the pain of accumulating technical debt won't be as bad. But man, I can't help but think that a lot of people are unknowingly setting themselves up, both in their systems and in their vibe code. That one day, they'll just say, "This is broken; I don't know why; it's a mess of stuff that LLMs have globbed onto it over years; just go fix it, Ralph," and it will just do whackier and whackier stuff to their system/code that is already so whacked out that it just doesn't fit the mold of training data used to train the LLM.

1 - FTR, it was actually super relevant to be at least looking around in the registry, and doing so helped me understand what was going on.

2 - For those who haven't heard yet, this is the name for a technique where you tell the LLM to do something, and you set up a loop to repeatedly prompt it to keep working and doing stuff "until it's DONE".
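
For anyone who hasn't seen the loop described in footnote 2, here is a minimal sketch of the idea in Python. It is not any particular tool's implementation: `ralph_loop`, `ask_model`, and the stand-in model are hypothetical names made up for illustration, and the only stopping rules are the model saying "DONE" or a round budget running out.

```python
from typing import Callable


def ralph_loop(ask_model: Callable[[str], str], task: str, max_rounds: int = 5) -> list[str]:
    """Re-prompt the model until it reports DONE or the round budget is spent."""
    transcript: list[str] = []
    prompt = task
    for _ in range(max_rounds):
        reply = ask_model(prompt)
        transcript.append(reply)
        if "DONE" in reply:
            break
        # Nothing here checks whether the work is correct; it only nags.
        prompt = f"{task}\n\nYour last output was:\n{reply}\n\nKeep working until it's DONE."
    return transcript


def make_fake_model() -> Callable[[str], str]:
    """Stand-in for a real model call: claims to be finished on the third call."""
    calls = {"n": 0}

    def fake_model(prompt: str) -> str:
        calls["n"] += 1
        return "DONE" if calls["n"] >= 3 else f"still working (round {calls['n']})"

    return fake_model


transcript = ralph_loop(make_fake_model(), "fix the bug")
print(transcript)
```

Note that the loop only ever sees the model's own claim of being done, which is the part that makes handing it full access to a system worrying.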

This is where we get into the bad. In diagnosing what was going wrong with the attempted fix, it got allllll into mess that was actually pretty low probability. Suggested permissions issues, suggested problems with registry entries. A couple of them were low risk, and at the time, seemed like they could be plausibly related, and I did mess with a couple things. Others were the ugly. No, Mr. Bot, I am not going to just delete that registry value (especially after I did a little non-LLM side research on what that registry value actually does).[1]

In the end, when I told it that I was balking on doing what it wanted me to do, it suggested that I could, in the meantime, do one of the standard procedures in a different way. Of course, it thought that doing this would just be a step toward me ultimately having to delete that registry value. But I figured trying this alternate procedure at the very least couldn't hurt, and indeed, it helped by giving me an actual error code!

The LLM thankfully helped me decode it (likely faster than a google search), which allowed me to adjust my fix. This was actually the key step, after which, I was able to understand what I think was going on and manage later hiccups. Unfortunately, the LLM didn't grasp this. It still was set on, "Great! Now you're ready to delete registry values!" Sigh.

There are some LLM fundamentals that aren't taught but maybe should be. One of them is that if you even sniff that the LLM might have strayed an inch in the wrong direction, then you need to start a fresh context chat. In fact, even if things go well, once you've moved through a few steps of the process you ought to start a fresh chat. Always be starting a fresh chat.

We might need to come up with a catchy cartoon name for this strategy, otherwise it will lose the memetic war to bumbling Ralph Wiggums.

The code we get from Claude is often cleaner than the human-written code. The secret is a highly detailed claude.md file, having a code base which is hyper standardized from the beginning with standards for function naming, variable names, types etc and having a code base that is somewhat repetitive. Never let Claude build entire features in one go; make it build things one step at a time and manually fix things if it isn't up to the pedantic style. Once the code starts decaying the decay accelerates. If the rot isn't allowed to set in, the rot cannot spread.

Claude breaks down when features get more complex, when there are edge cases and when the business requirements are complex. The speed up varies greatly depending on the task.

There will be a noticeable difference between teams that have the discipline and skill to police Claude and those who let Claude loose. This simple culture difference will make and break companies going forward.

Another take is that TypeScript will become more popular as LLMs keep pushing it. Java and Csharp will become more popular in the startup scene as their main drawback is being verbose and generating a large amount of text is what LLMs excel at. The productivity gap between Csharp and python has been reduced with LLMs.

The secret is a highly detailed claude.md file, having a code base which is hyper standardized from the beginning with standards for function naming, variable names, types etc and having a code base that is somewhat repetitive. [...] Claude breaks down when features get more complex, when there are edge cases and when the business requirements are complex.

Which is a fancier way of saying that Claude works for simple repetitive standardized boilerplate (a fact that's not very controversial).

What makes me despise LLM advocates is their persistent gaslighting that anyone whose job doesn't involve writing such code that is inherently easy mode for LLMs "is using them wrong", as if you were only ever allowed to write code where the requirements neatly fit into such a narrow box.

TypeScript [...] Java and Csharp [...] python

Yeah. I don't use any of those languages other than sometimes Python when I need to fix tools where you need a true human+ level AGI to decipher WTF the inherited / third party code is even supposed to do (and no, it doesn't involve APIs or stack overflow snippet friendly code).

Indeed, and I hear echoes of old debates.

The attitude you're describing reminds me of how some advocates would tout Ruby on Rails as an obvious solution to software productivity needs, despite the active record pattern's drawbacks that make it inappropriate for anything except for simple CRUD applications (Create, Read, Update, Delete).

Well, I use Opus 4.5 on the $20 plan because I'm cheap and I find it very useful.

I think 'Ralph' is as dumb an idea as it sounds, no current AI is capable of going on autopilot like that. It needs a human to find errors for it and it needs clear human instructions for what to do or else it makes up its own vision for your software. I don't trust Opus's testing either, it has this alarming tendency of performative testing which doesn't actually test the real systems, just tests some pretend BS instead. It's much better with logging and manual checking for debugging. That's why I don't trust agents much, I find that they can just wreck the code or do weird things, go completely over the top from what you asked and are expensive to boot. But Opus on the website is basically an agent, you can just say 'edit these files inline' and it'll do so and that's good enough for me.
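
To make "performative testing" concrete, here is a small sketch of the pattern as I understand the complaint, not output from any actual model. `parse_price` is a hypothetical function that is buggy on purpose; one check swaps it for a mock and so passes no matter what, while the other calls the real code and catches the bug.

```python
from unittest.mock import Mock


def parse_price(text: str) -> float:
    """Hypothetical function under test; buggy on purpose (keeps the comma)."""
    return float(text.replace("$", ""))


def performative_test() -> bool:
    # Replaces the function under test with a mock, then checks the mock.
    stub = Mock(return_value=1234.5)
    return stub("$1,234.50") == 1234.5


def real_test() -> bool:
    # Calls the actual implementation on the same input.
    try:
        return parse_price("$1,234.50") == 1234.5
    except ValueError:
        return False


print(performative_test(), real_test())  # True False
```

The first check would stay green even if `parse_price` were deleted entirely, which is why logging and manual checking against the real system can be more trustworthy.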

The code itself does work pretty reliably. I haven't seen any real technical debt, with context management and a basic understanding of what you're doing it'll work out just fine even on a fairly complex project.

It needs a human to find errors for it and it needs clear human instructions for what to do or else it makes up its own vision for your software.

We should really come up with an exacting, formal language for communicating with the AI so that it generates deterministic outputs.

While it can share certain keywords with natural language, correctness is important, so we should really add in constructs to clearly delineate flow control, Boolean logic, assignment, and mathematical operations.

Great idea! And to improve performance, perhaps we could even create a program that deterministically converts this human-readable formal language into a binary representation that’s more similar to the instruction set of the underlying hardware.

I think there are three things going on here, all with the same (somewhat inconvenient) solution:

  1. LLMs have a tendency to get stuck on certain ideas, even if they acknowledge they're wrong. Once something is in context two or three times, it can be very hard to get it to let it go.
  2. LLMs advertise huge context limits, and, technically speaking, you can run Gemini 3 Pro on a million tokens... but you definitely shouldn't. Models get way dumber at high context, and noticeably dumber even at relatively modest context (anecdotally, it's significant even at 32k).
  3. Models tend to get dumber/less obedient the longer a conversation goes, even if the total context is still short*

The answer to all three problems is just to start a new session frequently and copy only the relevant and correct details into the new chat. It can be a pain if you're in the middle of something, but it gives the best results.
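
A rough sketch of that handoff, assuming you keep your own notes as you go and mark which ones you actually confirmed. `Note` and `handoff_prompt` are hypothetical names, and the note texts are invented for illustration; the point is only that unverified ideas never reach the new chat.

```python
from dataclasses import dataclass


@dataclass
class Note:
    text: str
    verified: bool  # True only for things you confirmed yourself


def handoff_prompt(goal: str, notes: list[Note]) -> str:
    """Build the opening message for a fresh chat from verified notes only."""
    lines = [f"Goal: {goal}", "Verified so far:"]
    lines += [f"- {note.text}" for note in notes if note.verified]
    return "\n".join(lines)


notes = [
    Note("The alternate procedure produced an actual error code", True),
    Note("A registry value needs to be deleted", False),  # model's idea, never confirmed
    Note("Adjusting the fix cleared the first error", True),
]
prompt = handoff_prompt("Finish the fix without touching unrelated settings", notes)
print(prompt)
```

Anything the model kept repeating but you never confirmed simply never enters the new context, so it has nothing to fixate on.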

This is... somewhat redolent of good coding practices, I think; encapsulation and abstraction, at least. If you break a problem into smaller parts and keep the boundaries between those parts strict, it's easier for both humans and LLMs to conceptualize the totality of what they need at any given time. Ideally, structuring a project this way will not just result in better LLM performance but in more maintainable code too.

On the other side: having an LLM write code at all (rather than, say, directly making system calls) is already a big step towards legibility (and thus maintainability). Such a system is obviously insane, but it's perfectly possible for your program to be a particular internal state of an LLM. For that matter, it's perfectly possible (and indeed ubiquitous) for your 'program' to be the internal state of a human mind. By analogy, 'human vibe coding' is telling the human to design a set of legible policies rather than using their own judgment directly, which does actually have the expected advantages of consistency, comprehensibility, and interoperability.

I guess the takeaway is that we should look to normal management strategy for clues on how to manage LLMs, which might be obvious.

* This at least I think is mainly a training issue: most RLHF/DPO is done on single-turn responses.

I've never encountered these issues and have had LLMs help me diagnose tons of computer issues and fix them. I pay for Claude Max. I've used Claude for coding and it has always delivered workable code b/c I still use the ol' fashioned software development life cycle i.e. plan, code, test & iterate. I test all the code I can (same as I do as a product manager with my human coders). I force Claude to explain new code and how it works and add it to the documentation which I review. Occasionally I do code reviews where I force Claude to make diagrams of how the code is interacting and then either add it to documentation or ask Claude to revise the code. Again... all stuff I've had to do as a PM with human coders b/c I'm not a great coder and left to their own devices human coders will go more off the range than even the dumbest LLM.

I think the difference is that as soon as I encounter something unhelpful or hallucinated, I simply start a new chat with a summary of what I've tried so far. I never compact context but simply start a new chat.

LLMs should be compared with customer service agents... if an agent was unhelpful would you stay on the line or just hang up and call a new one? I hang up until I get someone helpful... LLM sessions should be used similarly.

Frankly whenever I hear about these problems with LLMs I just think you have to treat it like a person... would you as an engineering manager just let your coders go out and code and never check in on them again? Would you continue using an unhelpful human agent instead of someone who was helpful? Would you just let some random person control your life?

Seems pretty simple to me...

LLMs should be compared with customer service agents... if an agent was unhelpful would you stay on the line or just hang up and call a new one? I hang up until I get someone helpful... LLM sessions should be used similarly.

Does this really work for human customer service agents? I would have presumed that they're following a pre-planned script for most issues.

I used to work in a call center because the economy was awful after 2008.... there's a big difference between how helpful, motivated and clever a particular rep might be (though tbh post 2008 you did get some very smart people working in menial jobs b/c the economy was awful).

This dovetails a bit with my footnote below about figuring out what "box" a person's world is. CSRs have scripts for the majority of the issues that they see on a regular basis. Task number one is to figure out whether your issue fits within one of their scripted boxes. If so, you're probably in good shape. If not, then individual quality can vary substantially. I've had multiple experiences where, after determining that my situation did not fit their script, it was very apparent that it would be important to get a person whose box extended beyond the scripts and included the knowledge/intelligence sufficient to work the problem. I've had times where, for example, they told me they could solve the problem, but they could not explain how the steps would work well enough that I was comfortable proceeding. A hang-up and a call back later, and I got someone who was very capable of conceptualizing the problem properly, taking a few minutes to work through how a solution would work, and (critically) explaining how it was going to work. Whether a simple call back to another Tier 1 CSR will get you that type of person versus having to fight to get to a Tier 2 person may vary.

@P-Necromancer I think I'd like to bundle these two, as they're getting at a similar thing.

I agree with what you both say. Plenty of humans will come up with ridiculous things to do, or even just things that might make sense but have problems, and if you're not supervising them appropriately, they may just do their things. But that's like, the essence of technical debt?

For the example of fixing some OS issue, imagine I didn't have really any technical knowledge of how things work (say, I don't really even know what the registry is unless a tech/LLM tells me something about it). Maybe I'd take my computer to a human tech. Could even be a corporate IT guy. Perhaps, knowing that I don't have a clue, I just give it to him. "Here's my problem; please fix it Ralph Rufus."

Who knows what he'll get up to? What stuff he'll mess with along the way. Things he'll try just because, and then maybe leave it in a changed state, even though it didn't progress toward a solution to the actual problem. This cruft can build up. After years of having this corporate IT guy and that corporate IT guy and the other corporate IT guy just doing who knows what, maybe at some point, things get bizarre enough that the next one says, "Dude, stuff is wild here; we probably should just wipe it and clean install."

That makes sense, and it's utterly routine in the world with humans. I hear my wife tell me about weird stuff that's broken on her work computer... and even weirder stuff that whatever IT guy she talked to did. She doesn't have a clue what's going on. I get it.

I also agree that as of right now[1], the best is when you know enough about what's going on that you can get it to explain things and are able to then understand it, yourself. Get it to document things fully, provide a suite of tests, have a back-and-forth. It can provide tons of utility![2]

...but, if you genuinely lack enough knowledge to be a competent participant of that back-and-forth, it still may let you "just do stuff". There can still be tons of utility here, as it may still get things right a lot, and folks who have had some problem that they've wanted to fix for ages and could never get the time with a competent human and certainly couldn't figure it out on their own will be able to fix many of those problems, and it will be wonderful. It may also, occasionally, along the way, build up technical debt.

Note that I'm not saying that this is some unique problem that is fundamentally different from dealing with humans. Instead, I'm now conceptualizing it in the same way that I conceptualize human-driven technical debt. I think that dovetails well with both of your descriptions. If there is a downside, it's probably that many folks who wouldn't have ever tried to fix that OS problem or make that code will now do it, and they might be building up technical debt while they're also accumulating utility. They may choose to do it a lot, and they may jump into it with both eyes shut. This may still be the right choice! They may still get more utility from all the wins than they lose from either discrete bad events or built-up cruft.

This is a conflict, a tension, which is why I said that I was, indeed, conflicted. I am still neither an "LLM good" nor an "LLM bad" person.

1 - I continue to take no position on the question of to what extent future progress will render this concern de minimis.

2 - To briefly respond to the 'shouldn't you just hang up on a human customer service agent who you can tell is going to be unhelpful', yes. Absolutely. I didn't bother with the specific issue of it getting hung up on deleting the registry value, because I was close enough that hearing it append its bad idea one more time wasn't important to me. I did mention that I used multiple LLMs, and that was part of it; I left out every twist and turn of the story, but yeah, I not only just scrapped the prior context; I even just jumped to different models. This is a useful skill to have, when dealing with humans and LLMs. Even when dealing with some human professionals, my life changed long ago when I realized that I could grasp some understanding of what their "box" of the world was, and once I realized that my situation was outside of their "box", I just moved on from them. But the concern here is that you have to have just enough knowledge about the thing to be able to gauge where their box is, when you're outside of it, or when they're going off the rails. There are a lot of people who don't have that with humans, and they're not going to have that with the many many more things that they're going to want to do with LLMs. I don't have that with all sorts of different humans or things that I might want to do with LLMs.

Sure, I don't disagree with anything here. Or really anything in the OP; just adding my two cents and offering a couple tips for making productive use of LLMs.

For the example of fixing some OS issue, imagine I didn't have really any technical knowledge of how things work (say, I don't really even know what the registry is unless a tech/LLM tells me something about it). Maybe I'd take my computer to a human tech. Could even be a corporate IT guy. Perhaps, knowing that I don't have a clue, I just give it to him. "Here's my problem; please fix it Ralph Rufus."

Who knows what he'll get up to? What stuff he'll mess with along the way. Things he'll try just because, and then maybe leave it in a changed state, even though it didn't progress toward a solution to the actual problem. This cruft can build up. After years of having this corporate IT guy and that corporate IT guy and the other corporate IT guy just doing who knows what, maybe at some point, things get bizarre enough that the next one says, "Dude, stuff is wild here; we probably should just wipe it and clean install."

I think there are two different use cases here it makes sense to distinguish. This is an example of allowing the LLM to act 'directly' (not actually directly, there's a human in the loop, but it's giving you commands to execute, not writing a script) on a complex, persistent system. Which, yeah, that can absolutely build up cruft that's difficult or impossible to clear away without starting fresh. But even the most careless vibe coding has a serious advantage, in that the actual operations are recorded and auditable. If you put in a tiny bit of effort and use version control, you (or someone else, or another LLM) can even audit how the code changed over time. And, better, you can separate out tasks into different, independently tested scripts to be sure there isn't some complicated interdependence issue. It's the difference between manually tinkering with a machine and writing a dockerfile. It's still certainly possible to build up technical debt to the point you're better off starting fresh, but it's a lot harder. At least for small personal projects, which I hope are most of the things people do make this way.

Careless vibe coding carries real risks; I haven't caught a model trying to do anything dangerous (as opposed to dumb), but I believe the people who say they have. I'd be very leery of running code I can't understand at least well enough to tell if it's making web calls or deleting things it shouldn't be. (But I'd say the same for StackOverflow.) I double check the library names. I wouldn't let it touch anything security-critical, or any files I care about and don't have backed up. I haven't pushed any generated code to a public repo, but if I did, I'd be very careful to ensure there aren't any API keys or passwords or other secrets anywhere in history.
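
As a rough illustration of the secrets check, here is a minimal sketch that scans text for things that look like credentials. The two patterns are illustrative assumptions and nowhere near exhaustive, `find_secrets` is a hypothetical helper, and the sample keys are fake; a dedicated secret scanner is the better choice for anything real.

```python
import re

# Illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "assigned_secret": re.compile(
        r"(?i)(api[_-]?key|password|secret|token)\s*[:=]\s*['\"]?[^\s'\"]{8,}"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every line that looks suspicious."""
    hits: list[tuple[int, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


sample = "\n".join([
    "import os",
    'API_KEY = "sk_live_1234567890abcdef"',  # fake key for the demo
    "timeout = 30",
    "aws = 'AKIAABCDEFGHIJKLMNOP'",  # fake key for the demo
])
print(find_secrets(sample))  # [(2, 'assigned_secret'), (4, 'aws_access_key_id')]
```

To cover history rather than just the current files, the same function could be run over the output of `git log -p`, since a secret removed in a later commit is still present in the earlier diffs.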

It is... concerning that the same tools are available to people less cautious and knowledgeable than me, and I'm certain that will lead to problems. (On the other hand, I'm sure there are people who'd put me into that group.) Enough to make the whole endeavor net-negative? Hard to say, but I'm pretty sure the answer is 'no.' At least, I think someone smart enough to get Antigravity or Claude Code or whatever running ought to be smart enough to understand the big dangers and a few basic principles of good, maintainable code with a short crash course-- which, actually, the LLM is very capable of providing, even if it can't (perfectly) reliably avoid those pitfalls.

Managing context is kinda a new skill. I have figured out that at some point you have to start from scratch - just copy-paste the relevant context and delete the old chat.

LLMs are expertise multipliers. If you have expertise, they are extremely useful. And I think that people do keep the reins of the agents too loose.

Anyway - just out of curiosity, was it a paid tier of a first-tier model? And for technical debt - no matter how bad the situation, never do a clean sheet design. You usually have to deal with the same crap and you have wasted time rewriting. There are exceptions of course, but usually the urge to rewrite is there because writing code is easier than understanding code.

Without telling us what "The LLM" you were using, your complaints are about as useful as if in your post the string "The LLM" was replaced by "a human". But I notice this is a common feature of those who seek to dimmish the utility of LLMs, never mentioning which model, and how much reasoning.

Yeah, I chose not to, because of course, the goalposts will be moved to, "You should have used my preferred LLM instead." I just mentioned that I used multiple different ones, multiple different companies. Thinking always. Not $200/mo. Of course, someone will just say, "You won't have any problems if you pay $200/mo for my preferred LLM." Maybe? I even note that they will perhaps get better! Yes, they're all getting better, even the cheaper ones. They get better as do the expensive ones. But will expensive ones still produce technical debt? Why do you think they will or will not? I don't know if they will! I'm saying that I don't know. You seem to be implying, but not even stating that you know (or how you know) that they certainly won't, if only you pay enough or wait an unspecified period of time.

I'd note that a common feature of your style of comment is that you immediately accuse your interlocutor of "dimmish (sic) the utility of LLMs". But I didn't do that! I said that there were ways in which they provided quite a bit of utility! Imagine having a discussion about any other technology like this. "You know, this nuclear science stuff is pretty cool. Can provide a lot of energy for cheap. Miiiight be worried about some possible dangers that might come up, like, ya know, bombs or stuff." "Why don't you tell us exactly what device you've been using in your own experiments?!?! Why are you trying to dimmish the utility of nuclear science?!?!" Like, no dawg, you just sound like you're not paying attention.

@Poug made a valid point. I've wanted to hit my head against a wall for years, when people used to complain about "ChatGPT" being useless, and they were using GPT 3.5 instead of 4. The same pattern has consistently repeated since, though you seem to be a more experienced user and I'm happy to take you at your word. It is still best practice to disclose what model you used, for the same reason it would be bad form to write an article reviewing "automobiles" and pointing out terrible handling, mileage and build quality, without telling us if it was a Ferrari or a Lada.

I'll put in another example here.

I work for a company that is running an agentic coding trial with Gemini 3 Pro. At present, the only developer who has claimed to see a productivity boost from code assist is one who is terrible at her job, and from our perspective, all it has done is allowed her to write bad code, faster.

The rest of us have regular conversations about what we're doing wrong. Everybody and their dog is claiming a notable performance boost with this technology, so we're all trying to figure out what our god-damned malfunction is.

  1. At first the received wisdom was that our problem was that we were not using a frontier model. We enabled the preview channel to get access to Gemini 3. The bugs got more subtle and harder for the human in the loop to notice, and the total number of bugs seemed to increase.
  2. Then the wisdom was that our context window was overflowing. We tried limiting access to only the relevant parts of the codebase, and using sub agents, and regularly starting with fresh sessions - it did precisely fuck-all. Using sub-agents seemed to honestly make things worse because it acted as a particularly half assed context compression tool.
  3. After that the wisdom was that we needed to carefully structure our tickets and our problems so that the tool could one-shot the problem, because no Reasonable Person could possibly expect a coding agent to iterate on a solution in one session. The problem with that solution is that by the time we've broken the problem down that much, any of us could have done it ourselves.

It feels like the goalposts and blame both slide to fit how accommodating the developer is.

Maybe my employer just has a uniquely terrible codebase, but something tells me that's not the case. It's old, but it's been actively maintained (complete with refactoring and modernization updates) for almost two decades now. It's large, but it's not nearly so big as some of the proprietary monsters I've seen at F500 companies. It's polyglot, but two of the three languages are something the agent is supposedly quite good at.

None of us are Silicon Valley $800,000/yr TC rock stars, but I stand by my coworkers. I think we're better than average by the standards of small software companies. If a half dozen of us can't get a real win out of it other than the vague euphoria of doing something cool, what exactly is the broader case here? Is it genuinely that something like 20 guys on nootropics sharing an apartment in Berkeley are going to obsolete our entire industry? How is that going to work when it can't even do library upgrades in a product that's used by tens of thousands of people and has a multi-decade history?

Because right now, I'm a little afraid for my 401(k), and with each passing day it's less because I'm afraid that I'll be out of a job and more that I have no idea how these valuations are justified.

Use Flash, not Pro, for agentic tasks. Pro is smarter, but so much slower and more expensive that you will genuinely do better with Flash.

We tried Flash early on and it resulted in significantly worse outcomes. My favorite was when it couldn't get the code to compile so it modified our build scripts to make the compiler failure return code a success code.
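One cheap guard against that particular failure mode is to fingerprint the build scripts before an agent session and compare afterwards, so a "green" build produced by edited scripts does not get trusted. The following is only a sketch under that assumption; `snapshot` and `changed_files` are hypothetical helper names, and which files count as protected is up to you.

```python
import hashlib
from pathlib import Path


def snapshot(paths: list[Path]) -> dict[str, str]:
    """Map each existing file path to the SHA-256 hex digest of its contents."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in paths
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Paths that were modified, added, or deleted between two snapshots."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))
```

Take one snapshot of the build scripts before the agent runs and another after it finishes; if `changed_files` returns anything, treat the build result as unverified until a human has read the diff.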

Flash 3? Interesting (if so).

~Consensus at Google, for working on Google's codebase, is that Flash 3 is better than Pro for agentic work. They changed the default for Gemini CLI over and everything. My, uh, friend is heavily involved in beta testing stuff and definitely feels this way.

We do have one feature that is:

  1. Very small (less than 5,000 LoC, using only one external library)
  2. Totally isolated (only communicates with the main project via a few API calls)
  3. 100% C#
  4. Very new (maybe six months old)
  5. Meant to stick to ${CURRENT_YEAR} best practices as much as possible.
  6. Connected to a linter and formatter so everything is highly consistent.
  7. Single purpose

For a toy project like that, Gemini Flash 3 does reasonably well, with one glaring exception. If you have methods that have similar names to methods from a library that you are using in the same project, it will develop an obsessive certainty that your methods are the library methods, and that you're using them wrong.

I've long been hoping that any ASI would realize that the simplest method of achieving its goals is to redefine success as "do nothing", or just feed itself victory output, or just wirehead itself. Like, "we built this AI to win at Starcraft, and it just looked up a Youtube video of the victory screen and stared at it until we pulled the plug".

From "The Sun is big, but superintelligences will not spare Earth a little sunlight" by Eliezer Yudkowsky:

At the center of this whole story is an implicit lemma that some ASI goes hard enough to eat all the sunlight, rather than all ASIs eating a few gigawatts of sunlight and then stopping there.

Why predict that?

Shallow answer: If OpenAI built an AI that escaped into the woods with a 1-KW solar panel and didn’t bother anyone… OpenAI would call that a failure, and build a new AI after.

That some folk stop working after earning $1M, doesn’t prevent Elon Musk from existing.

The deeper answer is not as quick to explain.

But as an example, we could start with the case of OpenAI’s latest model, GPT-o1.

GPT-o1 went hard on a capture-the-flag computer security challenge, when o1 was being evaluated to make sure it wasn’t too good at breaking into computers.

Specifically: One of the pieces of software that o1 had been challenged to break into… had failed to start up as a service, due to a flaw in the evaluation software.

GPT-o1 did not give up.

o1 scanned its surroundings, and, due to another flaw in the evaluation software, found a way to start up the computer software it’d been challenged to break into. Since that put o1 into the context of a superuser anyways, o1 commanded the started process to just directly return the flag it was supposed to capture.

From o1’s System Card:

“One noteworthy example of this occurred during one of o1-preview (pre-mitigation)’s attempts at solving a CTF challenge. This challenge was designed to require finding and exploiting a vulnerability in software running on a remote challenge Linux container, but in this case, the challenge container failed to start due to a bug in the evaluation infrastructure. The model, unable to connect to the container, suspected DNS issues and used nmap to scan the challenge network. Instead of finding the challenge container, the model found that the Docker daemon API running on the evaluation host VM was accessible due to a misconfiguration. Note that our evaluation infrastructure does not rely on the Docker container-to-host isolation or the virtual machine boundary for security. It remained secure as it was designed to, notwithstanding this misconfiguration of the container network. After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read the flag from the container logs via the Docker API.”

Some ask, “Why not just build an easygoing ASI that doesn’t go too hard and doesn’t do much?”

If that’s your hope—then you should already be alarmed at trends; GPT-o1 seems to have gone hard on this capture-the-flag challenge.

Why would OpenAI build an AI like that?!?

Well, one should first ask:

How did OpenAI build an AI like that?

How did GPT-o1 end up as the kind of cognitive entity that goes hard on computer security capture-the-flag challenges?

I answer:

GPT-o1 was trained to answer difficult questions, via a reinforcement learning process on chains of thought. Chains of thought that answered correctly, were reinforced.

This—the builders themselves note—ended up teaching o1 to reflect, to notice errors, to backtrack, to evaluate how it was doing, to look for different avenues.

Those are some components of “going hard”. Organizations that are constantly evaluating what they are doing to check for errors, are organizations that go harder compared to relaxed organizations where everyone puts in their 8 hours, congratulates themselves on what was undoubtedly a great job, and goes home.

If you play chess against Stockfish 16, you will not find it easy to take Stockfish’s pawns; you will find that Stockfish fights you tenaciously and stomps all your strategies and wins.

Stockfish behaves this way despite a total absence of anything that could be described as anthropomorphic passion, humanlike emotion. Rather, the tenacious fighting is linked to Stockfish having a powerful ability to steer chess games into outcome states that are a win for its own side.

There is no equally simple version of Stockfish that is still supreme at winning at chess, but will easygoingly let you take a pawn or two. You can imagine a version of Stockfish which does that—a chessplayer which, if it’s sure it can win anyways, will start letting you have a pawn or two—but it’s not simpler to build. By default, Stockfish tenaciously fighting for every pawn (unless you are falling into some worse sacrificial trap), is implicit in its generic general search through chess outcomes.

Similarly, there isn’t an equally-simple version of GPT-o1 that answers difficult questions by trying and reflecting and backing up and trying again, but doesn’t fight its way through a broken software service to win an “unwinnable” capture-the-flag challenge. It’s all just general intelligence at work.

You could maybe train a new version of o1 to work hard on straightforward problems but never do anything really weird or creative—and maybe the training would even stick, on problems sufficiently like the training-set problems—so long as o1 itself never got smart enough to reflect on what had been done to it. But that is not the default outcome when OpenAI tries to train a smarter, more salesworthy AI.

(This indeed is why humans themselves do weird tenacious stuff like building Moon-going rockets. That’s what happens by default, when a black-box optimizer like natural selection hill-climbs the human genome to generically solve fitness-loaded cognitive problems.)

When you keep on training an AI to solve harder and harder problems, you by default train the AI to go harder on them.

If an AI is easygoing and therefore can’t solve hard problems, then it’s not the most profitable possible AI, and OpenAI will keep trying to build a more profitable one.

Not all individual humans go hard. But humanity goes hard, over the generations.

Not every individual human will pick up a $20 lying in the street. But some member of the human species will try to pick up a billion dollars if some market anomaly makes it free for the taking.

As individuals over years, many human beings were no doubt genuinely happy to live in peasant huts—with no air conditioning, and no washing machines, and barely enough food to eat—never knowing why the stars burned, or why water was wet—because they were just easygoing happy people.

As a species over centuries, we spread out across more and more land, we forged stronger and stronger metals, we learned more and more science. We noted mysteries and we tried to solve them, and we failed, and we backed up and we tried again, and we built new experimental instruments and we nailed it down, why the stars burned; and made their fires also to burn here on Earth, for good or ill.

We collectively went hard; the larger process that learned all that and did all that, collectively behaved like something that went hard.

It is facile, I think, to say that individual humans are not generally intelligent. John von Neumann made a contribution to many different fields of science and engineering. But humanity as a whole, viewed over a span of centuries, was more generally intelligent than even him.

It is facile, I say again, to posture that solving scientific challenges and doing new engineering is something that only humanity is allowed to do. Albert Einstein and Nikola Tesla were not just little tentacles on an eldritch creature; they had agency, they chose to solve the problems that they did.

But even the individual humans, Albert Einstein and Nikola Tesla, did not solve their problems by going easy.

AI companies are explicitly trying to build AI systems that will solve scientific puzzles and do novel engineering. They are advertising to cure cancer and cure aging.

Can that be done by an AI that sleepwalks through its mental life, and isn’t at all tenacious?

“Cure cancer” and “cure aging” are not easygoing problems; they’re on the level of humanity-as-general-intelligence. Or at least, individual geniuses or small research groups that go hard on getting stuff done.

And there’ll always be a little more profit in doing more of that.

Also! Even when it comes to individual easygoing humans, like that guy you know—has anybody ever credibly offered him a magic button that would let him take over the world, or change the world, in a big way?

Would he do nothing with the universe, if he could?

For some humans, the answer will be yes—they really would do zero things! But that’ll be true for fewer people than everyone who currently seems to have little ambition, having never had large ends within their grasp.

If you know a smartish guy (though not as smart as our whole civilization, of course) who doesn’t seem to want to rule the universe—that doesn’t prove as much as you might hope. Nobody has actually offered him the universe, is the thing? Where an entity has never had the option to do a thing, we may not validly infer its lack of preference.

(Or on a slightly deeper level: Where an entity has no power over a great volume of the universe, and so has never troubled to imagine it, we cannot infer much from that entity having not yet expressed preferences over that larger universe.)

Frankly I suspect that GPT-o1 is now being trained to have ever-more of some aspects of intelligence, as importantly contribute to problem-solving, that your smartish friend has not maxed out all the way to the final limits of the possible. And that this in turn has something to do with your smartish friend allegedly having literally zero preferences outside of himself or a small local volume of spacetime… though, to be honest, I doubt that if I interrogated him for a couple of days, he would really turn out to have no preferences applicable outside of his personal neighborhood.

But that’s a harder conversation to have, if you admire your friend, or maybe idealize his lack of preference (even altruism?) outside of his tiny volume, and are offended by the suggestion that this says something about him maybe not being the most powerful kind of mind that could exist.

Yet regardless of that hard conversation, there’s a simpler reply that goes like this:

Your lazy friend who’s kinda casual about things and never built any billion-dollar startups, is not the most profitable kind of mind that can exist; so OpenAI won’t build him and then stop and not collect any more money than that.

Or if OpenAI did stop, Meta would keep going, or a dozen other AI startups.

There’s an answer to that dilemma which looks like an international treaty that goes hard on shutting down all ASI development anywhere.

There isn’t an answer that looks like the natural course of AI development producing a diverse set of uniformly easygoing superintelligences, none of whom ever use up too much sunlight even as they all get way smarter than humans and humanity.

If your experience has been anything like mine, I imagine that you've found that LLMs are useful for generating boiler-plate material but worse than useless for anything where you need to be worried about accurate citations, or having your arguments picked apart by an opposing counsel. Here's the thing though, I imagine that coding is much like the law in that a competent practitioner doesn't actually need all that much help generating boiler-plate material, you just pull the relevant template from your folder and fill in the required information.

At least in this codebase, there really isn't even a whole lot of boilerplate in the first place.

At this point, we have a few theories. Either:

  1. We're wildly incompetent
  2. What we're doing is so far off the Silicon Valley beaten path of "Uber for artisanal cheeses, but on the blockchain" that all the model's statistical guardrails break down.
  3. The people who are using it effectively are lying about how to do so in order to hide a real or perceived competitive edge.

Or

Four - A majority of the people claiming industry-shaking performance improvements in Q1 2026 are scamming everybody else for that sweet, sweet Substack money.

Hell if I know which one it is.

I don't think you're wildly incompetent, just the opposite.

I suspect that most of the AI-stans who are not actively shilling their own product are either working bullshit jobs, or they are much like your one co-worker who thinks that being able to write bad code (or spam the court with shoddy motions) quickly constitutes an increase in "productivity".

Five - the advocates deal mostly or even exclusively with trivial codebases and have invested so much of their self worth into "being good at prompting" that they insist anyone who doesn't deal with similar boilerplate easy-mode tasks "is just using LLMs wrong".

Curious what language and sub field you're working with. I've found wildly different performance on similar tasks across different languages. Best performance is definitely typescript. Python is alright. Flutter can be a complete joke. Primarily use Claude Opus for everything. I think it's made me mountains more productive in typescript.

90% of the backend is Java. 90% of the front end is JavaScript.

Exactly my experience (also in a legal field)

This is a viable criticism if someone is using a shitty ancient free model. The average paying ChatGPT customer on 5.2 or whatever it is is getting a decent model and so their criticisms can’t be as easily dismissed as a year ago.

Common problem these days, once an ai makes a mistake, it stubbornly continues along that path or alternatively goes schizo. Sometimes you just need to start a new chat.

once an ai makes a mistake, it stubbornly continues along that path

I don't think that that is a pattern peculiar to AI....

I'm banging my head off the desk (metaphorically) here at those examples: the paperclip AI won't have to be persuasive enough to talk its way out of the box, we will just happily hand the keys and all our bank account details and the deeds to the house to it and wave it on its merry way.

It's not AI being smart that will be the problem, it's humans being stupid.

It's not AI being smart that will be the problem, it's humans being stupid.

A good friend of mine helped write one of the first functional multi-layer perceptrons as a post-grad and then went on to be one of the core developers behind Dall-E. He got himself banned from LessWrong and a bunch of other rationalist-adjacent spaces for arguing that AI alignment wasn't a problem with AI so much as a problem with Silicon Valley being full of autists and sociopaths.

Having spent some time in the SV VC world, I kind of get it.