This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

I haven't posted too much about AI on here, largely because my own personal experiences with using it have been boring and underwhelming. Generating offensive memes (9/11 gender reveal, racial stereotypes, etc.) is my most positive interaction with AI. And partly because I find the pro-AI "AGI is just around the corner bro!" crowd obnoxious as hell, and I find that most discussions about it depend on accepting certain massive assumptions about what we actually do (and don't) know about the nature of intelligence, consciousness, the human brain, etc. For the purposes of making my biases clear up front: Personally, I'm religious and believe in the existence of a human spirit/soul, so I'm already strongly biased against claims that consciousness is an emergent property of sufficiently advanced systems or any arguments along those lines.
Regardless, a few developments have happened recently that have motivated me enough to actually make a top-level post about this. The first is my (employer-mandated) use of Claude to generate code. "You're not using the latest model, just one more model and we'll reach AGI"-bros officially in shambles after this one. I have an HTTP API client library I wrote a few years ago for interacting with a 3rd party API. There's a good amount of duplicate logic throughout for things like setting up and making the requests, caching, etc. I asked Claude to look over the code and extract the duplicate logic into a single implementation. Here's how it messed up just the authentication part of it:
This was with the latest version of Claude Sonnet. We don't have access to the latest version of Opus, but I'm sure an AI-bro on here would insist that Opus would totally get this right. Regardless, it failed spectacularly at what would be an easy (but tedious) task for a mid-level developer and above (or a sufficiently talented junior).
The second happening is the ARC prize people releasing version 3 of their AGI test suite, a series of puzzle games. They released it within a few hours of Jensen Huang saying he thinks the latest and greatest models are capable of AGI. Humans were capable of solving 100% of the puzzles. The highest scoring AI couldn't complete more than 0.5%.
I'm willing to bet future models will do at least somewhat better on this, but only because I'm maximally cynical and I fully expect these puzzles to be included in the training set for future SOTA models.
I tried several of the puzzles myself, and none of them are terribly difficult. I'd estimate that anyone in the 100-110 IQ range or higher would be able to solve most or all of them. This development has further reinforced my belief that LLMs are basically just really advanced statistical regression models on crack, but nothing approaching what we would consider actual intelligence or conscious thought (and this is before we get into Chinese Room style criticisms of them).
In any case, I'm curious to see what you all think of these. Even the AI-bros I've been speaking about condescendingly throughout this post. If anything, I'm actually most curious about and interested in the AI-bros' responses. I'd love to hear your thoughts.
Here are the AGI puzzles for anyone interested in trying them out: https://arcprize.org/arc-agi/3
The proliferation of models and harnesses, multiplied by individual work styles, preferences and use cases, creates an exponentially large space, making it a futile endeavor to diagnose the reason for your experience (Opus vs Sonnet, Claude vs GPT, Claude Code vs Codex? Tool use configuration? Sun spots?) and give any advice. And besides, why engage in big-picture futuristic forecasting? Frontier labs should shill their product on their own dime, and their thesis will be proven right or wrong soon enough.
There's a clear object-level flaw in your writeup, however, and ironically it's the exact sort of confident slop we've come to associate with LLMs when they come short of the standard of human reasoning over novel context. This isn't to dunk: the standard, see, is very high, humans often need conscious effort to match it. That models can ever touch it is miraculous enough.
I mean this part:
You've played the games and you've thought of making it an argument, but you weren't curious enough to read up on the actual scoring rule. It's contentious enough that Chollet has to make excuses on Hacker News.
To be clear:
So if the human baseline is 10 actions:
A score of 0.5% (0.005) does not mean that the best AI only solves 1/200th of the problems. The same score is achieved by it being 14.5 times less sample-efficient. But should we care? How much of an opportunity cost comes with wasted AI samples? A white collar professional in the US earns a Claude Max (x10) subscription in 1-2 hours; Claude Max will generate ≈2 OOMs more tokens in a month than said professional can; even if they're 1000 times less useful per token, that's a massive bargain. We already routinely afford AI that's this inefficient. It'll be more efficient soon, though.
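To make that arithmetic concrete, here is a minimal sketch. It assumes, as the 14.5x figure implies, that a level's score is the squared ratio of human baseline actions to AI actions, capped at 1. The function name `level_score` and the capping are assumptions for illustration, not the official ARC Prize implementation.

```python
def level_score(human_actions: int, ai_actions: int) -> float:
    """Assumed rule: squared action-efficiency ratio, capped at 1.0."""
    if human_actions <= 0 or ai_actions <= 0:
        raise ValueError("action counts must be positive")
    ratio = human_actions / ai_actions
    return min(1.0, ratio ** 2)


human_baseline = 10  # the 10-action human baseline from the example above

# Print the assumed score for progressively less sample-efficient solvers.
for ai_actions in (10, 20, 100, 145):
    score = level_score(human_baseline, ai_actions)
    print(f"{ai_actions} actions: {score:.2%}")
```

Under that assumption, a solver that finishes every level but needs 145 actions where a human needs 10 lands just under 0.5%, which is a very different failure from solving 1 puzzle in 200.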
Which is not to say it'll be cheaper. Consider that car rentals in American cities go for roughly one monthly Claude Max subscription a day. Sure, the US is a tough place and a car preserves you from being stabbed in the neck on public transport, but we can quantify the millimorts and assign a cost to them; after that, does a car for a day provide as much economic value as a fully exploited Claude Max for a month – tens of millions of Opus-grade output tokens? Seeing how fast OpenAI and especially Anthropic revenues have been growing, what do you think will be their asking price when all dumping from also-rans is rendered irrelevant?
Right now the cost of tokens is suppressed by the lingering user base acquisition phase, hardware gains, rapid competitive model churn and, more importantly, by the threat of cheap open weights models, mostly Chinese, increasingly Nvidia. Should those fall far enough behind, together with other minor competitors, we enter the territory of a Frontier Cartel, with $1000/month subscriptions as the baseline expectation. (This, fyi, is implicitly Dario Amodei's theory of victory – see him invoking Cournot equilibrium on Dwarkesh's podcast.) I pray we don't. But people would pay for it. These systems aren't a joking matter; being shut out of them will be quite literally existentially threatening for many businesses soon enough.
The best model+harness scores 36%, by the way. But I'm more impressed by the 12.58% scored by a 4-layer CNN called StochasticGoose. Read this piece, it contains some pretty neat analysis.
Chollet's evals are neat too, but he's pushing a narrative against machine superintelligence from within the deep learning paradigm, and he's getting embarrassingly biased, with more and more abstract justifications for denying what looks like inevitability.
Out of curiosity, what's the denominator? I'd be a lot more impressed if it missed 4/100 duplications than if it missed 4/5 of them.
About 20 duplications
Did you run it again on the newer, better code base, to see if it could catch another 75% of the duplications? And then again for a 75% chance to catch the last one?
...
Come on dude, you can't be serious.
I honestly have tired of retreading this conversation, the skeptics are immune to evidence. Even the people on my team who are enthusiastic about AI are configuring and using it poorly. The difference between prepping the environments, having agents go through and document the code base before running an orchestrated agent with dedicated subagents and just yolo throwing an agent at it - and I half expect you didn't even run the damn thing in agent mode - with a half cocked vague prompt practically hoping it fails so that you can own the ai bros is massive. You don't want it to work so it won't. I know this isn't convincing to you, I've tried being convincing to you guys, even after you're out of a job you won't be convinced. It's pointless.
edit: slightly less run on sentence(slightly)
While I do align with you in that I consider the current models very powerful and use them plenty myself, and that using some Sonnet + Cline workflow while claiming that AI is incapable is misleading, I do find this sort of crypto-style FOMO inducing rhetoric counter-productive and annoying.
If you believe that the models will usher in the end of history, that they really do end up as AGI, ASI, ushering in the singularity then no amount of using 2026 agents at work will do anything to save you or change the outcome.
On the other hand, in worlds where the models do plateau at some point and end up being commoditized enterprise tooling, nobody is doomed because they didn't use agents correctly in 2026; even boosters have very little consensus on what actually works right now. There will be time to adopt the tooling as capabilities are better understood, the UX will get better, and people will develop best practices and discard what doesn't work or what is no longer necessary; who's still using LangChain or fine-tuning LoRAs on hands in 2026?
Today was our biweekly demo day. One of our engineers showed their work in putting together a harness for aiding in translating a WinForms app to a REST site + web front end. They built an orchestrator, an array of custom agents designed to handle the specifics of our environment, custom skills for understanding and interacting with our db, etc. etc., dozens of files. And as they walked through their prompting it became clear to me that they never actually invoke the orchestrator, so they were in fact just using the vanilla agent.
It's obvious to me that we're like a couple more releases from all this work not being necessary; the future tools will simply, as a matter of course, customize themselves to the environments they're exposed to. But there is as of now an art form to getting really good results from current models. I'll say the most important concept is something like optimizing for "context density": you have ~1 million tokens to work with, but every marginal token long before that degrades performance, while relevant context improves it. So you need to balance it out, using sub agents to offload discrete tasks and provide maximally dense reports. I have oracle agents that simply return true or false, a single line, or a full report depending on what is asked of them. Of course this works even better if you can have the thing go through and summarize your code ahead of time with the intention of minimizing the guess and check nature of looking through your repo; every wrong search pollutes context. Or you can just poorly write out an ambiguous two lines, throw the vanilla nerfed agent in the deep end, have it go through 5 iterations of "compacting context" (i.e. throwing away important bits of information because you've hit the hard limit) and get back a subpar response, then laugh about how dumb these hype monkeys are. Maybe they're just so terrible at coding that this half assed demonstration is impressive to them?
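As a rough picture of the oracle idea, and not a description of any particular harness: the sketch below wraps a sub-agent call so that only a compact answer comes back to the main context. Everything here (`AnswerMode`, `ask_oracle`, `fake_model`, the instruction strings) is hypothetical, and `fake_model` is a stand-in for whatever sub-agent call your tooling actually provides.

```python
from enum import Enum
from typing import Callable, Union


class AnswerMode(Enum):
    BOOL = "bool"      # reply with only true or false
    LINE = "line"      # reply with a single line
    REPORT = "report"  # reply with a full, dense report


# Hypothetical instruction text per mode; tune for your own setup.
INSTRUCTIONS = {
    AnswerMode.BOOL: "Answer with exactly one word: true or false.",
    AnswerMode.LINE: "Answer in a single line.",
    AnswerMode.REPORT: "Answer with a dense report; omit anything irrelevant.",
}


def ask_oracle(
    model: Callable[[str], str], question: str, mode: AnswerMode
) -> Union[bool, str]:
    """Send the question to a sub-agent and return only the compact answer."""
    prompt = f"{INSTRUCTIONS[mode]}\n\nQuestion: {question}"
    raw = model(prompt).strip()
    if mode is AnswerMode.BOOL:
        return raw.lower().startswith("true")
    if mode is AnswerMode.LINE:
        return raw.splitlines()[0] if raw else ""
    return raw


def fake_model(prompt: str) -> str:
    """Stand-in for a real sub-agent call; ignores instructions and rambles."""
    return "True. The schema has a users table.\nIt also has orders and invoices."


print(ask_oracle(fake_model, "Is there a users table?", AnswerMode.BOOL))
print(ask_oracle(fake_model, "Is there a users table?", AnswerMode.LINE))
```

The point of the wrapper is that the sub-agent can burn as many tokens as it likes on searching; the caller's context only ever receives the boolean, the line, or the report.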
Whatever dude, enjoy 6 more months of ignorance before it's impossible to deny. You'll deny it anyway; not my problem, I tried to warn you.
What software/framework are you using for creating/orchestrating agent systems? I tried smolagents for a while but my results were clearly less good than just formatting my query nicely and giving it to a normal LLM.
Enough people I trust are getting good results with agents that I want to look into it more, but would very much appreciate any advice you can give.
At work we're stuck with VS Code/GitHub Copilot, which is not ideal but allows customizing/spawning subagents and tool calls, which is the big requirement. Agents themselves aren't a big deal, they're essentially just custom prompts; they become important when created as sub agents to isolate tasks to narrow contexts. There's some customization to your own env you should do, but you can just start by asking Claude/GPT to spin up some basic ones that you can tweak over time. Basically any time I notice it get stuck or need to guess and check a bunch, I have it make some tweaks to the documentation. I'm cooking up a method to automate that process. Where I'd spend my time first is making some guide files for your code base. Putting together some easily greppable documents outlining our database schema greatly improved its ability to interact with it.
The real LLM psychosis is how insane interacting with genpop over AI is.
It's revolutionizing my ability to code, as in, I could not code (still can't) and yet I have a small and growing fleet of tiny software tools I always wished existed
It has 100% replaced google for me, and I haven't spent an evening trying to figure out what the best toaster/winter coat/part for my car is since late 2024, because I can now get an answer in 5-35 minutes. Maybe some of that is just cutting down analysis paralysis, but it's not like I've lowered my standards
It's materially improved my productivity at work (not a coder) and I know there are significant productivity gains on the table due to extremely slow corporate adoption (no agents allowed) and the fact I just haven't set aside a week to figure out jankier "agents at home" processes bc I'm hoping IT just comes to their senses.
Then you have people still in 2026 who genuinely think AI will "go away" or at least it's a passing fad like VR that some will enjoy and most will ignore.
Even at work, where the initial gains are obvious and easy to capture, I speak to managers who can barely get their teams to open ChatGPT/Copilot (hard to blame them for not wanting to use copilot) at all?!? I don't actually hate this, because all the people who don't adopt AI buy me ~6 extra months of employment before I'm replaced too, but it's genuinely making me feel insane.
I'm watching a technological revolution that's going to change every part of the adult life I thought I'd have, and so many people just.... don't see it? Insane.
Note that both of those can be true at the same time. Total cumulative AI capex will probably cross $2 trillion this year, and cumulative opex is on the same order again. And that's just in the US.
If this "technological revolution" doesn't end up replacing a significant percentage of national labor costs, it actually might fade away - and the only thing remaining will be whatever open weight models can be run on cheap hardware at that point in time. And if the Chinese keep releasing last year's SOTA for free, none of the envisioned business models might hold water.
Either way, there's really only one way costs can go from here. If business AI doesn't go away, those of us still with jobs will get to work with agents costing our employers on the same order as the people they replaced.
Inference profit margins are pretty healthy. If all AI development stopped today, we'd end up with some very profitable AI providers once the dust settled
The AI Labs could all be happy fat tech companies if they just became inference providers.
It's the training, and the capex required to support ever larger training runs, that is the massive money sink.
This is 100% the future. I am very much on this path today. That being said, with token prices over time, the token spend might end up being a lot cheaper than the salary spend it replaced. Although if token prices fall I might just end up using 100x more lol
Yes, but we aren't quite there yet. Not even close, in my opinion, at least when we're talking about serious job displacement. Unless there's a phase change, we're looking at years of more insane capex and opex.
And those trillions will need to be paid back with interest. We're not talking about a Netflix or Office 365 licence that every office drone just has. For millions of workers, access to those tools will rival transportation and housing in ongoing cost.
No problem, if your employer already has half the staff on SolidEdge/Ansys/etc. licenses and generally does not care what toolboxes everybody gets. For the rest? Small business, low productivity labor, labor limited by hardware throughput (classic example: radiology)? They won't really contribute to paying off that debt, so they won't get a lot of tokens, and none from the good models.
We've gotten used to tech, especially software, being cheap. For the current economics to make any sense, this will come to a hard stop. On the cost side, AI is much more like an excavator than like a shovel, and it really needs to replace just as many workers to make sense building them.
And I can well imagine this never happening. Maybe they'll never get reliable enough for that much unsupervised work, especially work you can't write unit tests for.
I should have said "post bankruptcy, the AI Labs could be happy fat inference providers." I actually think the current SOTA models, if we worked on improving harnesses, etc. a lot, could still be a huge change on their own. Obviously smarter model = better, but just what we have now with a ton of scaffolding can do a lot.
But yes, the amount of capex they have all spent means they now need something way better than "excel helper" to pay it back. But in a non-hyperscale world, LLMs as a normal technology could be a profitable medium/large SaaS industry.
Claude Sonnet 4.6 is the latest model though
It's not known for sure, the labs do not say, but it's very likely that Sonnet 4.6 is a distillation of Opus 4.6, essentially training a weaker model on the signal of the stronger model. What is definitely known is that Sonnet 4.6 is considerably weaker than Opus 4.6; given that Opus tokens cost 5x as much as Sonnet tokens, this should be clear. Sonnet is great for rote work, my admin agent that handles git and Jira calls uses Sonnet, but I'd never use it for core work.
This is a very uncharitable response, c’mon, you know that. Sonnet 4.6 is a cheaper, much smaller model that was released 12 days after Opus 4.6, of course it’s going to be worse.
I'm not a programmer, but as far as creative writing goes, I have noticed that the Opus models are much stronger than the Sonnet line. I would not draw any conclusions about the SOTA from trying out a Sonnet model.
Cool game. I made it to level 5, but then the square stopped answering to my commands. Is this a glitch, or part of the game?
I'm guessing it's a glitch, I was able to complete all 7 levels
I can confirm this is true in non-code scenarios too. I was not super diligent, but I did some of my own testing last month and every AI liked the GPT-5.x answer almost every time.
Did he evaluate any of them himself?
No.
>I totally used the latest version and it still sucked!!
>Look inside
>Not the latest version
I'm not even gonna claim Opus would make a huge difference because the differences are quite small at this point, but fuck me you would think you might have a little humility when making such an emphatic claim, just to contradict yourself within a paragraph.
> uses a non thinking model
Is this bait?
Sonnet has thinking
Oh neat, I thought Opus/Sonnet was the same as GPT's thinking/instant.
Opus is like GPT, Sonnet is like GPT-mini, Haiku is like GPT-nano (although this size is largely defunct).
Sonnet 4.6 is actually newer than the latest version of Opus, it came out a week or two later. So no, I didn't contradict myself. And Sonnet's training cutoff is about 6 months later than Opus's.
Newer =/= better. It's literally a smaller and cheaper model than Opus.
Thanks for the link with the puzzles, I tried a few, and while I think I found the solution to two random ones I picked, I took some minutes for one of them, certainly among the three most brain-straining tasks I did today. (One of the others involved realizing that you can not mirror the pinout of a two-row 1.27mm connector by turning it 180 degrees, so take my assessment with sufficient sodium chloride.)
I mean, you could probably make an IQ 100 human solve them if you gave them an hour, threatened them with death and gave them enough ketamine to suppress the panic, but due to legal constraints you will get less mileage out of your employees on most workdays.
Basically, I concur that present models are not AGI, but I am much less certain that the median white-collar worker has much of a moat. If LLMs come for my job in two years, the fact that this proves that my job did not require general intelligence will be of little solace.
I’m no AI bro, but Opus 4.6 is genuinely really good and I’m concerned about my skills atrophying because I’m becoming highly reliant on it.
How did you use it? It makes a big difference if you use /init in Claude Code, followed by /plan where you describe what you want, and give it the ability to compile/run/test the code in a feedback loop.
I used the Cline plugin for JetBrains Rider, told it the path to the class, and asked it to extract out the duplicated logic throughout the class. So I could try again tomorrow with what you suggest (if that's even possible via the Cline plugin with Sonnet) and see if I get better results.
But my favorite part was when the AI assured me that its changes would do what I asked for without breaking existing functionality.
That said, the best and worst part of this is that it is pushing my incompetent Indian coworkers to use VS Code instead of VS because there is no Cline support in full fat VS and Cline is what the higher ups are mandating (I have a personal JetBrains subscription so that's what I'm using). .NET support in VS Code, especially for the many .NET 4.x projects we have kicking around (especially our WCF trash fires), is very lacking so I foresee my Indian coworkers' already pitiful productivity plummeting even further. I also foresee many requests for assistance coming to me and my Russian coworker (unsurprisingly he's the only other competent dev I work with regularly).
While I haven’t tried Cline, I know that Claude works best in Claude Code and Anthropic is trying its best (leaks aside) to have people locked into its ecosystem.
You won’t get a good picture of the capabilities of AI agents until you’ve tried the top models in a decent harness, unfortunately.
You're assuming he wants a good picture of the capabilities of AI agents. I get the strong impression from the sneering tone of the original post that he wanted to do just enough with a model that he could claim to have pwned the "AI-bros".
So I don't know if the OP was motivated by that, or if there's some other reason, but I've definitely noticed what seems like a big dichotomy in the way people approach modern generative AI tools. Which is that, some types of people see a tool with its limitations that make it fail in spectacular ways that seem silly or stupid, throw their hands up in the air and declare it as not sufficiently useful for their purposes. Other types of people see a tool with its limited abilities and figure out a way to exploit their abilities to accomplish things they couldn't without the tool, even if it means adjusting and inventing new workflows.
I first noticed this when I got heavily into Stable Diffusion in ye olden dayes of 2022. Of course, awful hands, foreground lines merging into background lines, inconsistent lighting, hallucinations, were all famous issues of image generation AI then. They're still issues now, but vastly reduced. Some people saw that and declared AI useless for their needs, since their hand drawing allows for the control they need that AI doesn't. Other people saw that generating messes with 7 fingers was like making one bad brush stroke on an empty canvas and giving up on the painting, and figured out that it's easy to iterate on subsections of the image multiple times, allowing someone to create illustrations that are far beyond their manual ability while still avoiding the common AI pitfalls.
I noticed it happening with LLMs shortly after, where some people zero in on stupid mistakes like the hard R problem of strawberries and declare it too inconsistent or too stupid to be of much use. Other people zero in on the limited abilities and figure out how to build structures and scaffolding to allow the tool to exceed those natural limitations, enabling them to create code that they couldn't have before or that would have taken a lot more time before.
I don't think the former type of person is doing this in bad faith, or with a desire to sneer. I think there's probably just a spectrum in people's attitudes with something like this, and because AI is both ridiculously bad at some things and ridiculously good at others, this causes the spectrum to bifurcate.
I think there's a combination of:
Metaphysics aside, it has been blindingly obvious for a long time that LLMs do not have intelligence or reasoning ability. Look at tests like "how many R's are in strawberry", which could be passed by even the stupidest human as long as he had enough intelligence to have learned the alphabet. But LLMs fell flat on their face. And that's not the only instance; this stuff keeps happening. Whether or not LLMs are useful (I personally do not find them useful, as I've said in previous comments), they are most certainly not intelligent.
It seems to me that when people say things along the lines of "LLMs do not have intelligence", their definition of intelligence is something like "everything a human can do", and thus failing at something that can be done by a human proves a lack of intelligence. But in fact human intelligence is very jagged as well!
Should a chimp consider a human unintelligent because of our woefully inferior working memory?
Should a fly consider a human unintelligent because of our woefully inferior visual processing speed?
Should a squirrel consider a human unintelligent because of our woefully inferior spatial memory?
Sure, LLMs fail very basic things that can be done by humans, but humans also fail very basic things that can be done by LLMs; no human alive can write about the same breadth of abstract, novel topics in the same number of languages as even a very weak LLM, or write code as quickly as an LLM.
I fail to see how an LLM isn't intelligent in a way orthogonal to humans, in the same ways that animals are intelligent orthogonally to humans.
Yes, the bar here is "what humans can do". You're welcome to set the bar somewhere else if you like, but that is what I think is the appropriate bar to set. Humans are the apex of creatures in this world, and it just plain makes sense to me to compare our invention to us.
Mild agree. They aren't very creative. They definitely don't actually understand the world and are just """pretending""" to understand, which gets you most of the way there but isn't the same.
They clearly have this, what would you call what a "reasoning model" does?
Horrible selection given this is due to how tokens work. The carwash riddle that made the rounds recently is a way better example.
No, the car wash example is much worse. Rs in strawberry is something anyone who can spell can count out. The car wash riddle is something most intelligent people should catch if told to look out for it, but many of them, let alone people of average or below-average intelligence, will still stumble over it and fail because they skim the question or don’t actually catch the ‘trick’. It’s like a lot of classic riddles in that way.
Are you asserting that realizing that washing a car at the car wash requires a car is a trick? I asked half a dozen people that I know from a few different venues, from a bright high schooler, to a college professor, to a water heater installer, and they all looked at me like I was retarded for even asking the question.
I asked some people in my office and a couple got it wrong. I’ve met even smart people who miss basic gotcha riddles like this.
Huh.
Maybe it's a rural/urban divide. How many of them own cars and drive regularly?
Probably none, but most people here have still driven at some point, and it’s not like they lack knowledge of what a car is.
LLMs tend to rationalize a suggested conclusion based on the context you provide them. I fully agree that they don’t independently reason their way to one. It’s like asking a high school dropout who smokes weed all day to scour the Internet and do research for you on a given topic. He’ll do that, and often do it poorly. At least that’s what the research seems to suggest. They often can’t distinguish between good sources and bad ones and they tend to omit a lot of information. But the hallucination problem makes them effectively worthless for anything that should be deemed critical or important IMO.
This is a bad example. The issue there was with the tokenizer, not the logical process - it's an equivalent failure of cognition to "Paris in the the spring". I'm not refuting your argument per se, just that one point.
Yes. I imagine that 'write python code to count the number of times the letter "r" appears in the word "strawberry"' is easily within the reach of current LLMs.
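For reference, the script in question is only a few lines. A minimal sketch of what such code might look like (the function name `count_letter` is just an illustrative choice, not something any particular model produced):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # prints 3
```

The point is that the counting happens over characters rather than tokens, which sidesteps the tokenizer issue entirely.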
A better example would be "Is the pool of the Titanic full or empty?", which is easily answerable by any five-year-old who has ever played with a plastic ship in a bathtub, but which LLMs did badly on because they did not have the visual intuition of a sunken ship.
Gemini 3.1 Pro answered this question without issue, along with any variation (sink, bathtub, substituting the Titanic for an arbitrary cruise ship).
My question was not original; in fact, I might have seen it months or even a year ago, when it went viral because then-current models failed to answer it correctly.
So there might be three possible explanations:
(1) Models just got better and can solve this now.
(2) It appeared widely in the training data so models know how to answer it.
(3) AI companies explicitly patched their models to correctly answer that question (just like they might fix jailbreaks or outrage bait).
For all I know it could definitely be (1).
Amusingly, Gemini 3 fast wrote and executed a Python script to be sure.
One thing I've found them remarkably useful for is "rubber ducking". It's nice to put my thoughts down in a single place and get quasi-related responses that might help me think about things.
I'd really love it if Claude Code or Codex had a "rubber duck" mode that reasoned about the code and my own thoughts together, without an implicit expectation that it would be modifying the code. I'm surprised none of the harnesses have something like that yet.
Claude has plan mode and Codex has a worse version of it.
Codex also has docs about setting up execplans which are hella useful
That's not quite what I'm looking for. Ideally, I'd like something that watches my typing in both my IDE and a chat window and provides feedback and commentary as I work.
I would also like this
SOTA tokens gotta get WAY cheaper before this works though
Pictures (screenshots) are so expensive
There's no need for visual screenshots. Claude Code (or Codex, or whatever) could feed the model keystrokes, or diffs, or some other text-like data. It wouldn't be general-purpose, but it could cover both the IDE and the chat window.
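To make the "diffs" option concrete, here is a rough sketch of how two snapshots of an editor buffer could be turned into a small text payload using Python's standard `difflib`. The function name `text_diff` and the snapshot contents are made up for illustration; this is not how Claude Code or Codex actually work.

```python
import difflib


def text_diff(before: str, after: str, name: str = "buffer") -> str:
    """Return a unified diff between two snapshots of an editor buffer."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical snapshots taken a few keystrokes apart.
old = "def add(a, b):\n    return a - b\n"
new = "def add(a, b):\n    return a + b\n"
print(text_diff(old, new))
```

A diff like this is a handful of tokens, compared to the cost of a screenshot of the same edit.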
I second the use of them for rubber ducking. I extensively use the ChatGPT 5.4 Thinking model for putting my thoughts down and then having it mostly organize them into something more coherent, relating them to other things, or using them to find literature on arxiv to peruse along my current thoughts.
Personally, I like the epithet "glorified Markov chain" (1 2).
I prefer "jumped up matrix multipliers" myself.
It's okay, you're allowed to call them "clankers" here, yes, even with a hard R.
As long as you don't ask how many hard R's there are in "clanker".
Well then I'm officially dumber than the AI because I hadn't a clue what to do. I've never played any puzzles like these and the retro arcade game style completely confused me because, again, never played games like that. I'm assuming I'm supposed to move the square to the exit, but who the hell knows?
A lot of these puzzles are indeed culture dependent, the culture being "yeah I grew up playing video games, console games, etc."
Figuring out the goal and the controls is part of the test.
There are puzzles which are non-arcade style, e.g. this. Basically, you have an input grid of colored squares which gets transformed to an output grid. You get a few examples and then have to deduce the rule and solve one.
I believe your link goes to a problem from the arc-agi-1 dataset, not the arc-agi-3. The former is basically "solved" at this point.
For the first puzzle, you need to change the symbol to match the one at the "exit" of the level, then move to the exit. The first level only has a way to change the rotation; later levels let you change the color and the symbol itself.
My company also uses Claude Code, in my case including Opus. It is quite useful: if I give it a feature and then, step by step, have it create the tests to verify the feature is complete and write code based on those tests, it's pretty helpful. At the same time, I need to review all the code and say "ok, you need to simplify here. You made an assumption there that is not quite accurate, and we need an adjustment there." It has gotten more useful, and my company has a lot of solid documentation that helps Claude, but there is a fundamental dynamic of me doing the actual engineering, Claude implementing the architecture, me reviewing and suggesting changes, and just iterating through that loop. It's quite useful, and Claude has improved over the past few years, but I don't think the fundamental relationship between software engineer and LLM has changed in the past few years even with its improvements. For additional context, I work in FinTech so error tolerance is very low.
While Claude can IME make software engineers more efficient, and thus decrease employment and/or trigger the Jevons paradox, I think actually replacing software engineering as a whole is probably AGI-Complete. I imagine with medical doctors it's a similar situation where it can let them help more patients, but the buck ultimately has to stop with the doctor until AGI.
The core question, to me, on if/when we get AGI seems to be this: can we reach AGI through iteratively improving LLMs and adding in supporting models to fill in gaps, or do we need an entirely new type of model to get there? If the first, AGI may be coming pretty soon. If the second, the timeline gets a lot harder to predict. I'm not confident which of the above two paths, or a different path, is what will happen.
I don't know if I would call myself an AI sceptic, but I haven't seen a huge win from agentic coding in my professional life.
I work in Java - LLMs seem to do better with Python or TypeScript.
I work on a legacy codebase - LLMs tend to do better with greenfield.
I work on a large codebase that is architected as a monolith - it's been my experience that the odds of an LLM shitting the bed begin to rise after about 15,000 lines and approach 100% after about a million lines.
I work on a codebase that has a surprising amount of non-CRUD code. LLMs get confused by that - especially when it's similar to stuff on GitHub, but not identical.
Quite a few of our customers operate in a regulated industry, and LLMs absolutely make shit up about regulatory compliance right now.
Overall, I don't think that my job is at risk, but I do have some concerns that somebody might vibe up a competitor that can eat enough of our customer base to knock us out of profitability. After the Delve fiasco, they might just straight up lie about compliance and temporarily capture some of the regulated customers as well.
Moving on from my personal experience, I have two acquaintances who are deep in agentic mania right now.
The first is not a professional programmer, but he has always wanted to use programming to achieve specific goals in his personal hobbies. Claude Code has been an absolute godsend for him. He's writing things that don't need to scale, don't really need to perform, and have no real consequences for incorrectness. From his perspective, the God-Machine is here to electro-immanentize the cyber-eschaton and the techno-rapture is nigh. The number of "ha ha you're gonna be out of a job and die in a gutter ha ha"-coded jibes I've gotten from him has been starting to wear on me. It's a perfect example of "the agent is only bad at things where I have personal expertise" playing out right in front of my eyes.
The second is a professional programmer, and his employer is going all in on agentic coding. They're actively tracking how many tokens each person is burning and actually using AI detectors in reverse to make sure that PRs are sufficiently crammed full of AI code. They're having huge problems because the agents get stuck in endless loops when they can't figure out how to write code that passes the pre-existing automated test suites. In the end, they're actually considerably less productive, but line go up. He's stuck with it, so he's desperately trying to make things not suck. He's convinced that there must be some secret sauce that makes the agents write quality code and not descend into iterative schizophrenia every time they encounter a ticket that's more complex than "change the color of this CSS class". He's spent dozens of hours of his own time trying to figure it out, and he had, until recently, been absolutely convinced he could make it work. Just one more bit of prompting - just a few more custom skills and it would do what all the boosters promised. He finally broke down recently and had a full blown crisis because he looked at Steve Yegge's gas town build pipeline. The damned thing basically never passes. Steve Yegge, the guy who is both highly technical and absolutely sold on the future of agentic coding, can't consistently get this shit to work. At this point my acquaintance called it all nonsense and gave up. He's doing the absolute bare minimum at work that he needs to do in order not to get fired and he's waiting for the tool chain to stabilize.
I'm not really sure where I'm going with it, but the three different experiences are interesting.
Holy hell, hoping my employer doesn't try this. They're already tracking our AI usage pretty thoroughly.
Their star performer hasn't actually generated a deliverable yet. He spent over $7,000 in API usage in one month doing something something training agents for HR something something. He doesn't even have a document outlining the plan.
It's worth pointing out that within a day, the AIs had gotten to 36%: https://www.symbolica.ai/blog/arc-agi-3.
In general, though, I agree. My take is that AIs are good at solving leetcode style problems but nothing bigger. The way to be productive with them is to know how to divide up your tasks into leetcode tasks for the AI and the non-leetcode tasks for people.
Is that because they actually improved at problem solving, or because the companies running the models gave them answers? I seem to recall that they have been caught doing the latter before, which is why people strive to test these things with new questions as much as possible.
It's not an unreasonable criticism in the abstract, but a few minutes of reading shows that it just doesn't apply in this case.
The higher score was published by Symbolica AI using Opus 4.6. So it couldn't be that Opus was retrained with the answers.
"This uses the same harness we previously published" so it couldn't be that they simply prompt the model with the answers.
The harness is published so you can see for yourself.
Benchmarks do not publish their entire problem set, so in general it's impossible for labs to simply "give the models the answers" to the problems that aren't published.
To be clear, I wasn't making the claim, it was a genuine question. I do recall reading stories in the past where the model creators basically cheated on benchmarks, but did not know if that was the case here or if it was a genuine improvement.
There's a fair bit of fudging, to be sure. Things like running your own model with different prompts or parameters, or just training to the test. But there's a limit. Since the actual ARC-AGI-3 test is not published, the only way companies could really "cheat" would be to sniff the data that's being fed into the models by the testers. While technically possible, that's pretty much Theranos-level fraud; I don't really suspect any AI company of doing this.
EDIT: Oops, I should have clicked the link. The 36% result was on the publicly available data, so it's not really an "official" result. For the reasons @sarker said, I still think it's fine, but it's not quite as bulletproof as I thought.
It's because they wrote a "good prompt" to get the models "thinking" in the "right way". No data leakage here.
The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve it and the model 100 steps then the model gets a score of 1% ((10/100)^2).
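To make the arithmetic concrete, here is a minimal sketch of the squared-efficiency calculation as described above. The function name is made up, and this only reproduces the ratio-squared formula; any other rules the real scoring applies (caps, cutoffs, aggregation across levels) are deliberately left out because I'm not reproducing them from the spec.

```python
def squared_efficiency(human_steps: int, model_steps: int) -> float:
    """Score one level as (human_steps / model_steps) squared."""
    if human_steps <= 0 or model_steps <= 0:
        raise ValueError("step counts must be positive")
    return (human_steps / model_steps) ** 2


# The example from above: human takes 10 steps, model takes 100.
print(f"{squared_efficiency(10, 100):.0%}")  # prints 1%
```

Note how harsh the squaring is: a model that takes only twice as many steps as the human baseline already drops to 25%.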
Edit: Also, "humans" means at least 6 out of 10 volunteers must have been able to solve some puzzle for it to be included, not 10 out of 10. Thus the test, even ignoring the previous caveat, is comparing the AI not against humans in general but against the 40th percentile of humans (see the bottom of page 15).
Indeed, there's almost nothing scientific about the scoring system of ARC-AGI-3; the test itself is kinda neat, and still highlights something that smart humans do (somewhat) better than the best LLMs, but it's dropped any pretense at being an actual measure of "general intelligence", and frankly they deserve to be ridiculed for the sensationalist scores.
Why is completion speed the main factor? Why is the difference squared? Speedruns are not how we define intelligence. If the squirrel in your backyard can solve sudokus, but a top-10th-percentile-of-self-selected-sudoku-solvers human can do it faster, you don't laugh and say "ha ha, this squirrel is so dumb". Also note that the test cuts the model off if it takes 5x longer than the smart human, and later questions build on earlier ones, so if a model goes slowly once it's handicapped for the rest of the test. (Again, this is probably completely intentional, to help deflate scores further.) They used a majority-of-self-selected-humans-can-solve-this metric for puzzle inclusion but not for the scoring. Why? Pure showmanship.
I suspect that average humans who take the test would probably also get a very low score! The old tests and metrics (including ARC-AGI-2) were useful because they showed something that humans genuinely find easy, but LLMs fail at. Those metrics have almost reached saturation, so I guess now we're switching to puzzles that some humans can solve but LLMs ... uh ... solve a bit slower. Ok?
But hey, the "0.5%" number does help low-information AI skeptics like OP point and laugh, so it's another "win" for AI journalism.
Note the "human baseline" isn't based on the human average, it's based on the "second-best first-run human playthrough" among the 10 people tested for each individual puzzle.
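As a rough illustration of what that selection means, here is a sketch that picks the second-best first-run result from a list of per-person step counts. It assumes "best" means fewest steps, and the numbers are invented for the example, not taken from the benchmark.

```python
def human_baseline(first_run_steps: list[int]) -> int:
    """Return the second-lowest step count among first-run playthroughs."""
    if len(first_run_steps) < 2:
        raise ValueError("need at least two playthroughs")
    return sorted(first_run_steps)[1]


# Hypothetical first-run step counts for 10 testers on one puzzle.
steps = [30, 12, 18, 25, 40, 15, 22, 19, 33, 28]
print(human_baseline(steps))  # prints 15
```

With these made-up numbers the baseline is 15 steps, while the median tester needed more than 20, so a model matching the typical human would still score well under 100%.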
Many of the puzzles have a limited number of steps in which they can be completed. The puzzle it loads by default, for example, has an energy bar that is depleted with each move you make. You have to be able to actually reason about the rules of the game and the objects within the levels to be able to complete them at all; you only have maybe a 10% buffer of energy for mistakes and/or choosing a less efficient route/method to solve them.
Correct, that's why I suggested people in the 100-110 iq range (roughly) or higher would likely be able to solve them.