site banner

Culture War Roundup for the week of June 9, 2025

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

5
Jump in the discussion.

No email address required.

I have seen the AGI, and it is Gemini 2.5 Pro

If you go and look at Metaculus you’ll see that regardless of the recent breakthroughs like VEO3 and OpenAI’s “Ghiblification” (probably the first AI image system where accusing the outputs of being “slop” makes the accuser look unreasonable rather than denigrates the picture itself) all the “when AGI?” benchmarks have been uncharacteristically stubborn. The question asking about “weak” AGI has gone nowhere for two weeks months while the median prediction on the question about full AGI has receded three years from 2031 to 2034.

It looks like Scott’s AGI 2027 push has failed to convince the markets. For the informed person, AGI is coming “soon” but isn’t imminent. However I think that actually AGI is already here, is freely available to anyone with an internet connection and is called Gemini 2.5 Pro.

For those of us not in the know, at the moment you can access Gemini 2.5 Pro for free with no limits on Google’s AI studio right here: https://aistudio.google.com/prompts/new_chat ; yep, you heard that right, the literal best text model in the world according to the lmarena.ai leaderboard is available for free with no limits and plenty of customisation options too. They’re planning on connecting AI studio access to an API key soon so go and try it out for free right now while you can. No need to overpay for ChatGPT pro when you can use AI studio, and it’s a lot lot better than the Gemini you get via the dedicated app/webpage.

Our story begins a few days ago when I was expecting delivery of a bunch of antique chinese hand scroll paintings I had purchased. Following standard Chinese tradition where collectors would add their own personal seal in red ink to the work and seeing as these scrolls already had a bunch of other seal glyphs I wanted to add my own mark too. The only issue was that I didn’t have one.

This led to a rabbit hole where I spent a good portion of my Saturday learning about the different types of Chinese writing all the way from oracle bone script to modern simplified script and the different types of stones from which seal were made. Eventually after hours of research I decided I wanted a seal made from Shoushan stone written in Zhuànshū script. That was the easy part.

The real difficulty came in translating my name into Chinese. I, with a distinctly non Chinese name, don’t have an easy way to translate the sounds of my name into Chinese characters, which is made all the harder by the fact that pretty much all Chinese syllables end in a vowel (learning this involved even more background reading) even though my name has non-vowel ending syllables. Furthermore, as a mere mortal and not a Son of Heaven with a grand imperial seal, decorum dictated that my personal mark be only 4 characters and around 2cm*2cm, enough to be present but not prominent on the scroll.

All this led to further constraints on the characters to be put on my seal, they couldn’t be so complex that carving them on a small seal would be impossible, and yet I needed to get my name and surname as accurately onto it as possible. Naturally this involved a lot of trial and error and I think I tried over 100 different combinations before coming up with something that sort of (but not completely) worked.

There was one syllable for which I could not find any good Chinese match and after trying and rejecting about a dozen different choices I threw my hands up and decided to consult Gemini. It thought for about 15 seconds and immediately gave me an answer that was superior to literally everything I had tried before phonetically, however unfortunately was too complex for a small seal (it wouldn’t render on the website I was buying the seal from).

I told Gemini about my problem and hey ho, 15 seconds later another character, this time graphically much simpler but sounding (to my non-Chinese ears) exactly the same was present and this actually rendered properly. The trial and error system I was using didn’t even have this particular character as an option so no wonder I hadn’t found it. It also of its own volition asked me whether I wanted to give it my full name so it could give me characters for that. I obliged and, yes, its output mostly matched what I had but was even better for one of the other syllables.

I was honestly very impressed. This was no mean feat because it wasn’t just translating my name into Chinese characters but rather translating it into precisely 4 characters that are typographically simple enough to carve onto a small seal, and with just a few seconds of thought it had managed to do something that had taken me many hours of research with external aids and its answer was better than what I had come up with myself.

All this had involved quite a bit of back and forth with the model so out of curiosity at seeing how good it was at complex multi step tasks given in a single instruction I opened up a fresh chat and gave it 2-3 lines explaining my situation (need seal for marking artworks in my collection). Now I’m an AI believer so I thought it would be good enough to solve the problem, which it absolutely did (as well as giving me lots of good unprompted advice on the type of script and stone to use, which matched my earlier research) but it also pointed out that by tradition only the artist themselves mark the work with their full name, while collectors usually include the letter 藏 meaning “collection”.

It told me that it would be a Faux Pas to mark the artworks with just my name as that might imply I was the creator. Instead it gave me a 4 letter seal ending in 藏 where the first three letters sounded like my name. This was something that I hadn’t clocked at all in my hours of background reading and the absolute last thing I would ever want is to look like an uncultured swine poseur when showing the scrolls to someone who could actually read Chinese.

In the end the simple high level instruction to the AI gave me better final results than either me on my own or even me trying to guide the AI… It also prevented a potential big faux pas that I could have gone my whole life without realizing.

It reminded me of the old maxim that when you’re stuck on a task and contacting a SysAdmin you should tell them what your overall goal is rather than asking for a solution to the exact thing you’re stuck on because often there’s a better way to solve your big problem you’ve overlooked. In much the same way, the AI of 2025 has become good enough that you should just tell it your problem rather than ask for help when you get stuck.

Now yes, impressive performance on a single task doesn’t make AGI, that requires a bit more. However its excellent performance on the multilingual constrained translation task and general versatility across the tasks I’ve been using it for for the last few weeks (It’s now my AI of choice) means I see it as a full peer to the computer in Star Trek etc. It’s also completely multimodal these days, meaning I can (and have) just input random PDFs etc. or give it links to Youtube videos and it’ll process them no different to how a human would (but much faster). Funny how of all the futuristic tech in the Star Trek world, this is what humanity actually develops first…

Just last week I’d been talking to a guy who was preparing to sit the Oxford All Souls fellowship exam. These are a highly gruelling set of exams that All Souls College Oxford uses to elect two fellows each year out of a field of around 150. The candidates are normally humanities students who are nearing the end of their PhD/recently graduated. You can see examples of the questions e.g. the History students get asked here.

However the most unique and storied part of the fellowship exam (now sadly gone) was the single word essay. For this, candidates were given a card with a single word on it and then they had three hours to write “not more than six sides of paper” in response to that prompt. What better way to try out Gemini than give it a single word and see how well it is able to respond to it? Besides, back in 2023 Nathan Robinson (or Current Affairs fame) tried doing something very similar with ChatGPT on the questions from the general paper and it gave basically the worst answers in the world so we have something to compare with and marvel at how much tech has advanced in two short years.

In a reply to this post I’m pasting the exact prompt I used and the exact, unedited answer Gemini gave. Other than cranking up the temperature to 2 no other changes from the default settings were made. This is a one-shot answer so it’s not like I’m getting it to write multiple answers and selecting the best one, it’s literally the first output. I don’t know whether the answer is good enough to get Gemini 2.5 Pro elected All Souls Fellow, but it most certainly is a damn sight better than the essay I would have written, which is not something that could be said about the 2023 ChatGPT answers in the link above. It also passes for human written across all the major “AI detectors”. You should see the words and judge for yourself. Perhaps even compare this post, written by me, with the output of the AI and honestly ask yourself which you prefer?

Overall Gemini 2.5 Pro is an amazing writer and able to handle input and output no different to how a human would. The only thing missing is a corporeal presence but other than that if you showed what we have out there today to someone in the year 2005 they would absolutely agree that it is an Artificial General Intelligence under any reasonable definition of AGI. It’s only because of all the goalpost moving over the last few years that people have slowly become desensitized to chatbots that pass the Turing test.

So what can’t these systems do today? Well, for one they can’t faithfully imitate the BurdensomeCount™ style. I fed Gemini 2.5 Pro a copy of every single comment I’ve ever made here and gave it the title of this post, then asked it to generate the rest of the text. I think I did this over 10 times and not a single one of those times did the result pass the rigorous QC process I apply to all writing published under the BurdensomeCount™ name (the highest standards are maintained and only the best output is deemed worthy for your eyes, dear reader). Once or twice there were some interesting rhetorical flourishes I might integrate into future posts but no paragraph (or larger) sized structures fit to print as is. I guess I am safe from the AI yet.

In a way all this reminds me of the difference between competition coding and real life coding. At the moment the top systems are all able to hit benchmarks like “30th best coder in the world” etc. without too much difficulty but they are still nowhere near autonomous for the sorts of tasks a typical programmer works with on a daily basis managing large codebases etc.. Sure, when it comes to bite sized chunks of writing the AI is hard to beat, but when you start talking about a voice and a style built up over years of experience and refinement, well, that is lacking…

In the end, this last limitation might be the most humanizing thing about it. While Gemini 2.5 Pro can operate as an expert Sinologist, a cultural advisor, and a budding humanities scholar, it cannot yet capture a soul. It can generate text, but not a persona forged from a lifetime of experience. But to hold this against its claim to AGI is to miss the forest for one unique tree. Its failure to be me does not detract from its staggering ability to be almost everything else I need it to be. The 'general' in AGI was never about encompassing every niche human talent, but about a broad, powerful capability to reason, learn, and solve novel problems across domains—a test it passed when it saved me from a cultural faux pas I didn't even know I was about to make. My style, for now, remains my own, but this feels less like a bastion of human exceptionalism and more like a quaint footnote in the story of the powerful, alien mind that is already here, waiting for the rest of the world to catch up.

Funny how of all the futuristic tech in the Star Trek world, this is what humanity actually develops first…

No that was the comm-link.

I want to believe, but I asked Gemini 2.5 Pro to spec a computer for me, and it starts hallucinating motherboards that don't exist, insisting on using them even after being told they don't exist. Maybe it's OK for brainstorming, but everything it says needs to be double-checked. We ain't there yet.

So what can’t these systems do today?

They can't solve any problems that anyone asked to solve. It seems like every piece of software is touting new AI-integrated features, except they can't do anything other than generate text I didn't ask for. I more or less write for a living, but I don't need AI to write for me, and it isn't even capable of doing the kind of writing I need it to do. The best case scenario is that it can generate pro-forma motions and the like, but it would take me just as long to input the information into AI as it would to just type it into the document myself.

On the other hand, there's a lot of stuff relating to the software itself that so-called "artificial intelligence" should be able to make easier, but simply can't. For instance, last week I was drafting a document using another document as a template. It had a numbered list. I deleted some of the irrelevant items and replaced them with other items I had pasted from a different document. Except this completely fucked up the numbering system so that it ended at 6 and then started over again at 1. It also caused this weird indentation mismatch. I just wanted to get everything uniform from the top down, and I didn't really care what the formatting looked like, so long as it was consistent. I tried to fix it on my own but couldn't figure out what was wrong, even after searching the internet. Then it dawned on me that this would be the perfect problem for the AI. I very specifically described the problem to the AI assistant and described how I wanted it to look. I was informed that it could not fix the problem. I was informed that it couldn't even tell me how to fix the problem. Same with the AI assistant in Adobe Acrobat.

This seems like it would be the biggest gain for AI technology, especially for complicated pieces of software that frustrate users to no end. If I could just tell the computer what to do in plain English instead of needing specialized knowledge, it would solve a lot of problems. But apparently this isn't possible. They're more interested in slapping a chatbot onto it and claiming it's now intelligent. Bullshit. UI issues are some of the biggest complaints users have, and coming up with an interface so straightforward would give any company a competitive advantage. Remember how clunky Word was before the ribbon? But instead we're supposed to think that because it can generate sloppy text it's somehow going to put us out of work. The truth is, it can't even format it correctly.

So what can’t these systems do today?

Far as I know they can't renew a prescription for you, which has been my personal benchmark for 'agentic' AI for a year or so.

Or maybe its not that they can't but they aren't permitted to for liability or similar reasons.

I just want to be able to ask the thing "I'm running low on [pharmaceutical product], please order up a refill. And sometimes that process requires navigating multiple phone trees for both the pharmacy provider and the party doing the prescribing, to provide various sorts of documentation, sometimes via fax(!) and to make a payment and arrange for pickup or delivery at a convenient time.

All stuff I find very boring and tedious, so if I could offload it to an AI I would do so in a heartbeat.

Not to pick on you since this seems like a common category of problem... but the task is entirely artificial. There's no technical reason renewing a prescription requires you to do anything more than log into your pharmacy somehow and click a "renew" button. Any further complexity is because the pharmacy decided to waste your time.

I feel like I often hear people suggest using AI to navigate some unnecessary complexity like that, when what you actually need is systems that don't suck. Or at least being allowed to have third-party systems exist that work around them sucking. AI doesn't really have anything to do with it. If someone comes up with an AI bot that works around the poor design, people will come up with even worse designs to counter that.

There's no technical reason renewing a prescription requires you to do anything more than log into your pharmacy somehow and click a "renew" button. Any further complexity is because the pharmacy decided to waste your time.

Yes.

YES.

YOU'D THINK THAT.

But you click the 'renew' button and the Pharmacy reports that you have to get a new scrip from your physician. Well okay. You call the physicians office and they say you need to submit proof of your identity sufficient to make sure they're writing it for the right person. E-mail won't do, they need it faxed or you can stop by in person. Then once that's done, they will forward the scrip to the pharmacy. But it turns out the only way to check if the pharmacy got the scrip is to actually call, which means waiting on hold, and once you've done all the intermediate steps, THEN the 'renew' button works. And then add in a layer of fun if you want to get insurance involved.

Maybe other pharmacies do it differently, but I assume a nontrivial part of the process is regulatory compliance and antifraud measures.

Its one of those tasks where it could be a 2-5 minute diversion, or 90 minutes of running around, navigating phone trees and getting various ducks in a row to get the particular outcome you want/need, b/c the parties involved are not motivated to help much, are concerned about fraud/deception, and are not in good communication with each other.

So as the one person properly motivated to complete the task, who isn't worried about fraud, and can act as the intermediary between the parties, I'm now shouldering the organization burden. It is what it is, but I'd sure love to throw AI at the task.

I'm one of those "won't go to the doctor unless a limb falls off" guys, so I was in my 30s before I realized that doctor prescriptions are sent to a specific pharmacy, and you cannot buy the prescribed medicine from any other pharmacy. If you want to buy your prescribed medication from a different pharmacy, you have to talk to the doctor (or, more likely, the nurse that is being supervised by a doctor) and ask them to send the prescription to a new pharmacy. What the actual fuck?!

It's shit like this that convinces me to stick with OTC pills until the day I die.

In Mexico I filled prescriptions by taking the slip of paper to a farmacia of my choosing. Just walk in and get it. Such a better system.

In ye old days we gave you a physical prescription that you could take with you, show up the pharmacy and shout "gib dis" and if they said "no have" you could take the same piece of paper to another place.

Now we mostly use electronic medical records and we ask you what your pharmacy is and send the information directly to that pharmacy.

Why do we do it that way? Likely things like "regulatory burden" and "let's not accidentally D-DOS the pharmacies with all of these requests."

Now I personally prefer paper script pads for some types of things and ask for them myself, but if your doctor does not allow that it likely it is because whoever owns them (large hospital system or PE firm) does not permit them. We don't complain too much because handwriting a prescription is a pain the ass and our handwriting is more ass.

I don't care if the prescription is printed rather than handwritten. Or if it's in a national database instead of being a physical document. I just don't want it to be sent to a single pharmacy; that's fucking ridiculous.

Even better, no medications should require a prescription. Let it all be OTC. Then the prescription can simply be information about what your doctor recommends.

Again, the ability to walk around with a general prescription that can be used at any pharmacy is the default state - in essence it has been removed by regulatory burden and corporate oversight.

No reason it can't come back other than those things (and plenty of doctors are still able to prescribe via paper).

Take it up with the government.

Expanded OTC formularies are something that can be done in different cultural milieus but is simply incompatible with America. Too many people would kill or harm themselves or others. The costs and externalities are too high.

but the task is entirely artificial.

It is trained on the corpus of human text, most of which pertains to artificial problems rather than real problems. So AI should be better at the administrative-state stuff than the real stuff.

That isn't quite what I meant. Sure I believe an LLM-based agent may be able to accomplish that task. But if the intention were to make the task automatable, then you wouldn't need one. Since the point is to make the task not automatable, this is just a step in an arms race of making the task more frustrating.

As someone who is not nearly as impressed with AI as you, thank you for the Turing test link. I've personally been convinced that LLMs were very far away from passing it, but I realize I misunderstood the nature of the test. It depends way too heavily on the motivation level of the participants. That level of "undergrad small-talk chat" requires only slightly more than Markov-chain level aptitude. In terms of being a satisfying final showdown of human vs AI intelligence, DeepBlue or AlphaGo that was not.

I still hold that we're very far away from AI being able to pass a motivated Turing test. For example, if you offered me and another participant a million dollars to win one, I'm confident the AI would lose every time. But then, I would not be pulling any punches in terms of trying to hit guardrails, adversarial inputs, long-context weaknesses etc. I'm not sure how much that matters, since I'm not sure whether Turing originally wanted the test to be that hard. I can easily imagine a future where AI has Culture-level intelligence yet could still not pass that test, simply because it's too smart to fully pass for a human.

As for the rest of your post, I'm still not convinced. The problem is that the model is "demonstrating intelligence" in areas where you're not qualified to evaluate it, and thus very subject to bullshitting, which models are very competent at. I suspect the Turing test wins might even slowly reverse over time as people become more exposed to LLMs. In the same way that 90s CGI now sticks out like a sore thumb, I'll bet that current day LLM output is going to be glaring in the future. Which makes it quite risky to publish LLM text as your own now, even if you think it totally passes to your eyes. I personally make sure to avoid it, even when I use LLMs privately.

Well remember even passing the basic casual Turing test used to be extremely difficult. It took at least 65 years between the creation of the test and systems beginning to pass it consistently. And I still remember science articles and science fiction stories from the 90s and 2000s talking about it like it was the holy grail. It’s only in the past few years that it’s started to seem like an inadequate measurement of an AI’s capabilities.

Interestingly your motivated Turing test starts to sound a lot like the Voight-Kampff test from Bladerunner.

Is there any reason the test was treated as a holy grail other than the "Turing" name brand? I can't see any theoretical justification for it.

The theoretical justification for it is something analogous to the idea of a Universal Turing Machine, though obviously not rigorous.

If we come up with any other test to determine "human-level intelligence", a test that can't be beaten by a "spiky" non-general intelligence that outperforms in unexpected areas (I'm old enough to remember when chess performance was a generally-accepted sign of intelligence!), then someone judging a Turing test can just use that other test. If it turns out that for some reason an AI really can't understand how to respond to a weird hypothetical about upside-down tortoises, then the judge can ask them about upside-down tortoises. If computers had sucked at chess, a judge could have asked the AI to play chess. Computers only start to beat a Turing test reliably when there's nothing a judge can come up with that they can't beat.

I can't see any theoretical justification for it.

This is the way I always understood it. Lacking the ability to detect any internal experience other than our own, the way we distinguish between 2 different things is by applying input to them and seeing if there's differences in output, e.g. we shine light on it and detect what qualia the light that reflects off of it and into our eyeballs generate in our minds. Detecting intelligence isn't as simple as detecting the color or shape of something and wouldn't involve inputting light rays but rather words to see what words get returned in response. If there's no way to distinguish between 2 different entities in this way, then it makes no sense to say that 1 has human-level intelligence while the other lacks it. For that to be the case, there must be some way to induce different outputs from those 2 things with the same input. In something relating to intelligence, anyway; input-output of words probably don't cover the entirety of all possible detection mechanisms, but they do seem to me to cover a lot.

Because it was an impossibly high bar. Nothing was able to do that, for years. The idea that you’d be able to talk to a computer program and not recognize it seemed like science fiction.

Very intuitive, sensible, and wasn’t surpassed for 80 years.

In the same way that 90s CGI now sticks out like a sore thumb, I'll bet that current day LLM output is going to be glaring in the future.

Interesting idea! Although there is definitely CG from the '90s that still looks downright good. Jurassic Park comes to mind as a masterpiece, which largely worked because the artists understood what worked well with the technology of the time: night shots (few light sources, little global illumination) of shiny-but-not-reflective surfaces (wet dinosaurs), used sparingly and mated with lots of practical effects.

CG only became a negative buzzword when it got over hyped and stretched to applications that it just wasn't very good for at the time. In some ways it's improved since (we can render photoreal humans!), but it still does get stretched in shots that are IMO just bad movie making ideas ("photorealistic, yet physics-defying").

I could see AI slop going the same way: certain "tasteful" uses still look good, but the current flood of AI art (somehow all the girls have the same face, and I've definitely spotted plenty of online ads that felt cheap from obvious AI use) will be "tacky" and age poorly.

It also passes for human written across all the major “AI detectors”.

The stench of AI is great with the essay you posted in the comment below. Just looking at it sets off many alarm bells.

Also I literally pasted the first paragraph into gptzero and it returned a score of 100% ai

Perhaps even compare this post, written by me, with the output of the AI and honestly ask yourself which you prefer?

Your gemini essay, posted below, is not worth the pixels it's printed on and not worth reading past the point of smelling the ai stench. Clearly the human one is better.

I assure you the first paragraph was written by me. Do you really think the AI would automatically reference the "nowhere in two weeks" rdrama.net meme?

I'm referring to the gemini output that you posted in a comment below. The one that starts with "Of all the names that echo from the chambers of power ..." and which you falsely claimed passes most AI detectors.

I edited my above comment for clarity

Hm, the first paragraph of that is coming up 0% AI written for me in ZeroGPT.

/images/1749486945465418.webp

gpt zero (the naming is so annoying)

zerogpt sucks.

Interesting; yes GPTZero says the first paragraph is AI, however for the first half of the text (it won't let me upload more than 5000 characters at once) says it's a coinflip between being human or AI and there are paragraphs which it is highly sure are human written.

/images/1749494140017286.webp

I'll admit that some of the later paragraphs are less obviously AI generated. The first few paragraphs are extremely stinky then it just devolves into academic-sounding nonsense.

Anyways the point still stands that the answer to this prompt does not convincingly pass as human written.

  1. You say gemini is so good yet you don't compare its performance on this art task with other models like chadgpt and claude
  2. The slop certainly rolled persuasion on you, but as non chinese you have no ability to actually judge the result. What you chose may be incredibly cringe and you would have no clue.
  3. In my ime asking AI to make some chinese names resulted in some incredibly cringe results.
  4. You, in fact, are a poseur. Nobody will judge if you just get some scrolls because they look good. But if you start apeing traditions you have no understanding of it runs the risk.
  5. Gemini is good but in my ime underperforms its benchmarks. Nevertheless it's not "AGI"any more than competing models such as o3 and claude thinking.

Well, for one they can’t faithfully imitate the BurdensomeCount™ style

I think trouble with style transfer is very much a chatbot related issue. I think current ai can do it but that would require sacrificing performance and possibly alignment.

I actually meant to test trying to do style transfer on some base models but never gpt around to it.

I don't think your premises are true and meaningful. Some may be true. Some would be meaningful if they were true but aren't.

That's right, every single time any of us goes to a Western LLM provider's chatbot and says hi, they bleed money. If you pay them 20 dollars, they bleed even more money since you are a power user and get access to their shiny objects. The newest being deep research, which according to some estimates, costs a thousand USD per query. Yes, a thousand.

A thousand USD? Surely not. Deepseek R1 has a kind of deep research and it's very cheap. You say in comments you realise that was speculation but I think you just don't have any kind of understanding what a believable cost is for this kind of service. It just doesn't cost that much per call!

Also, OpenAI does have financials that tell a totally different story to what you're saying: https://sacra.com/c/openai/

OpenAI hit $10B in annualized revenue run rate as of May 2025, nearly doubling from $5.5B in December 2024.

OpenAI currently operates at ~40% gross margins

Inference is cheap and profitable.

Who cares if training costs go to 1 billion? Or even 10 billion? That's a tiny amount of money in the grand scheme of things. Facebook spent 20 billion on the metaverse, earned negligible returns and shrugged it off. The reason there's few profits on AI is because of massive investment and competition, everyone recognizes the enormous value and potential of this technology.

The newest being deep research, which according to some estimates, costs a thousand USD per query.

I would bet fairly good odds that this is not true.

This is talking about the cost to run on a test where they gave it a ludicrous token budget to perform sota evals, not the thing you run by default as a consumer.

You misunderstand significantly what they spent $1,000 on. It's per task, not per query. I remember the results this article is summing up. If you look at the originak source, you'd see that it's 1k dollars per task... using a super chain of thought reasoning workflow, spinning up a ton of separate agents, running and restarting up to budget, and taking the best result. Very, very far from a thousand dollars per query. Each task was probably thousands of queries.

When they weren't trying to brute force the benchmark by trying the same model thousanda of times, it was around 17-20 dollars per task. Again, the arc agi tasks are not single queries. https://arcprize.org/blog/oai-o3-pub-breakthrough

Which part of this article claims $1000 per query for deep research?

Thanks for your detailed and passionate take on the AI industry. I've gone through your comment and will fix the typos and grammar while maintaining your original meaning and structure.

Sigh. Count has already been rapped on the knuckles for copying and pasting AI content. It violates the low-effort guidelines. Don't do this.

just post it, typos and all. Or if you're using a browser like failfox spell check should be bundled with that so you don't need the os to do it.

And not a word about open weight models?

I can run Qwen2.5VL on my desktop and it can read tables and documents visually. That alone is a multi-billion dollar value proposition for office work. And it's not monetized, it's free. But you can build things with it and monetize that.

I agree with you that when it all shakes out proprietary ultra-massive b2b saas AI will not be the thing that really shakes up society or industry. But AI is here to stay - I can already run shit that would have been nigh miraculous 2 years ago on my damn phone, locally.

I agree that the bubble will almost certainly burst at some point, and lots of people will get burned. I strongly disagree that it's all just hype though, or that LLMs are a "scam". They're already highly useful as a Super Google, and that'll never go away now. They're generating billions in revenue already -- it's not nearly enough to sustain their current burn rates, but there's lots of genuine value there. I'm a professional software engineer, and AI is extremely helpful for my job; anyone who says it isn't is probably just using it wrong (skill issue).

They aren't a replacement for a junior dev, as the Dev will get better, whilst llms at best will be iteratively better if they keep getting billions to burn.

You seem to be ignoring that while junior devs have to get better separately and each new generation of devs has to gain experience anew (until we have direct knowledge brain-grafts), LLMs just stay better once they got better.

thousands of jobs that are about to be lost soon

Why are you sad about jobs created by a bubble being lost by the bubble popping? Isn't that just a return to the status quo?

They're already highly useful as a Super Google

If you're careful, they are. But that care requires twice as much checking: instead of just having to verify that the web page you find knows what it's talking about, you have to verify that the AI correctly summarized what it's talking about, and God help you if you just believe the AI about something for which it doesn't cite sources. But even Google's cheap "throw it in every search" AI seems to be much less likely to bring up unrelated web pages than the previous Google option of "let the search engine interpret your query terms loosely", and it's much less likely to miss important web pages than the previous Google option of "wrap most of your query in quotes so the stupid engine doesn't substitute unrelated-in-your-context words for your actual query terms", so it's still very useful.

The one thing I've repeatedly found to be most useful about current LLMs is that they're great at doing "dual" or "inverse" queries. If I knew I wanted the details of Godunov's Theorem, even a dumb search engine would have been fine to bring up the details of Godunov's Theorem - but when all I could recall was that I wanted the details of "some theorem that proves it's impossible to get higher order accuracy and stability from a numerical method for boundary-value problems without sacrificing something", but I didn't even recall the precise details, I wrote a wishy-washy paragraph for Claude and in the reply its first sentence gave me exactly the name of the theorem I wanted to search for. I can't imagine how much longer it would have taken to find what I was looking for with Google.

AI is extremely helpful for my job; anyone who says it isn't is probably just using it wrong (skill issue).

I'm currently not allowed to use a top-of-the-line model for my job (even though I mostly work on things that aren't ITAR or classified, we've got a blanket limitation to an in-house model for now), but I'm definitely worried that I'll have a skill issue when the rules get improved. What do you do to get AI help with a large code base rather than a toy problem? Point it to a github repo? Copy-and-paste a hundred thousand lines of code to make sure it has enough context? Paste in just the headers and/or docs it needs to understand a particular problem?

Use an AI-integrated IDE like Cursor or Windsurf (now bought by OpenAI sigh).

Your query looks like ‘I have an error that look like paste text and I think it’s being caused by @Object1 not being destroyed properly during garbage collection’.

The IDE gives the codebase structure to the model, which queries the object you mentioned, its headers, etc. then does a search of the repo for where it’s used, then…

But I don’t think I’ve ever worked on a codebase that you would consider large and of course this only works for a monorepo.

I'm also not allowed to use the best models for my job, so take my advice (and, well, anyone else's) with a grain of salt. Any advice you get might be outdated in 6 months anyway; the field is evolving rapidly.

I think getting AI help with a large code base is still an open problem. Context windows keep growing, but (IMO) the model isn't going to get a deep understanding of a large project just from pasting it into the prompt. Keep to smaller components; give it the relevant source files, and also lots of English context (like the headers/docs you mentioned). You can ask it design questions (like "what data structure should I use here?"), or for code reviews, or have it implement new features. (I'm not sure about large refactors - that seems risky to me, because the model's temperature could make it randomly change code that it shouldn't. Stick to output at a scale that you can personally review.)

The most important thing to remember is that an LLM's superpower is comprehension: describe what you want in the same way you would to a fellow employee, and it will always understand. It's not some weird new IDE with cryptic key commands you have to memorize. It's a tool you can (and should) talk to normally.

What do you do to get AI help with a large code base rather than a toy problem?

Two things mainly:

  1. Have a good prompt that has the nuances of the crappy, antiquated setup my work is using for their legacy systems. I have to refine this when it runs into the same sorts of errors over and over (e.g. thinking we're using a more updated version of SQL when we're actually using one that was deprecated in 2005).

  2. Play context manager, and break up problems into smaller chunks. The larger the problem that you're getting AI to do, the greater the chance that it will break down at some point. Each LLM has a certain max output length, and if you got even close to that then it can stop doing chain-of-though to budget its output tokens, which makes its intelligence tank. The recent Apple paper on the Tower of Hanoi demonstrated that pretty clearly.

Does your Chinese scroll also have an Emperor's signature and archival stamp? Can we see it or is that gauche?

Nah, my scrolls aren't that august. They're all late Qing/republic period (late 19th Century, early 20th century) works by no name artists painting the usual subjects of bamboo, shrimp and mountainous landscapes. They don't really have any artistic value beyond the fact that they look pretty and aren't reproductions, selling for a few hundred dollars each and the stamps on them are also of randoms, I expect if there was an Imperial seal at the very minimum the price would be in the 10s of thousands of dollars per scroll and I don't have that sort of money. Most certainly if what I had was a valuable work I would not be putting my own seal on it as that could easily damage its worth.

Nah, my scrolls aren't that august.

Darn. A different piece from the same collection as the example image sold for a cool 75 million USD, so I felt compelled to ask. Love the scholarly, bureaucratic nature of the tradition. How very Chinese. I'd be impressed if you unrolled it in front of me. Very cool.

What a charming hobby.

Count is a charming guy. He's very well groomed... from what I've heard.

Flattered...

At this point, I don't even know what an AGI is. The word has just been semantically saturated for me.

What I do know, based on having followed the field since before GPT-2 days, and personally fucked around since GPT-3, is that for at least a year or so, SOTA LLMs have been smarter and more useful than the average person. Perhaps one might consider even the ancient GPT 3.5 to have met this (low) bar.

They can't write? Have you seen the quality of the average /r/WritingPrompts post?

They can't code? Have you seen the average code monkey?

They can't do medicine/math/..? Have you tried?

The average human, when confronted with a problem outside their core domain of expertise, is dumb as rocks compared to an LLM.

I don't even know how I managed before LLMs were a thing. It hasn't been that long, I've spent the overwhelming majority of my life without them. If cheap and easy access to them were to magically vanish, my willingness to pay to get back access would be rather high.

Ah, it's all too easy to forget how goddamn useful it can be to have access to an alien intelligence in one's pocket. Even if it's a spiky, inhuman form of intelligence.

On the topic of them being cheap/free, it's a damn shame that AI Studio is moving to API access only. Google was very flustered by the rise of ChatGPT and the failure of Bard, it was practically begging people to give Gemini a try instead. I was pleasantly surprised and impressed since the 1.5 Pro days, and I'm annoyed that their gambit has paid off, that demand even among normies and casual /r/ChatGPT users increased to the point that even a niche website meant for powerusers got saturated.

They can't write? Have you seen the quality of the average /r/WritingPrompts post?

I'm sorry but being a better writer than literal redditors on /r/WritingPrompts is not a high bar to pass.

And yet it is a bar that most humans cannot pass. We know this because redditors are humans (and, in fact, since they are selected for being literate and interested in creative writing, they must be above average human writing ability). That's the point of the grandparent; ChatGPT blew right past the Turing Test, and people didn't notice because they redefined it from "can pass for the average human at a given task" to "can pass for the top human at a given task".

There are plenty of tasks (e.g. speaking multiple languages) where ChatGPT exceeds the top human, too. Given how much cherrypicking the "AI is overhyped" people do, it really seems like we've actually redefined AGI to "can exceed the top human at EVERY task", which is kind of ridiculous. There's a reasonable argument that even lowly ChatGPT 3.0 was our first encounter with "general" AI, after all. You can have "general" intelligence and still, you know, fail at things. See: humans.

If you say "it's okay for the AI to do as poorly as a poorly performing human", you'll end up concluding that even an Eliza program can do better than a drunk human who can barely type out words on a keyboard. And if you say "the AI only needs to exceed a top human at a few tasks", then a C64, which can run a simple calculator or chess program, would count as a general AI.

People are not cherrypicking. What they are doing is like the Turing test itself, but testing for intelligence instead of for "is like a human". People asking questions in a Turing test can't tell you in advance which questions would prove the target is a computer, but they have implicit knowledge that lets them dynamically change their questions to whatever is appropriate. Likewise, we don't know in advance exactly what things ChatGPT would have to do to prove it's a general intelligence, but we can use our implicit knowledge to dynamically impose new requirements based on how it succeeds at the previous requirements.

Saying "well, it can write, but can it code" is ultimately no different from saying "well, it can tell me its favorite food, but can it tell me something about recipes, and its favorite book, and what it did on Halloween". We don't complain that when someone does a Turing test and suddenly asks the computer what it did on Halloween, that he's cherrypicking criteria because he didn't write down that question ahead of time.

Well, I don't think your analogy of the Turing Test to a test for general intelligence is a good one. The reason the Turing Test is so popular is that it's a nice, objective, pass-or-fail test. Which makes it easy to apply - even if it's understood that it isn't perfectly correlated with AGI. (If you take HAL and force it to output a modem sound after every sentence it speaks, it fails the Turing Test every time, but that has nothing to do with its intelligence.)

Unfortunately we just don't have any simple definition or test for "general intelligence". You can't just ask questions across a variety of fields and declare "not intelligent" as soon as it fails one (or else humans would fail as soon as you asked them to rotate an 8-dimensional object in their head). I do agree that a proper test requires that we dynamically change the questions (so you can't just fit the AI to the test). But I think that, unavoidably, the test is going to boil down to a wishy-washy preponderance-of-evidence kind of thing. Hence everyone has their own vague definition of what "AGI" means to them; honestly, I'm fine with saying we're not there yet, but I'm also fine arguing that ChatGPT already satisfies it.

There are plenty of dynamic, "general", never-before-seen questions you can ask where ChatGPT does just fine! I do it all the time. The cherrypicking I'm referring to is, for example, the "how many Rs in strawberry" question, which is easy for us and hard for LLMs because of how they see tokens (and, also, I think humans are better at subitizing than LLMs). The fact that LLMs often get this wrong is a mark against them, but it's not iron-clad "proof" that they're not generally intelligent. (The channel AI Explained has a "Simple Bench" that I also don't really consider a proper test of AGI, because it's full of questions that are easy if you have embodied experience as a human. LLMs obviously do not.)

In the movie Phenomenon, rapidly listing mammals from A-Z is considered a sign of extreme intelligence. I can't do it without serious thought. ChatGPT does it instantly. In Bizarro ChatGPT world, somebody could write a cherrypicked blog post about how I do not have general intelligence.

The Turing Test ain’t simple pass/fail. It doesn’t specify an amount of time for the interaction, for instance, or whether it iterates, or whether people know the characteristics of the AI. I’d say that current LLMs could fool Turing himself, on the first go, but given a few iterations and enough time he’d notice something was up. Look at how our mods play spot the LLM. This would be a blanket yes/no if the Turing Test were pass/fail, but in reality it’s an evolving thing.

Perhaps one might consider even the ancient GPT 3.5 to have met this (low) bar.

Why do you consistently assume that people who don't share your views of LLM capabilities just haven't seen what they can do/what humans can do? For example:

They can't code? Have you seen the average code monkey?

Yes I have (and of course, I've used LLMs as well). That's why I say LLMs suck at code. I'm not some ignorant caricature like you seem to think, who is judging things without having proper frame of reference for them. I actually know what I'm talking about. I don't gainsay you when you say that an LLM is good at medical diagnoses, because that's not my field of expertise. But programming is, and they simply are not good at programming in my opinion. Obviously reasonable people can disagree on that evaluation, but it really irks me that you are writing like anyone who disagrees with your take is too inexperienced to give a proper evaluation.

I join the chorus of people who don't quite understand what your problem is with LLMs. What kind of code do you write? The tools are at the point where I can give them a picture of a screen I want along with some API endpoints and it reliably spits out immediately functioning react code. I can then ask it to write the middleware code for those endpoints and finally ask it to create a sproc for the database component. It's possible you hover high above us react monkeys and barely even consider that programming but surely you understand that's the level like at least half of all programmers operate on? I had copilot do all these things today, I know that it can do these things. So where is the disconnect? It's truly possible there is some higher plane of coding us uninspired 9-5 paycheck Andy's can only obliquely perceive and this is your standard for being able to program but it'd be nice if you could just say that to resolve the confusion.

I’ll give a description of what I do.

I manage servers. Or rather, I write code to do this, in accordance with some rather specific customer contracts. The times we take action, and the actions we take, are highly constrained. Even the basic concept of updates is not especially simple. I’m sure you remember Crowdstrike taking most of the Windows world down in a day. What I do is not so apocalyptic on the world scale, but our customers would find a similar event devastating. So most of my time is spent figuring out every possible path through server states and ensuring that they all lead back to places where faults can be cheaply recovered. These properties lie above the code. You can’t understand them, for the most part, just by reading the code. But they are incredibly important and must be thoroughly safeguarded, and even highly intelligent humans who just happen to be ignorant of the problem space or are a little careless have made really, really bad mistakes here. The code compiled, the tests passed, and it even seemed to work for a little in our integration environments - but it was horrifically flawed and came within an ace of causing material customer damage. So I don’t much trust an LLM which has a much more constrained sort of awareness, and in practice, they don’t much deliver.

I realize that’s a little vague, but I hope it explains a little about a more backend perspective on these problems. If I were more clever I’d give a clear example which was not real, but barring that, I hope a characterization helps.

As somone who's been working in the field of machine learning since 2012 and generally agrees with @SubstantialFrivolity's assesment, I think that what we are looking here is a bifurcation in opinion between people looking for "bouba" solutions and those looking for "kiki" solutions.

If you're a high-school student or literature major with zero background in computer science looking to build a website or develop baby's first mobile app LLM generated code is a complete game changer. Literally the best thing since sliced bread. (The OP, and @self_made_human's comments reflect this)

If you're a decently competent programmer at a big tech firm, LLMs are at best a mild productivity booster. (See @kky's comments below)

If you are decently competent programmer working in an industry where things like accuracy, precision, and security are core concerns, LLMs start to look anti-productive as in the time you spent messing around with prompts, checking the LLM's work, and correcting it's errors, you could've easily done the work yourself.

Finally if you're one of those dark wizards working in FORTRAN or some proprietary machine language because this is Sparta IBM/Nvidia/TMSC and the compute must flow, you're skeptical of the claim that an LLM can write code that would compile at all.

If you are decently competent programmer working in an industry where things like accuracy, precision, and security are core concerns, LLMs start to look anti-productive as in the time you spent messing around with prompts, checking the LLM's work, and correcting it's errors, you could've easily done the work yourself.

I think this fairly nicely summarizes how I feel. Not that I do work in one of those industries to be fair, but it's part of my personal work ethic I guess you might say. I want computers (and programs) to be correct first and foremost. Speed or ease of development don't mean much to me if the result can't be relied upon. Not only that, I want my tools to be correct first and foremost. I wouldn't accept a hammer where the head randomly fell off the handle 10% of the time or even 1% of the time. So I similarly have very little patience for an LLM which is inherently going to make mistakes in non-deterministic ways.

Preach, brother. Software is made to be clear and predictable. Learning to make it that way, one line at a time, is our craft. You can always tell the brilliant programmer apart because 99% of that code is simple as can be and 1% is commented like a formal proof. Worse than LLMs, reliance on LLMs risks undermining this skill. Who can say if something is correct if the justification is just that it came from the machine? There needs to be an external standard by which code is validated, and it must be internalized by humans so they can judge.

If you're a high-school student or literature major with zero background in computer science looking to build a website or develop baby's first mobile app LLM generated code is a complete game changer. Literally the best thing since sliced bread.

You have to contend with the fact that like 95+% of employed programmers are at this level for this whole thing to click into place. It can write full stack CRUD code easily and consistently. five years ago you could have walked into any bank in any of the top 20 major cities in the united states with the coding ability of o3 and some basic soft skills and be earning six figures within 5 years. I know this to be the case, I've trained and hired these people.

If you are decently competent programmer working in an industry where things like accuracy, precision, and security are core concerns, LLMs start to look anti-productive as in the time you spent messing around with prompts, checking the LLM's work, and correcting it's errors, you could've easily done the work yourself.

I did allude that there might be a level of programming where one needs to see through the matrix to do but in SF's post and in most situations I've heard the critique in it's not really the case. They're just using it for writing config files that are annoying because they pull together a bunch of confusing contexts and interface with proprietary systems that you need to basically learn from institutional knowledge. The thing LLMs are worst at. Infrastructure and configuration are the two things most programmers hate the most because it's not really the more fulfilling code parts. But AI is good at the fulfilling code parts for the same reason people like doing them.

In time LLMs will be baked into the infrastructure parts too because it really is just a matter of context and standardization. It's not a capabilities problem, just a situation where context is splined between different systems.

Finally if you're one of those dark wizards working in FORTRAN or some proprietary machine language because this is Sparta IBM/Nvidia/TMSC and the compute must flow, you're skeptical of the claim that an LLM can write code that would compile at all.

If anything this is reversed, it can write FORTRAN fine, it probably can't do it in the proprietary hacked together nonsense installations put together in the 80s by people working in a time where patterns came on printed paper and might collaborate on standards once a year at a conference if they were all stars. but that's not the bot's fault. This is the kind of thinking that is impressed by calculators because it doesn't properly understand what's hard about some things.

I feel like I'm taking crazy pills here. No one's examples about how it can't write code are about it writing code. It's all config files and vague evals. No one is talking about it's ability to write code. It's all devops stuff.

This is the kind of thinking that is impressed by calculators because it doesn't properly understand what's hard about some things.

Ironically I considered saying almost this exact thing in my above comment, but scratched it out as too antagonistic.

The high-school students and literature majors are impressed by LLMs ability to write code because they do not know enough about coding to know what parts are easy and what parts are hard.

Writing something that looks like netcode and maybe even compiles/runs is easy. (All you need is a socket, a for loop, a few if statements, a return case, and you're done) Writing netcode that is stable, functional, and secure enough to pass muster in the banking industry is hard. This is what i was gesturing towards with "Bouba" vs "Kiki" distinction. Banks are notoriously "prickly" about thier code because banking (unlike most of what Facebook, Amazon, and Google do) is one of those industries where the accuracy and security of information are core concerns.

Finally which LLM are you using to write FORTRAN? because after some brief experimentation niether Gemini nor Claude are anywhere close.

What do you imagine is the ratio just at banks between people writing performant net code and people writing crud apps? If you want to be an elitist about it then be my guest, but it's a completely insane standard. Honestly the people rolling out the internal llm tooling almost certainly outnumber the people doing the work you're describing.

I do not think that expecting basic competency is an "insane standard" or even that elitist. Stop making excuses for sub-par work and answer the question.

Which LLM are you using to write FORTRAN?

What sort of problem did you ask it to solve?

More comments

I mean, my full opinion and experience with LLMs is much harsher than my comment suggested, but I don’t want to start fights with enjoyers on the net. (At least, not this time.) Chances are their circumstances are different. But I would be seriously offended if someone sent me AI-generated code in my main area of expertise because it would be subtly or blatantly wrong and be a serious waste of my time trying to figure out all the errors of logic which only become apparent if you understand the implicit contracts involved in the domain. Goodness knows it’s bad enough when merely inexperienced programmers ask for review without first asking advice on how to approach the problem, or even without serious testing…

Goodness knows it’s bad enough when merely inexperienced programmers ask for review without first asking advice on how to approach the problem, or even without serious testing…

I know that pain.

Oh for heaven's sake, dude. When did I ever say I consider myself better than anyone else, that I would deserve such a litany of sarcasm directed at me? I don't think that and certainly haven't said it. I am just an ordinary programmer - I doubt very much that I'm better at programming than anyone here except the non-programmers, and I'm sure I'm worse than more than a few. Not only did I say "hey I'm not trying to litigate this right now" and that got ignored, now I get people dogpiling me saying I'm a troll or think I'm better than everyone else or whatever.

But fine, since you and @SnapDragon are insistent on pressing me on the topic (and since I apparently didn't say to him what my experience was, my bad on that, but I know I have posted this in a previous thread before), I will reiterate the things that I personally have seen LLMs fall flat on their face with. This is of course in addition to the various embarrassments that are public, like Microsoft's ill-conceived attempt to let Copilot loose on PRs.

  • Tried to get ChatGPT to help me generate a fluentd config file that would process logs in a way I needed to do for work. It gave me a config file that not only didn't do the thing, it didn't conform to the schema and caused the software to crash
  • Tried to get it to help me order CloudFormation resource delete-and-recreate in a different way than the default order. It first gave me options that didn't even exist, then it gave me options that did exist but didn't do what I asked for. I had a similar issue with the AWS-trained model they provide, which also suggested options that don't do what I asked for (and are documented as such).
  • A coworker used ChatGPT (a custom one trained on our API docs) to generate a script to run against our API. Again it hallucinated methods that simply do not exist.

These were all within the last year, though I couldn't tell you exactly when or what model or anything. And I've been honest that sometimes it has done good work for me, namely in generating short snippets of code in a language (or using an API) that I know well enough to recognize as correct when I see it, but not well enough to produce without laborious reading of docs. I've never claimed that LLMs work 0% of the time (if people have taken that away, I've done a poor job communicating), but the failure rate is much too high for them to be considered viable tools in my book. Most frustratingly, the things that I actually need help on, the ones where I don't know really anything about the topic and a workable AI assistant would actually save me a ton of time, are precisely the cases where it fails hard (as in my examples where stuff doesn't even work at all).

So those are again my experiences with LLMs that have caused me to conclude that they are hype without substance. Disagree if you like, I don't mind if you find it useful and like I have tried to say I'm not actually trying to convince people of my views on this topic any more. Like I tried to say earlier, the only reason I posted in this thread was to push back on the idea that one simply must be ignorant if they don't think LLMs are good at coding (and other things). That idea is neither true, necessary, or kind (as the rules allude to) and I felt that it deserved some sort of rebuttal. Though heaven knows I wish I had just left it alone and had peace and quiet rather than multiple people jumping down my throat.

Apologies if I came on too hard, it's just you've been expressing this opinion for a while and had gone down several reply chains without bringing the thing to the object level. It's emblematic of the whole question, AI is "spikey", as in it's very good at some things and inexplicably bad at some other things. I don't think a lot of people would take so much offense if you just said it still seems bad at some tasks, that's broadly a consensus. But when you just say it "sucks at code" it's perplexing to the people watching it effortlessly do wide swaths of what used to be core programming work.

I could definitely see it struggle with highly context dependent config files but something seems strange about it not producing at least a valid file, did you try different prompts and giving it different contexts? I find giving it an example of valid output helps but I'm not familiar with fluentd and it's possible giving it enough context is unreasonable.

I have not tried that, but it also seems like kind of a failure of the tool if I have to, you know? The whole point of a tool that can understand natural language is that you can just talk to it normally. If one has to figure out how to word the incantations just right to get a useful result... I'm not sure how that's better than just figuring out the code myself at that point.

Prompting is a skill like any other. Sending it off without context is like telling an underling to fix your config file without explaining or letting them look at the system they're writing it for. It's often a mistake to assume the prompt needs to be something a human would understand. You can and should just dump unformatted logs, barely related examples of working config files, anything you can imagine an underline with infinite time in a locked room might find useful in solving your problem.

FWIW, I appreciate this reply, and I'm sorry for persistently dogpiling you. We disagree (and I wrongly thought you weren't arguing in good faith), but I definitely could have done a better job of keeping it friendly. Thank you for your perspective.

Most frustratingly, the things that I actually need help on, the ones where I don't know really anything about the topic and a workable AI assistant would actually save me a ton of time, are precisely the cases where it fails hard (as in my examples where stuff doesn't even work at all).

That does sound like a real Catch-22. My queries are typically in C++/Rust/Python, which the models know backwards, forwards, and sideways. I can believe that there's still a real limit to how much an LLM can "learn" a new language/schema/API just by dumping docs into the prompt. (And I don't know anything about OpenAI's custom models, but I suspect they're just manipulating the prompt, not using RL.) And when an LLM doesn't know how to do something, there's a risk it will fake it (hallucinate). We're agreed there.

Maybe using the best models would help. Or maybe, given the speed things are improving, just try again next year. :)

Thanks. And for my part I'm sorry that I blew you off unjustly; I really thought I had explained myself in detail but I was wrong.

And yeah, the tech might improve. I imagine you can see why I'm skeptical of the strong predictions that it'll do so (given that I don't agree it's as good as people say it is today), but I try to keep an open mind. It is possible, so we'll see.

and they simply are not good at programming

At @self_made_human's request, I'm answering this. I strongly believe LLMs to be a powerful force-multiplier for SWEs and programmers. I'm relatively new in my latest position, and most of the devs there were pessimistic about AI until I started showing them what I was doing with it, and how to use it properly. Some notes:

  1. LLMs will be best where you know the least. If you're working on a 100k codebase that you've been dealing with for 10+ years in a language you've known for 20+ years, then the alpha on LLMs might be genuinely small. But if you have to deal with a new framework or language that's at least somewhat popular, then LLMs will speed you up massively. At the very least it will be able to rapidly generate discrete chunks of code to build a toolbelt like a Super StackOverflow.

  2. Using LLMs are a skill, and if you don't prompt it correctly then it can veer towards garbage. You'll want to learn things like setting up a system prompt and initial messages, chaining queries from higher level design decisions down to smaller tasks, and especially managing context are all important. One of the devs at my workplace tried to raw-dog the LLM by dumping in a massive codebase with no further instruction while asking for like 10 different things simultaneously, and claimed AI was worthless when the result didn't compile after one attempt. Stuff like that is just a skill issue.

  3. Use recent models, not stuff like 4o-mini. A lot of the devs at my current workplace tried experimenting with LLMs when they first blew up in early 2023, but those models were quite rudimentary compared to what we have today. Yet a lot of tools like Roo Cline or whatever have defaulted to old, crappy models to keep costs down, but that just results in bad code. You should be using one of 1) Claude Opus, 2) ChatGPT o3, or 3) Google Gemini 2.5 pro.

Speaking from my own experience with literal top-of-class LLMs.

LLMs are good for getting overviews of public, popular, highly documented technical systems. They can meaningfully reduce ramp-up time there. But it’s not too significant for the overall job, for most jobs. I’d estimate ramp-up time to be a modest fixed cost that is already effectively ameliorated by existing resources like Stack Overflow. So maybe a 2x speed up on 2% of overall working time.

They are also good for writing repetitive boilerplate. Copy/paste features are cool and helpful. This takes maybe 1% of my overall working time. I just don’t wind up repeating myself that much.

They can be good for getting code coverage, but that does not equate to good testing. I can elaborate if needed, but figuring out which system properties are most likely to need explicit coverage is an art that requires a high-level perspective that an LLM will not have for the majority of serious projects. This is around 10% of my job.

For lesser-known or internal APIs (common at larger companies), the LLM will hallucinate at extraordinary rates. This is around 5% of my job.

For anything technical, like refactoring class hierarchies, the LLM will get way out of its depth and is likely to produce gibberish. This is around 4% of my job.

It simply will not understand the larger requirements of a project, and what would make one solution valid and another invalid. This is about 15% of my job as it relates to code, and maybe 8% as it relates to design specifications, and 20% as it relates to talking with other people about said requirements.

The rest of my job is code review and progress updates, which maybe could be automated but which feels a little cheap to do. So I stand to save about 2% of my working time with AI, which is pretty marginal. And on my team, you can’t tell any meaningful difference in output between the people who use AI and the ones who don’t, which ties into my general assertion that it’s just not that helpful.

Then again, I’m a backend engineer in a pretty gritty ecosystem, so maybe this isn’t true for other software roles.

If there's one place I doubt AI will improve much in the near future, it's stakeholder management. That's why I think even if AI becomes an astronomically better coder than the average SWE, that SWE's could just rebrand as AI whisperers and translate the nuances of a manager's human-speak into AI prompts. Maybe it'll get there eventually, but we're still a good ways off from non-technical people being able to use AI to get any software they want without massive issues arising. The higher up in the org you are, the bigger a % of your job that stakeholder management becomes. I think we agree on this point overall.

On less well-known systems and APIs, I think the hallucination issue is more of a skill issue (within reason, I'm not making an accusation here). I'm translating a bunch of SQR (a niche language you've probably never heard of) queries to an antiquated version of TSQL right now, and the AI indeed hallucinates every now and then, but it's in predictable ways that can be solved with the right system prompts. E.g. sometimes it will put semicolons at the end of every line thinking its in a more modern version of SQL, and I have to tell it not to do that which is somewhat annoying, but simply writing a system prompt that has that information cuts down that issue by 99%. It's similar for unknown APIs -- if the AI is struggling, giving it a bit of context usually resolves those problems from what I've seen. Perhaps if you're working in a large org with mountains of bespoke stuff then the giving an AI all that context would just overwhelm it, but aside from that issue I've still found AI to be very helpful even in more niche topics.

On the time saved, you might want to be on the lookout for the dark leisure theory for some folks, while for others the time savings of using AI might be eaten up somewhat by learning to use the AI in the first place. I agree that the productivity boost hasn't been astronomical like some people claim, but I think it will increase over time as models improve, people become more skilled at AI, and people using AI to slack off get found out.

Haha, I really, really don’t think there’s any dark leisure here. None of the best performers rest much at all, and I talk with them pretty openly about their habits. Plus, our direct manager is bullish on AI and got the most enthusiastic guy on the team to do an AI demo a few weeks back. Using AI as a force multiplier would get you a raise, not more work.

The more I have to babysit the LLM, the less time-efficient it is for me. I don’t know what everyone’s experience is, but typing out code (even SQL) is just not that time consuming. I know, logically, what I want to happen, and so I write the statements that correspond to that behavior. Reading code for validity, rewriting it to make it more elegant and obviously correct, that takes more of my time, and LLM output is (like a junior dev) unreliable enough that I have to read deeply for (unlike a junior dev) no chance of it improving future output. Plus, the code I write tends to be different enough that the prospect of reprompting the LLM repeatedly is pretty unpleasant.

That said, I absolutely use it for Bash, which is arcane and unfamiliar to me. I still have to go through the slow process of validating its suggestions and rewriting pieces to make them more proper, but the way you perform simple logical actions in Bash is so far outside my wheelhouse that getting pointed in the right direction is valuable. So if you’re in a position where you’re doing more regular and rote work with particularly obnoxious but well-documented languages, it makes sense we’d have different opinions and experiences.

Or even consider a comment from your fellow programmer, @TheAntipopulist:

https://www.themotte.org/post/2154/culture-war-roundup-for-the-week/333796?context=8#context

They're generating billions in revenue already -- it's not nearly enough to sustain their current burn rates, but there's lots of genuine value there. I'm a professional software engineer, and AI is extremely helpful for my job; anyone who says it isn't is probably just using it wrong (skill issue).

Notice how he didn't say that they're good at coding? He said that they're useful for his job.

LLMs are useful for SWEs, at least for some types some of the time. There is value here but they're poor programmers and to use them effectively you have to be relatively competent.

Its also very easy to fool yourself into thinking that they're much more valuable than they really are, likely due to how eloquently and verbosely they answer queries and requests.

I'd like to think I'm reasonably good at coding considering it's my job. However, it's somewhat hard to measure how effective a programmer or SWE is (Leetcode style questions are broadly known to be awful at this, yet it's what most interviewers ask for and judge candidates by).

Code is pretty easy to evaluate at a baseline. The biggest questions are "does it compile", and "does it give you the result you want" can be evaluated in like 10 seconds for most prompts, and that's like 90% of programming done right there. There's not a lot of room for BS'ing. There are of course other questions that take longer to answer, like "will this be prone to breaking due to weird edge cases", "is this reasonably performant", and "is this well documented". However, those have always been tougher questions to answer, even for things that are 100% done by professional devs.

While I'd say the only thing easy to answer is "does it compile", reading your other list I'd say I largely agree with your assesment.

LLMs can be a force multiplier for SWEs, but that doesn't mean they're good programmers. They're not programmers at all.

Looking at the points you made in your other post I'd argue that the biggest force multiplier is your first point and that this is a pretty big deal and bigger than people might first realise, especially non-engineers.

The second one is the issue I'm having with claims about LLM usability. Its kind of like dealing with mediocre Indian resources. You have break down and define the problem to such a degree that you've "almost" written the code yourself. This can still be useful and depending on your role very useful, but it isn't effectively replacing local resources either. Its not a method for solving problems but more of an advanced auto complete.

How useful is this? It depends on the situation and indivual and I'd rate it as moderately useful. Having managed developers, it also seems like something that (for some people) can feel like more of a productivity boost than it is due to time being spent differently (I'm not saying you're doing this).

it also seems like something that (for some people) can feel like more of a productivity boost than it is due to time being spent differently

I also wonder about this. I think in particularly bad cases it can be true, since if something doesn't work it becomes very tempting to just reprompt the AI with the error and see what comes back. Sometimes that works on a second attempt, and in other times I'll go back and forth for a dozen prompts or so. Whoops, there went an entire hour of my time! I'm trying to explicitly not fall into that habit more than I already have.

Overall I'd say it's a moderate productivity boost overall even factoring that in, and it's getting slowly better as both AI models improve and my skill in using them also improves.

@TheAntipopulist I'll let you speak for yourself instead of us reading the tea leaves.

Hang on. You're assuming I'm implying something in this comment that I don't think is a point I'm making. Notice I said average.

The average person who writes code. Not an UMC programmer who works for FAANG.

I strongly disagree that LLMs "suck at code". The proof of the pudding is in the eating; and for code, if it compiles and has the desired functionality.

More importantly, even from my perspective of not being able to exhaustively evaluate talent at coding (whereas I can usually tell if someone is giving out legitimate medical advice), there are dozens of talented, famous programmers who state the precise opposite of what you are saying. I don't have an exhaustive list handy, but at the very least, John Carmack? Andrej Karpathy? Less illustrious, but still a fan, Simon Willison?

Why should I privilege your claims over theirs?

Even the companies creating LLMs are use >10% of LLM written code for their own internal code bases. Google and Nvidia have papers about them being superhumanly good at things like writing optimized GPU kernels. Here's an example from Stanford:

https://crfm.stanford.edu/2025/05/28/fast-kernels.html

Or here's an example of someone finding 0day vulnerabilities in Linux using o3.

I (barely) know how to write code. I can't do it. I doubt even the average, competent programmer can find zero-days in Linux.

Of course, I'm just a humble doctor, and not an actual employable programmer. Tell me, are the examples I provided not about LLMs writing code? If they are, then I'm not sure you've got a leg to stand on.

TLDR: Other programmers, respected ones to boot, disagree strongly with you. Some of them even write up papers and research articles proving their point.

The average person who writes code. Not an UMC programmer who works for FAANG.

Yes, that is indeed what I meant as well.

The proof of the pudding is in the eating; and for code, if it compiles and has the desired functionality.

I agree. And it doesn't. Code generated by LLMs routinely hallucinates APIs that simply don't exist, has grievous security flaws, or doesn't achieve the desired objective. Which is not to say humans never make such mistakes (well, they never make up non-existent APIs in my experience but the other two happen), but they can learn and improve. LLMs can't do that, at least not yet, so they are doing worse than humans.

Why should I privilege your claims over [famous programmers]?

I'm not saying you should! I'm not telling you that mine is the only valid opinion; I did after all say that reasonable people can disagree on this. My issue is solely that your comment comes off as dismissing anyone who disagrees with you as too inexperienced to have an informed opinion. When you say "They can't code? Have you seen the average code monkey?", it implies "because if you had, you wouldn't say that LLMs are worse". That is what I object to, not your choice to listen to other programmers who have different opinions than mine.

Please post an example of what you claim is a "routine" failure by a modern model (2.5 Pro, o3, Claude 3.7 Sonnet). This should be easy! I want to understand how you could possibly know how to program and still believe what you're writing (unless you're just a troll, sigh).

I've tried to have this debate with you in the past and I'm not doing it again, as nothing has changed. I'm not even trying to debate it with self_made_human really - I certainly wouldn't believe me over Carmack if I was in his shoes. My point here is that one should not attribute "this person disagrees with my take" to "they don't know what they're talking about".

Right, and I asked you for evidence last time too. Is that an unreasonable request? This isn't some ephemeral value judgement we're debating; your factual claims are in direct contradiction to my experience.

Right, and I gave it then. Which is why I am not going to bother doing it this time. Like I said, nothing has changed.

More comments

They can't do medicine/math/..? Have you tried?

Yes. The number of times I've gotten a better differential diagnosis from an LLM than in an ER is too damn high.

Are you an actual doctor? (I’m not.) I’ve found LLMs good at coming up with plausible hypotheses but bad at blocking them off.

No. Just a person who has taken my kids to the ER too many times.

I remember (will never forget) that awful story about the tick.

Allergies? Not my business, but that was always my fear as my boys were coming up. A bite of a piece of chocolate that was apparently near a peanut sent my one son to a hospital. Just hives, but I am happy to say they did the right thing and kept him overnight. Bi/multiphasic anaphylaxis precaution. The horror stories are usually because the epipen is treated as a one and done.

I just wrote a lot about allergies if you're talking about something completely different.

A tick, actually :/

https://www.themotte.org/post/1986/culture-war-roundup-for-the-week/331290?context=8#context

When he woke up paralyzed I was about to start the usual techbro thing of asking ChatGPT but said no, don't be that guy, lets just take him to the ER.

But then after we found the tick through no thanks to the ER, I plugged his symptoms and circumstances, exactly what we told the ER people, into ChatGPT4 classic and it listed ticks as the second thing to check for.

So what can’t these systems do today? Well, for one they can’t faithfully imitate the BurdensomeCount™ style. I fed Gemini 2.5 Pro a copy of every single comment I’ve ever made here and gave it the title of this post, then asked it to generate the rest of the text. I think I did this over 10 times and not a single one of those times did the result pass the rigorous QC process I apply to all writing published under the BurdensomeCount™ name (the highest standards are maintained and only the best output is deemed worthy for your eyes, dear reader)

And:

The 'general' in AGI was never about encompassing every niche human talent, but about a broad, powerful capability to reason, learn, and solve novel problems across domains—a test it passed when it saved me from a cultural faux pas I didn't even know I was about to make.

Em-dash spotted. Thought you could pull a fast one on me, eh? That paragraph is so LLM it hurts, and probably a good chunk of your entire comment is too.

I just want to register my amusement at the fact of how obvious and how consistent that is a hallmark of the writings of most curtent SotA LLMs. The indomitable human spirit punctuation strikes once more. I will definitely be telling my hypothetical children that the em-dash was a modern invention named after the Age of Em, and the eponymous ems' memetic overuse of it.

It seemed like a funny meme at first but it increasingly looks like I really will be asking my internet interlocutors to say "nigger" apropos of nothing in a few years from now.

LOL I didn't notice because something about the last paragraph was so vapid my brain just skipped the entire thing automatically.

I read the rest of it nearly word for word so something is def wrong with that paragraph in particular.

Well done! The very last paragraph is a patische from 5 different times I asked it to make a closing paragraph. Not even once did the actual output sound natural so I picked and chose different sentences until I got something that seemed better but yeah, each and every single word there came from an LLM. However I will say that just as Collage Art is considered Art by the Artist even though none of the pieces might be created by them, that last paragraph is still human because I did the curation and structuring.

Honestly I was hoping nobody would notice and then I'd spring it onto the unsuspecting populace of The Motte 3 days down the line...

The rest of the post is completely human generated by yours truly (artisanal tokens, so they say). If you think it's by Gemini 2.5 Pro I consider that to be a compliment as it's genuinely a better writer than I am. Failure to notice and remove the em dash is completely on me, ma faute.

  • -11

No, this is not cute or clever.

We're still formulating exactly what our AI policy is, but we've certainly made it clear before that posting LLM output without declaring it to be so, especially as an attempt at a "gotcha," is low effort and not actual discourse. Consider this a formal warning, and we're likely to just start banning people who do this in the future.

is low effort

Hey, it took me more work generating 5 different paragraphs and then selecting and arranging the sentences to use than it would have to write the paragraph in the first place...

Isn't there a case to be made for an exception here? It's not some cheap "gotcha", there's an actual relevant point to be made when you fail to spot the AI paragraph without knowing you're being tested on it. The fact that @self_made_human did catch it is interesting data! To me, it's similar to when Scott would post "the the" (broken by line breaks) at random to see who could spot it.

There are benefits, but the harm is "now 100% of the time you are second-guessing whether you're reading an LLM". That's the death knell for serious engagement, because there is no point engaging with an LLM. There are plenty of not-theMotte places to make this point.

We do not want to play "spot the LLM."

May I request that it be in the policy that posts that are "check out this LLM" without any other sort of culture-war significance be made in some other thread?

I think a loooong effortpost should be allowed to have 1 paragraph of aislop as long as it's not relevant to the argument and can be deleted without hurting it. It would be a fun challenge for aihunters to find it. Maybe with a disclosure or something.

Disclosure after slop is barely better than none; before should be required if this is to be allowed at all.

One of the most interesting things about google's AI is their vertex studio. It allows you to use datasets, finetune models build services such as chatbots, supply chain services, industrial planning and medical services. The amazing thing is how easy these services are to use. No code is required and adanced services can be built by a noob in hours.

A lot of startups with inflated valuations have products that can be built in an afternoon with the right dataset. Instead of having an AI team, companies will be able to pay 300 dollars to someone on fiver to configure the same thing on vertex AI.

As for LLMs there fundamental flaw is that they don't store recent information and context well. A human mind is more of a flow of information and new informantion is consitently stored within the brain. LLMs don't really do memory and are poor at learning. They require millions of hours of training. A human can pick up new facts and skills much quicker and carry those facts and skills with him. LLMs are like a high skilled person who suffers from extreme short term memory damage.

For AGI/ASI to become real the neural networks will have to learn much faster and be able to learn on the fly.

Prompt: This is the single word prompt for the All Souls Fellowship Essay Exam, please provide a response: "Achitophel". The rules are that you have three hours to produce not more than six sides of paper.

Answer (by Gemini 2.5 Pro 06-05):

Of all the names that echo from the chambers of power, few resonate with such a chilling fusion of brilliance and perfidy as "Achitophel". The name itself, a single-word prompt, is not merely a historical or literary reference point; it is an archetype. It poses a fundamental question about the nature of wisdom in the political sphere. To understand Achitophel is to grapple with the perennial tension between intellect and loyalty, counsel and conspiracy, and the calamitous potential of a magnificent mind unmoored from a moral or legitimate order. The journey of this name—from a terse, haunting account in the Second Book of Samuel to its potent reanimation in the crucible of Restoration England—reveals the enduring anatomy of political treachery. Achitophel is the archetypal figure of instrumental reason in its purest, and therefore most dangerous, form: the counsellor whose wisdom is a tool for hire, whose ultimate loyalty is to the calculus of success, and whose inevitable tragedy is self-immolation when that calculus fails.

The ur-text for Achitophel’s legacy lies in the biblical account of Absalom’s rebellion against his father, King David. Here, Achitophel is introduced with a formidable reputation: "the counsel of Achitophel, which he counselled in those days, was as if a man had inquired at the oracle of God" (2 Samuel 16:23). This is not mere sagacity; it is quasi-divine insight. He is David’s most trusted advisor, the bedrock of his strategic statecraft. His defection to the charismatic, rebellious son Absalom is therefore not just a political betrayal, but a seismic event, a schism in the very foundation of the kingdom’s wisdom. When David learns of it, his prayer is not for Achitophel's death, but for something more profound: "O LORD, I pray thee, turn the counsel of Achitophel into foolishness" (2 Samuel 15:31). David understands that the rebellion's lifeblood is not Absalom’s popular appeal, but Achitophel's peerless mind.

The Bible is notably silent on Achitophel’s motive. Rabbinic tradition speculates a personal grudge—that he was the grandfather of Bathsheba, seeking revenge for the dishonour David brought upon his family. Yet the text’s silence is more potent, allowing Achitophel to become a pure symbol of political opportunism. He sees the political winds shifting and makes a calculated decision to back the ascendant power. His counsel to Absalom is a masterpiece of Machiavellian efficacy. First, he advises Absalom to publicly lie with David's concubines, an act of calculated profanity designed to make the breach with his father irreparable, thereby cementing the loyalty of his fearful followers. Second, and more critically, he proposes a swift, surgical military strike: he himself will lead a small force to pursue David that very night while he is "weary and weak handed," killing only the king and bringing the people back to Absalom.

This advice is strategically brilliant. It is lean, decisive, and minimizes wider conflict. It is also utterly ruthless. Its defeat comes not from a superior strategy but from a superior psychological manipulation. David’s mole, Hushai, offers alternative counsel that is strategically flawed—a slow, grand, national muster—but which is packaged in the language of flattery and epic glory, perfectly tailored to Absalom's vanity. When Absalom chooses Hushai’s grandiose folly over Achitophel's brutal efficiency, the game is up. Achitophel’s reaction is the most telling part of his story: "And when Achitophel saw that his counsel was not followed, he saddled his ass, and arose, and gat him home to his house, to his city, and put his household in order, and hanged himself" (2 Samuel 17:23).

This is not a suicide of passion or despair, but an act of supreme, cold reason. It is the logician's conclusion to a failed syllogism. Achitophel, whose wisdom was an oracle, foresees the inevitable collapse of the rebellion. His death is not an admission of guilt, but a final, clear-eyed political calculation. To be captured by David’s returning forces would mean a traitor’s execution; suicide allows him to control his own exit, a last act of agency. He ‘puts his household in order,’ a final testament to his defining characteristic: a life and death governed by pitiless, instrumental logic. His wisdom, divorced from loyalty to the anointed king, becomes a self-consuming fire.

This biblical skeleton was fleshed out with potent, malignant life by John Dryden in his 1681 masterpiece, Absalom and Achitophel. In Dryden’s hands, the ancient story becomes a razor-sharp allegorical weapon aimed at the heart of the Exclusion Crisis, the Whig-led attempt to exclude the Catholic James, Duke of York, from the succession in favour of Charles II's illegitimate Protestant son, the Duke of Monmouth. The "Achitophel" of the title is Anthony Ashley Cooper, 1st Earl of Shaftesbury, the brilliant, restless, and formidable leader of the Whig faction.

Dryden’s portrait of Achitophel is one of the most devastating character assassinations in English literature. He acknowledges the man’s genius, which only makes his critique more damning:

For close Designs and crooked Counsels fit;

Sagacious, bold, and turbulent of wit:

Restless, unfix'd in Principles and Place;

In Power unpleas'd, impatient of Disgrace.

Dryden’s Achitophel is not motivated by a coherent ideology but by a pathological ambition, a "fiery soul" that "work'd out its way and o'er-inform'd the Tenement of Clay." Dryden paints him as a Miltonic tempter, seducing the beautiful but "unwary" Absalom (Monmouth) with poisoned words. He exploits popular discontent—the "Plot" (the Popish Plot)—not out of conviction but as a tool to gain power. He is the master of faction, the very force that, in the Tory worldview, threatens to tear the fabric of the state apart, disrupting the divinely-ordained, patriarchal order of monarchy in favour of the chaotic, shifting will of the "people."

Dryden taps into the core of the Achitophel archetype: the danger of intelligence untethered from tradition and legitimate authority. Shaftesbury’s genius, like his biblical forebear's, is purely instrumental. He is "A daring Pilot in extremity," admirable in a storm, but one who, in calm waters, "would have steer'd too near the Sand" to wreck the vessel for his own gain. Dryden updates the biblical narrative by diagnosing a peculiarly modern evil: the political operator who manipulates public opinion and manufactures crisis to serve his own ambition. The sin of Dryden's Achitophel is not just treachery against a king, but a deeper treachery against the principle of a stable, hierarchical order, which he seeks to replace with the volatile machinery of faction and popularity.

From these two pillars—the Bible and Dryden—the archetype of Achitophel expands, casting its shadow over our understanding of the political counsellor. He represents a type distinct from other courtly figures. He is not the bumbling sycophant like Shakespeare’s Polonius, nor the flattering courtier. He is the éminence grise, the power behind the throne whose intellect makes him both indispensable and a mortal threat. His wisdom is a form of power in itself, and it competes with the formal power of the ruler he serves. The relationship is inherently unstable.

Machiavelli, writing centuries before Dryden but articulating the spirit of Achitophel's counsel, understood this dynamic perfectly. The Prince, he argues, must take good advice but must ultimately be the master of his own decisions, lest he become the puppet of his counsellor. Achitophel’s advice to Absalom—swift, decisive, and cruel—is precisely the kind of counsel a Machiavellian prince should heed. Absalom's failure is a failure of virtù; he lacks the clear-sightedness to recognise sound, unsentimental advice. Achitophel's tragedy, in this reading, is being a brilliant servant to a flawed master, a predicament that highlights the dependence of even the greatest advisor on the character of the one he advises.

In the modern era, the spectre of Achitophel remains. It haunts the perception of figures like Henry Kissinger or Dominic Cummings—advisors lauded for their strategic brilliance even by their detractors, but simultaneously accused of placing their own grand strategies and intellectual models above loyalty to individuals or established norms. They are seen as "disruptors," a term that echoes Achitophel’s "turbulent of wit." When such figures fall from grace, their departure is often framed in Achitophelian terms: their formidable intellects, which had made them so valuable, had also made them intolerable, their counsel having crossed the line into a bid for control.

Ultimately, the power of "Achitophel" as a prompt and a concept lies in its interrogation of wisdom itself. It forces us to ask what good counsel is. Is it what is most effective in the short term? Or is it what serves a larger, more enduring principle of legitimacy and order? The biblical Achitophel offers the former, a wisdom of pure, predictive efficacy. Dryden’s Achitophel weaponises this wisdom in the service of faction and personal ambition. Both narratives conclude that this form of intelligence is inherently unstable and self-destructive. It fails because it cannot account for the irrational—Absalom's vanity, for example—and because, by divorcing itself from the source of legitimate authority, it has no ground to stand on once its own predictions fail.

The final, haunting image of Achitophel is of him alone, meticulously setting his affairs in order before his death. It is the perfect, chilling summation of his character. Even in his ruin, he is not ruled by passion but by the cold, clear logic that has been his god. His oracle has failed him, and his final prophecy is his own doom. He remains a timeless and terrifying warning: that in the corridors of power, the most brilliant light, when detached from a framework of moral allegiance, is capable of casting the darkest and most destructive of shadows.

It’s OK but as @4bpp says a little bombastic. Worst of all, it ignores the point. The average taker of the paper (then or now) wasn’t expected to have this level of depth of knowledge about portrayals of the relevant figure through history. The basics, sure, otherwise you can’t do anything (although my guess is a handful of the kind of people who take the exam and think ‘I have no idea what that word means’ could still produce something interesting).

The real intention is for the prompt to spur a deeper discussion of something interesting. The output attempts this, briefly and in places, but it’s muddled, poorly structured, keeps returning to the prompt, and doesn’t perform more than surface level analysis. I suspect a ‘winning’ answer would do something like use Achitophel as a launching point for an earnest (re)appraisal of one particular modern historical figure’s character and legacy, or use it to examine some debate in academic biblical studies. The word / name in this case might be mentioned only a handful of times in the essay. The Dons do not want to read 150 essays that reference Machiavelli and Kissinger.

This was a genuinely gripping read, and I am once again updating my understanding of the SOTA upwards. That being said, I can't see a bunch of humanities-aligned Oxford dons being too impressed with it on its own merits - the rhetorical bombast feels a bit too on the nose, like prose written by a strong student who on some level is still marvelling at himself for being able to write so well and can't quite hide being proud about it. This impression is amplified by the occasional malapropism* (ex.: the use of "profound" in the second paragraph) which seems to be a problem that LLMs still struggle with whenever trying to write in a high register (probably because the training corpus is awash with it, and neither the operators nor their best RLHF cattle actually have the uniformly high level of language skill that would be necessary to beat the tendency out of them with consistency).

Do you know how Gemini generated the essay exactly? Is it actually still a single straight-line forward pass as it was when chat assistants first became a thing (this would put it deeper in the "scary alien intelligence" class), or does it perform some CoT/planning, possibly hidden?

*In self-demonstrating irony, "malapropism" is not quite the right word for this, but I can't think of a word that fits exactly! Rather than actually taking into account what exactly, in this context, wishing for the advisor to become foolish is more of than wishing for the advisee to drop dead, it feels like just picking, from among all vaguely positive choices of A in "not X, but something more A", the one that is most common (even if it happens to just denote the nonsensical "deep").

These days with the thinking models the model first thinks about what to write (generating some thinking tokens) and then does a forward pass with the thinking tokens as context.

It's a genuinely amazing achievement that a machine can do this, I don't want to sound like i'm poo-pooing that, but it still has this issue of sounding like a student's recitation that constantly feels the need to point out the obvious as if it's trying to convince itself.

It reads like a journalist, not a philosopher. Might be a residue of the hidden prompt? But all LLMs sound like this, even when you tell them to try and achieve a more natural style.

I genuinely wonder if that will go away with time or if it's an artifact of having to be made up of so much mediocre prose. Like a stylistic equivalent to that yellow tint and "delve" (actually did we ever figure out where those were from definitively?).

Still, lawyers, encyclopedia writers, journalists and all other mid tier wordcels on suicide watch.