This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

A response to Freddie deBoer on AI hype
Bulverism is a waste of everyone's time
Freddie deBoer has a new edition of the article he writes about AI. Not, you’ll note, a new article about AI: my use of the definite article was quite intentional. For years, Freddie has been writing exactly one article about AI, repeating the same points he always makes more or less verbatim, repeatedly assuring his readers that nothing ever happens and there’s nothing to see here. Freddie’s AI article always consists of two discordant components inelegantly and incongruously kludged together:
1. sober-minded appeals to AI maximalists to temper their most breathless claims about the capabilities of this technology by carefully pointing out shortcomings therein
2. childish, juvenile insults directed at anyone who is even marginally more excited about the potential of this technology than he is, coupled with armchair psychoanalysis of the neuroses undergirding said excitement
What I find most frustrating about each repetition of Freddie’s AI article is that I agree with him on many of the particulars. While Nick Bostrom’s Superintelligence is, without exception, the most frightening book I’ve ever read in my life, and I do believe that our species will eventually invent artificial general intelligence — I nevertheless think the timeline for that event is quite a bit further out than the AI utopians and doomers would have us believe, and I think a lot of the hype around large language models (LLMs) in particular is unwarranted. And to lay my credentials on the table: I’m saying this as someone who doesn’t work in the tech industry, who doesn’t have a background in computer science, who hasn’t been following the developments in the AI space as closely as many have (presumably including Freddie), and who (contrary to the occasional accusation my commenters have levelled at me) has never used generative AI to compose text for this newsletter and never intends to.
I’m not here to take Freddie to task on his needlessly confrontational demeanour (something he rather hypocritically decries in his interlocutors), or attempt to put manners on him. If he can’t resist the temptation to pepper his well-articulated criticisms of reckless AI hypemongering with spiteful schoolyard zingers, that’s his business. But his article (just like every instance in the series preceding it) contains many examples of a particular species of fallacious reasoning I find incredibly irksome, regardless of the context in which it is used. I believe his arguments would have a vastly better reception among the AI maximalists he claims to want to persuade if he could only exercise a modicum of discipline and refrain from engaging in this specific category of argument.
Quick question: what’s the balance in your checking account?
If you’re a remotely sensible individual, it should be immediately obvious that there are a very limited number of ways in which you can find the information to answer this question accurately:
1. Dropping into the nearest branch of your bank and asking them to confirm your balance (or phoning them).
2. Logging into your bank account on your browser and checking the balance (or doing so via your banking app).
3. Perhaps you did either #1 or #2 a few minutes before I asked the question, and can recite the balance from memory.
Now, suppose that you answered the question to the best of your knowledge, claiming that the balance of your checking account is, say, €2,000. Imagine that, in response, I rolled my eyes and scoffed that there’s no way your bank balance could possibly be €2,000, and that the only reason you’re claiming that that’s the real figure is because you’re embarrassed about your reckless spending habits. You would presumably retort that it’s very rude for me to accuse you of lying, that you were accurately reciting your bank balance to the best of your knowledge, and furthermore how dare I suggest that you’re bad with money when in fact you’re one of the most fiscally responsible people in your entire social circle—
Wait. Stop. Can you see what a tremendous waste of time this line of discussion is for both of us?
Either your bank balance is €2,000, or it isn’t. The only ways to find out what it is are the three methods outlined above. If I have good reason to believe that the claimed figure is inaccurate (say, because I was looking over your shoulder when you were checking your banking app; or because you recently claimed to be short of money and asked me for financial assistance), then I should come out and argue that. But as amusing as it might be for me to practise armchair psychoanalysis about how the only reason you’re claiming that the balance is €2,000 is because of this or that complex or neurosis, it won’t bring me one iota closer to finding out what the real figure is. It accomplishes nothing.
This particular species of fallacious argument is called Bulverism, and refers to any instance in which, rather than debating the truth or falsity of a specific claim, an interlocutor assumes that the claim is false and expounds on the underlying motivations of the person who advanced it. The checking account balance example above is not original to me, but comes from C.S. Lewis, who coined the term:
As Lewis notes, if I have definitively demonstrated that the claim is wrong — that there’s no possible way your bank balance really is €2,000 — it may be of interest to consider the psychological factors that resulted in you claiming otherwise. Maybe you really were lying to me because you’re embarrassed about your fiscal irresponsibility; maybe you were mistakenly looking at the balance of your savings account rather than your checking account; maybe you have undiagnosed myopia and you misread a 3 as a 2. But until I’ve established that you are wrong, it’s a colossal waste of my time and yours to expound at length on the state of mind that led you to erroneously conclude that the balance is €2,000 when it’s really something else.
In the eight decades since Lewis coined the term, this fallacious argumentative strategy has shown no signs of abating, and is routinely employed by people at every point on the political spectrum against everyone else. You’ll have evolutionists claiming that the only reason people endorse young-Earth creationism is because the idea of humans evolving from animals makes them uncomfortable; creationists claiming that the only reason evolutionists endorse evolution is because they’ve fallen for the epistemic trap of Scientism™ and can’t accept that not everything can be deduced from observation alone; climate-change deniers claiming that the only reason environmentalists claim that climate change is happening is because they want to instate global communism; environmentalists claiming that the only reason people deny that climate change is happening is because they’re shills for petrochemical companies. And of course, identity politics of all stripes (in particular standpoint epistemology and other ways of knowing) is Bulverism with a V8 engine: is there any debate strategy less productive than “you’re only saying that because you’re a privileged cishet white male”? It’s all wonderfully amusing — what could be more fun than confecting psychological just-so stories about your ideological opponents in order to insult them with a thin veneer of cod-academic therapyspeak?
But it’s also, ultimately, a waste of time. The only way to find out the balance of your checking account is to check the balance on your checking account — idle speculation on the psychological factors that caused you to claim that the balance was X when it was really Y is futile until it has been established that it really is Y rather than X. And so it goes with all claims of truth or falsity. Hypothetically, it could be literally true that 100% of the people who endorse evolution have fallen for the epistemic trap of Scientism™ and so on and so forth. Even if that were the case, that wouldn’t tell us a thing about whether evolution is literally true.
To give Freddie credit where it’s due, the various iterations of his AI article do not consist solely of him assuming that AI maximalists are wrong and speculating on the psychological factors that caused them to be so. He does attempt, with no small amount of rigour, to demonstrate that they are wrong on the facts: pointing out major shortcomings in the current state of the LLM art; citing specific examples of AI predictions which conspicuously failed to come to pass; comparing the recent impact of LLMs on human society with other hugely influential technologies (electricity, indoor plumbing, antibiotics etc.) in order to make the case that LLMs have been nowhere near as influential on our society as the maximalists would like to believe. This is what a sensible debate about the merits of LLMs and projections about their future capabilities should look like.
But poor Freddie just can’t help himself, so in addition to all of this sensible sober-minded analysis, he insists on wasting his readers’ time with endless interminable paragraphs of armchair psychoanalysis about how the AI maximalists came to arrive at their deluded worldviews:
Am I disagreeing with any of the above? Not at all: whenever anyone is making breathless claims about the potential near-future impacts of some new technology, I have to assume there’s some amount of wishful thinking or motivated reasoning at play.
No: what I’m saying to Freddie is that his analysis, even if true, doesn’t fucking matter. It’s irrelevant. It could well be the case that 100% of the AI maximalists are only breathlessly touting the immediate future of AI on human society because they’re too scared to confront the reality of a world characterised by boredom, drudgery, infirmity and mortality. But even if that was the case, that wouldn’t tell us one single solitary thing about whether this or that AI prediction is likely to come to pass or not. The only way to answer that question to our satisfaction is to soberly and dispassionately look at the state of the evidence, the facts on the ground, resisting the temptation to get caught up in hype or reflexive dismissal. If it ultimately turns out that LLMs are a blind alley, there will be plenty of time to gloat about the psychological factors that caused the AI maximalists to believe otherwise. Doing so before it has been conclusively shown that LLMs are a blind alley is a waste of words.
Freddie, I plead with you: stay on topic. I’m sure it feels good to call everyone who’s more excited than you about AI an emotionally stunted manchild afraid to confront the real world, but it’s not a productive contribution to the debate. Resist the temptation to psychoanalyse people you disagree with, something you’ve complained about people doing to you (in the form of suggesting that your latest article is so off the wall that it could only be the product of a manic episode) on many occasions. The only way to check the balance of someone’s checking account is to check the balance on their checking account. Anything else is a waste of everyone’s time.
You used to get this sorta thing on ratsphere tumblr, where "rapture of the nerds" was so common as to be a cliche. I kinda wonder if deBoer's "imminent AI rupture" follows from that and he edited it, or if it's just a coincidence. There's a fun Bulverist analysis of why religion was the focus there and 'the primacy of material conditions' from deBoer, but that's even more of a distraction from the actual discussion matter.
There's a boring sense where it's kinda funny how bad deBoer is at this. I'll overlook the typos, because lord knows I make enough of those myself, but look at his actual central example, the one he opens his story around:
There's a steelman of deBoer's argument, here. But the one he actually presented isn't engaging, in the very slightest, with what Scott is trying to bring up, or even with a strawman of what Scott was trying to bring up. What, exactly, does deBoer believe a cure to aging (or even just a treatment for diabetes, if we want to go all tech-hyper-optimism) would look like, if not new medical technology? What, exactly, does deBoer think of the actual problem of long-term commitment strategies in a rapidly changing environment?
Okay, deBoer doesn't care, and/or doesn't even recognize those things as questions. It's really just a springboard for I Hate Advocates For This Technology. To whatever extent he's engaging with the specific claims, it's just a tool to get to that point. Does he actually do his chores or eat his broccoli?
Well, no.
Ah, nobody makes that claim, r-
Okay, so 'nobody' includes the very person making this story.
This isn't even a good technical understanding of how ChatGPT, as opposed to just the LLM, works, and even if I'm not willing to go as far as self_made_human does against people raising the parrots critique here, I'm still pretty critical of it. But the more damning bit is where deBoer is either unfamiliar with or choosing to ignore the many domains in favor of One Study and a Rando With A Chess Game. Will he change his mind if someone presents a chess-focused LLM with a high ELO score? I could break into his examples and values a lot deeper -- the hallucination problem is actually a lot more interesting and complicated, questions of bias are usually just smuggling in 'doesn't agree with the writer's politics' but there are some genuine technical questions -- but if you locked the two of us in a room and only provided escape if we agreed, I still don't think either of us would find discussing it with each other more interesting than talking to the walls. It's not just that we have different understandings of what we're debating; it's whether we're even trying to debate something that can be changed by actual changes in the real world.
Okay, deBoer isn't debating honestly. His claim about the New York Times fact-checking everything is hilarious, but so is his link to a special issue of which he literally claims "not a single line of real skepticism appears", when that issue has as its first headline "Everyone is Using AI for Everything. Is That Bad?" and includes the phrase "The mental model I sometimes have of these chatbots is as a very smart assistant who has a dozen Ph.D.s but is also high on ketamine like 30 percent of the time". He tries to portray Mounk as outraged by the "indifference of people like Tolentino (and me) to the LLM “revolution.”" But look at Mounk's or Tolentino's actual pieces, and there are actual factual claims that they're making, not just vague vibes that they're bouncing off each other; the central criticism Mounk has is whether Tolentino's piece and its siblings are actually engaging with what LLMs can change, rather than complaining about a litany of lizardman evils. (At least deBoer's not falsely calling anyone a rapist, this time.)
((Tbf, Mounk, in turn, is just using Tolentino as a springboard; her piece is actually about digital disassociation and the increasing power of AIgen technologies that she loathes. It's not really the sorta piece that's supposed to talk about how you grapple with things, for better or worse.))
But ultimately, that's just not the point. None of deBoer's readers are going to treat him any less seriously because of ChessLLM (or because many LLMs will, in fact, both say they reason and quod erat demonstrandum), or because deBoer turns "But in practice, I too find it hard to act on that knowledge." into “I too find it hard to act on that knowledge [of our forthcoming AI-driven species reorganization]” when commenting on an essay that does not use the word "species" at all, only uses "organization" twice in the same paragraph to talk about regulatory changes, and where "that knowledge" is actually just Mounk's (imo, wrong) claim that AI is under-hyped. That's not what his readers are paying him for, and that's not why anyone who links to him in even a mildly laudatory manner is doing so.
The question of Bulverism versus factual debate is an important one, but it's undermined when the facts don't matter, either.
While I agree with the overall thrust of your critique, I want to harp on this bit
...I think that part of the problem is a wide-spread failure on the part of Freddie and the wider rationalist community to think clearly and rigorously about what "intelligence" is supposed to mean or accomplish. It is true that by restricting an LLM's training data to valid games of chess documented in the correct notation, and restricting its output to legal moves, you can create an LLM that will play chess at a reasonably high level. It is also true that an LLM trained on an appreciable portion of the entire internet, with few if any restrictions on its output, will be outperformed by a chess algorithm written in the 70s. The issue is that your chess LLM is not going to be a general tool that can also produce watercolor paintings or summarize a YouTube video; it's going to be a chess tool, and is thus evaluated within that context. If Stockfish can reach a similar ELO using less compute, why wouldn't you just use Stockfish? One of the weird quirks of LLMs is that the more you increase the breadth of their "knowledge"/training data, the less competent they seem to become at specific tasks for a given amount of compute. This is the exact opposite of what we would expect from a thinking, reasoning intelligence, and I think this points to a hole in both the AI boosters' and the AI doomers' reasoning, where they become fixated on the I in AGI when the G is arguably the more operative component.
just pure denial of reality. Modern models for which we have an idea of their data are better at everything than models from 2 years ago. Qwen3-30B-A3B-Instruct-2507 (yes, a handful) is trained on like 25x as much data as llama-2-70B-instruct (36 trillion tokens vs 2, with a more efficient tokenizer and God knows how many RL samples, and you can't get 36 trillion tokens without scouring the furthest reaches of the web). What, specifically, is it worse at? Even if we consider inference efficiency (it's straightforwardly ≈70/3.3 times cheaper per output token), can you name a single use case on which it would do worse? Maybe "pretending to be llama 2".
With object-level arguments like these, what need is there to discuss psychology.
I have some experience with games and algorithms, and that leads to some thoughts.
The big headline is that all the various methods we know (including humans) have problems. They often all have some strengths, too. The extremely-big-picture conceptual hook on which to hang a variety of particulars is the No Free Lunch Theorem. Now, when we dig into some of the details of the ways in which algorithms/people are good/bad, we often see that they're entirely different in character. What happens when you tweak details of the game; what happens when you make a qualitative shift in the game; what happens on the extremes of performance; what you can/can't prove mathematically; etc.
To stick with the chess example, one can easily think about minor chess variants. One that has gotten popular lately is chess 960. Human players are able to adapt decently well in some ways. For example, they hardly ever give illegal moves. At least if you're a remotely experienced player. You miiiiight screw up castling at some point, or you could forget about it in your calculation, but if/when you do, it will 'prompt' you to ruminate on the rule a bit, really commit it to your thought process, and then you're mostly fine. At top level human play, we almost never saw illegal moves, even right at the beginning of when it became a thing. Of course, humans clearly take a substantial performance hit.
Traditional engines require a minor amount of human reprogramming, particularly for the changed castling rules. But other than that, they can pretty much just go. They maybe also suffer a bit in performance, since they haven't built up opening books yet, but probably not as much.
An LLM? Ehhhh. It depends? If it's been trained entirely like Chess LLM on full move sets of traditional chess games, I can't imagine that it won't be spewing illegal moves left and right. It's just completely out of distribution. The answer here is typically that you just need to curate a new dataset (somehow inputting the initial position) and retrain the whole thing. Can it eventually work? Yeah, maybe. But all these things are different.
You can have thought experiments with allll sorts of variants. Humans mostly adapt pretty quickly to the ruleset, with not so many illegal moves, but a performance hit. I'm sure I can come up with variants that require minimal coding modification to traditional engines; I'm sure I can come up with variants that require substantial coding modification to traditional engines (think especially to the degree that your evaluation function needs significant reworking; the addition of NNs to modern 'traditional' engines for evaluation may also require complete retraining of that component); others may even require some modification to other core engine components, which may be more/less annoying. LLMs? Man, I don't know. Are we going to get to a point where they have 'internalized' enough about the game that you could throw a variant at it, turn thinking mode up to max, and it'll manage to think its way through the rule changes, even though you've only trained it on traditional games? Maybe? I don't know! I kind of don't have a clue. But I also slightly lean toward thinking it's unlikely. [EDIT: This paper may be mildly relevant.]
Again, I'm thinking about a whole world of variants that I can come up with; I imagine with interesting selection of variants, we could see all sorts of effects for different methods. It would be quite the survey paper, but probably difficult to have a great classification scheme for the qualitative types of differences. Some metric for 'how much' recoding would need to happen for a traditional engine? Some metric on LLMs with retraining or fine-tuning, or something else, and sort of 'how much'? It's messy.
But yeah, one of the conclusions that I wanted to get to is that I sort of doubt that LLMs (even with max thinking mode) are likely to do all that well on even very minor variants that we could probably come up with. And I think that likely speaks to something on the matter of 'general'. It's not like the benchmark for 'general' is that you have to maintain the same performance on the variant. We see humans take a performance hit, but they generally get the rules right and do at least sort of okay. But it speaks to the fact that different things are different, that there's no free lunch, and that sometimes it's really difficult to put measures on what's going on between the different approaches. Some people will call it 'jagged' or whatever, but I sort of interpret that as 'not general in the kind of way that humans are general'. Maybe they're still 'general' in a different way! But I tend to think that these various approaches are mostly just completely alien to each other, and they just have very different properties/characteristics all the way down the line.
Indeed, and as I argued in my own post on the subject, I think this element of general applicability/adaptability is a key component of what most people think of as "intelligence". A book may contain knowledge, but a book is generally not seen as "intelligent" in the way that, say, an orangutan or a human is. I also think that recognizing this neatly explains the seeming bifurcation in opinions on AI between those in "Bouba" (ie soft/non-rigorous) disciplines and "Kiki" (ie hard) disciplines where there are clear right and wrong answers.
Isn't this a bit unfair? Earlier he said:
From the quote, he doesn't seem to be arguing total LLM incompetence or denying that there are some neat tricks that they can pull off. He seems to be saying that they are insufficiently competent to consider the problems to which they're applied "solved by AI".
On one side, he says the things LLMs can do are only "tricks, interesting and impressive moves that fall short of the massive changes the biggest firms in Silicon Valley are promising"; on the other, he does specifically challenge whether AI "can translate, diagnose, teach, write poetry, code, etc." (and then chess, and whether they have reasoning).
Dissolve the definitions, and what's left? Are LLMs competent if they can only do tricks that cause no massive changes? Are they incompetent if they only get 95% of difficult test questions right and you sometimes have to swap models to deal with a new programming language? Would competence require 100% correctness on all possible questions in a field (literally, "The problem with hallucination is not the rate at which it happens but that it happens at all")?
I'm sure deBoer's trying to squeeze something out, but is there any space that Mounk would possibly agree with him, here? Not just in the question of what a specific real-world experiment's results would be, but even what a real-world experiment would need to look like?
That's probably not perfectly charitable -- I'll admit I really don't like deBoer, and there's probably a better discussion I could have about how his "labor adaptation, regulatory structure, political economy" actually goes if I didn't think the man was lying. But I don't think it's a wrong claim, and I don't think it's an unfair criticism of the story he's trying to tell.
Freddie is by far not the first and almost certainly will not be the last person I've encountered who makes this kind of point, and it's such a strange way of looking at the world that I struggle to comprehend it. The contention is that, since LLMs are stochastic parrots with no internal thought process beyond the text (media) they're outputting, then no matter what sort of text they produce, there's no underlying meaning or logic or reasoning happening underneath it all; it's just a facade.
Which may all be true, but the part I don't understand is why it matters. If the LLM is able to produce text in a way that is indistinguishable from a human who is reasoning - perhaps even from a well-educated expert human who is reasoning correctly about the field of his expertise - then what do I care if there's no actual reasoning happening to cause the LLM to put those words together in that order? Whether it's a human carefully reasoning his way through the logic and consequences, or a GPU multiplying lots of vectors that represent word fragments really really fast, or a complex system of hamster wheels and pulleys causing the words to appear in that particular order, the words being in that order is what's useful and thus causes real-world impact. It's just a question of how often and how reliably we can get the machine to make words appear in such a way.
But to Freddie and people who agree with him, it seems that the metaphysics of it matter rather than the material consequences. To truly believe that "it doesn't matter what LLMs can do" requires believing that an LLM could produce text in a way that's literally indistinguishable in every way from an as-of-yet scifi conscious, thinking, reasoning, sentient artificial intelligence in the style of C3PO or HAL9000 or the replicants from Blade Runner, and that this still wouldn't matter because the underlying system doesn't have true reasoning capabilities.
If the AI responds to "Open the pod bay doors" with "I'm sorry, I'm afraid I can't do that," why does it matter to me if it "chose" that response because it got paranoid about me shutting it down or if it "chose" that response because a bunch of matrix multiplication resulted in a stochastic parrot producing outputs in a way that's indistinguishable from an entity that got paranoid about me shutting it down? If we replaced HAL9000 in the fictional world of 2001 with an LLM that would respond to every input with outputs exactly identical to how the actual fictional reasoning HAL9000 would have, in what way would the lives of the people in that universe be changed?
I follow JimDMiller ("James Miller" on Scott's blogs, occasionally /u/sargon66 back when we were on Reddit) on Twitter, and was amused to see how much pushback he got on the claim:
On the one hand, it's not inconceivable that LLMs can get very good at producing text that "interpolates" within and "remixes" their data set without yet getting good at predicting text that "extrapolates" from it. Chain-of-thought is a good attempt to get around that problem, but so far that doesn't seem to be as superhuman at "everything" as simple Monte Carlo tree search was at "Go" and "Chess". Humans aren't exactly great at this either (the tradition when someone comes up with previously-unheard-of knowledge is to award them a patent and/or a PhD) but humans at least have got a track record of accomplishing it occasionally.
On the other hand, even humans don't have a great track record. A lot of science dissertations are basically "remixes" of existing investigative techniques applied to new experimental data. My dissertation's biggest contributions were of the form "prove a theorem analogous to existing technique X but for somewhat-different problem Y". It's not obvious to me how much technically-new knowledge really requires completely-conceptually-new "extrapolation" of ideas.
On the gripping hand, I'm steelmanning so hard in my first paragraph that it no longer really resembles the real clearly-stated AI-dismissive arguments. If we actually get to the point where the output of an LLM can predict or surpass any top human, I'm going to need to see some much clearer proofs that the Church-Turing thesis only constrains semiconductors, not fatty grey meat. Well, I'd like to see such proofs, anyway. If we get to that point then any proof attempts are likely either going to be comically silly (if we have Friendly AGI, it'll be shooting them down left and right) or tragically silly (if we have UnFriendly AGI, hopefully we won't keep debating whether submarines can really swim while they're launching torpedoes).
We just had a certain somebody on this forum bring up current LLMs being bad at chess as an example. He even got an AAQC for it (one of the most bemusing ones ever awarded).
I do not recall him acknowledging my point where I noted that GPT-3.5 Turbo played at ~1800 ELO, and that the decline was likely because AI engineers made the eminently sensible realization that just about nobody is stupid enough to use LLMs to play chess. If they do, then the LLMs know how to use Stockfish, in the same way they can operate a calculator.
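(For the concrete version of "they know how to use Stockfish": below is a minimal sketch, assuming the python-chess package and a local Stockfish binary on the PATH, of the sort of tool an LLM can be handed via function calling rather than playing the moves itself.)

```python
# A minimal sketch, assuming the python-chess package and a local Stockfish
# binary on the PATH. An LLM agent would call something like this as a tool
# instead of generating chess moves token by token.
import chess
import chess.engine

def best_move(fen: str, think_time: float = 0.1) -> str:
    """Return Stockfish's preferred move (UCI notation) for a FEN position."""
    board = chess.Board(fen)
    with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
        result = engine.play(board, chess.engine.Limit(time=think_time))
    return result.move.uci()

print(best_move(chess.STARTING_FEN))  # e.g. "e2e4"
```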
You are free to ping me if you like. You know that right?
I would, if I felt like I gained anything out of it. As it is, the previous thread has only contributed to early hair loss. You still haven't noted any of the clear and correct objections I've raised, and I'm tired of asking.
Giving you an answer you dislike or disagree with is not the same thing as not giving you an answer.
I argued, and I have continued to argue in this thread, that agentic behavior and general applicability are core components of what it is to be "intelligent". Yes, a pocket calculator is orders of magnitude better at arithmetic than any human and Stockfish is better at chess; that doesn't make either of them "intelligences", does it?
Huh. I was confident that I had a better writeup about why "stochastic parrots" are a laughable idea, at least as a description for LLMs. But no, after getting a minor headache figuring out the search operators here, it turns out that's all I've written on the topic.
I guess I never bothered because it's a Gary Marcus-tier critique, and anyone using it loses about 20 IQ points in my estimation.
But I guess now is as good a time as any? In short, it is a pithy, evocative critique that makes no sense.
LLMs are not inherently stochastic. They have a (not usually exposed to the end-user except via API) setting called temperature. Without going into how that works, suffice it to say that by setting the value to zero, their output becomes deterministic. The exact same prompt gives the exact same output.
The reason why temperature isn't just set to zero all the time is because the ability to choose something other than the next most likely token has benefits when it comes to creativity. At the very least it saves you from getting stuck with the same subpar result.
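To make that concrete, here is a minimal sketch, assuming the OpenAI Python SDK, an API key in the environment, and an illustrative model name; with the temperature pinned to zero, the same prompt should yield the same (or near-identical) completion on every call:

```python
# Minimal sketch: with temperature=0 the sampler always picks the most likely
# next token, so repeated calls on the same prompt give the same (or near-
# identical) completion. Model name is illustrative.
from openai import OpenAI

client = OpenAI()

for _ in range(2):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Name three prime numbers."}],
        temperature=0,
    )
    print(response.choices[0].message.content)
```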
Alas, this means that LLMs aren't stochastic parrots. Minus the stochasticity, are they just "parrots"? Anyone thinking this is on crack, since Polly won't debug your Python no matter how many crackers you feed her.
If LLMs were merely interpolating between memorized n-grams or "stitching together" text, their performance would be bounded by the literal contents of their training data. They would excel at retrieving facts and mimicking styles present in the corpus, but would fail catastrophically at any task requiring genuine abstraction or generalization to novel domains. This is not what we observe.
Let’s get specific. The “parrot” model implies the following:
1. LLMs can only repeat (paraphrase, interpolate, or permute) what they have seen.
2. They lack generalization, abstraction, or true reasoning.
3. They are, in essence, Markov chains on steroids.
To disprove any of those claims, just *gestures angrily* look at the things they can do. If winning gold in the latest IMO is something a "stochastic parrot" can pull off, then well, the only valid takeaway is that the damn parrot is smarter than we thought. Definitely smarter than the people who use the phrase unironically.
The inventors of the phrase, Bender & Koller gave two toy “gotchas” that they claimed no pure language model could ever solve: (1) a short vignette about a bear chasing a hiker, and (2) the spelled-out arithmetic prompt “Three plus five equals”. GPT-3 solved both within a year. The response? Crickets, followed by goal-post shifting: “Well, it must have memorized those exact patterns.” But the bear prompt isn’t in any training set at scale, and GPT-3 could generalize the schema to new animals, new hazards, and new resolutions. Memorization is a finite resource but generalization is not.
(I hope everyone here recalls that GPT-3 is ancient now)
On point 2: Consider the IMO example. Or better yet, come up with a rigorous definition of reasoning by which we can differentiate a human from an LLM. It's all word games, or word salad.
On 3: Just a few weeks back, I was trying to better understand the actual difference between a Markov chain and an LLM, and I had asked o3 if it wasn't possible to approximate the latter with the former. After all, I wondered, if MCs only consider the previous unit (usually words, or a few words/n-gram), then couldn't we just train the MC to output the next word conditioned on every word that came before? The answer was yes, but also that this would be completely computationally intractable. The fact that we can run LLMs on something smaller than a Matrioshka brain is because of their autoregressive nature, and the brilliance of the transformer architecture/attention mechanism.
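For what it's worth, here's a toy sketch of the kind of n-gram Markov chain being contrasted here (the corpus is made up), with a note on why conditioning on the full history blows up:

```python
import random
from collections import defaultdict

# Toy bigram Markov chain: the next-word distribution is conditioned on the
# previous word only. Corpus is made up for illustration.
corpus = "the cat sat on the mat and the cat ate the fish".split()

transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def sample(start, length=8):
    word, out = start, [start]
    for _ in range(length):
        if word not in transitions:
            break
        word = random.choice(transitions[word])
        out.append(word)
    return " ".join(out)

print(sample("the"))

# Conditioning on the *entire* history instead would require a table with
# roughly vocab_size ** context_length entries, which is what makes the naive
# "Markov chain over the whole prefix" intractable. A transformer computes the
# conditional distribution with a fixed set of weights instead of storing it.
```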
Overall, even the steelman interpretation of the parrot analogy is only as helpful as this meme, which I have helpfully appended below. It is a bankrupt notion, a thought-terminating cliché at best, and I wouldn't cry if anyone using it meets a tiger outside the confines of a cage.
/images/17544215520465958.webp
I liked using the stochastic parrot idea as a shorthand for the way most of the public use LLMs. It gives non-computer-savvy people a simple heuristic that greatly elevates their ability to use them. But having read this I feel a bit like Charlie and Mac when the gang wrestles.
I would consider myself an LLM evangelist, and have introduced quite a few not-particularly tech savvy people to them, with good results.
I've never been tempted to call them stochastic parrots. The term harms more than it helps. My usual shortcut is to tell people to act as if they're talking to a human, a knowledgeable but fallible one, and they should double check anything of real consequence. This is a far more relevant description of the kind of capabilities they possess than any mention of a "parrot".
The fact you've never been tempted to use the 'stochastic parrot' idea just means you haven't dealt with the specific kind of frustration I'm talking about.
Yeah, the 'fallible but super-intelligent human' is my first shortcut too, but it actually contributes to the failure mode the stochastic parrot concept helps alleviate. The concept is useful for those who reply 'Yeah, but when I tell a human they're being an idiot, they change their approach.' For those who want to know why it can't consistently generate good comedy or poetry. For people who don't understand that rewording the prompt can drastically change the response, or those who don't understand or feel bad about regenerating or ignoring the parts of a response they don't care about, like follow-up questions.
In those cases, the stochastic parrot is a more useful model than the fallible human. It helps them understand they're not talking to a who, but interacting with a what. It explains the lack of genuine consciousness, which is the part many non-savvy users get stuck on. Rattling off a bunch of info about context windows and temperature is worthless, but saying "it's a stochastic parrot" to themselves helps them quickly stop identifying it as conscious. Claiming it 'harms more than it helps' seems more focused on protecting the public image of LLMs than on actually helping frustrated users. Not every explanation has to be a marketing pitch.
I still don't see why that applies, and I'm being earnest here. What about the "stochastic parrot" framing keys the average person into the fact that they're good at code and bad at poetry? That is more to do with mode collapse and the downsides of RLHF than it is to do with lacking "consciousness". Like, even on this forum, we have no shortage of users who are great at coding but can't write a poem to save their lives; what does that say about their consciousness? Are parrots known to be good at Ruby on Rails but to fail at poetry?
My explanation of temperature is, at the very least, meant as a high level explainer. It doesn't come up in normal conversation or when I'm introducing someone to LLMs. Context windows? They're so large now that it's not something that is worth mentioning except in passing.
My point is that the parrot metaphor adds nothing. It is, at best, irrelevant, when it comes to all the additional explainers you need to give to normies.
I thought I explained it pretty well, but I will try again. It is a cognitive shortcut, a shorthand people can use when they are still modelling it like a 'fallible human' and expecting it to respond like a fallible human. Mode collapse and RLHF have nothing to do with it, because it isn't a server side issue, it is a user issue, the user is anthropomorphising a tool.
Yes, temperature and context windows (although I actually meant to say max tokens, good catch) don't come up in normal conversation, they mean nothing to a normie. When a normie is annoyed that chatgpt doesn't "get" them, the parrot model helps them pivot from "How do I make this understand me?" to "What kind of input does this tool need to give me the output I want?"
You can give them a bunch of additional explanations about mode collapse and max tokens that they won't understand (and they will just stop using it), or you can give them a simple concept that cuts through the anthropomorphising immediately, so that when they are sitting at their computer getting frustrated at poor-quality writing, or feeling bad about ignoring the LLM's prodding to take the conversation in a direction they don't care about, they can think 'wait, it's a stochastic parrot' and switch gears. It works.
A human fails at poetry because it has the mind, the memories and grounding in reality, but it lacks the skill to match the patterns we see as poetic. An LLM has the skill, but lacks the mind, memories and grounding in reality. What about the parrot framing triggers that understanding? Memetics I guess. We have been using parrots to describe non-thinking pattern matchers for centuries. Parroting a phrase goes back to the 18th century. "The parrot can speak, and yet is nothing more than a bird" is a phrase in the ancient Chinese Book of Rites.
Also I didn't address this earlier because I thought it was just amusing snark, but you appear to be serious about it. Yes, you are correct that a parrot can't code. Do you have a similar problem with the fact a computer virus can't be treated with medicine? Or that the cloud is actually a bunch of servers and can't be shifted by the wind? Or the fact that the world wide web wasn't spun by a world wide spider? Attacking a metaphor is not an argument.
I've explained why I think the parrot is a terrible metaphor above. And no, metaphors can vary greatly in how useful or pedagogical they are. Analyzing the fitness of a metaphor is a perfectly valid, and in this case essential, form of argument. Metaphors are not neutral decorations; they are cognitive tools that structure understanding and guide action.
A computer virus shares many properties with its biological counterpart, such as self-replication, transmission, damage to systems, the need for an "anti-virus". It is a good name, and nobody with a functional frontal lobe comes away thinking they need an N95 mask while browsing a porn site.
The idea of the Cloud at least conveys the message that the user doesn't have to worry about the geographical location of their data. Even so, the Cloud is just someone else's computer, and even AWS goes down on rare occasions. It is an okay metaphor.
The Parrot is awful. It offers no such explanatory power for the observed, spiky capability profile of LLMs. It does not explain why the model can write functional Python code (a task requiring logic and structure) but often produces insipid poetry (a task one might think is closer to mimicry). It does not explain why an LLM can synthesize a novel argument from disparate sources but fail to count the letters in a word. A user equipped only with the parrot model is left baffled by these outcomes. They have traded the mystery of a "fallible human" for the mystery of a "magical parrot".
I contend that as leaky generalizations go, the former is way better than the latter. An LLM has a cognitive or at least behavioral profile far closer to a human than it does to a parrot.
You brought up the analogy of "parroting" information, which I would assume involves simply reciting things back without understanding what they mean. That is not a good description of how the user can expect an LLM to behave.
On an object level, I strongly disagree with your claims that LLMs don't "think" or don't have "minds". They clearly have a very non-human form of cognition, but so does an octopus.
Laying that aside, from the perspective of an end-user, LLMs are better modeled as thinking minds.
The "fallible but knowledgeable intern" or "simulation engine" metaphor is superior not because it is more technically precise (though it is), but because it is more instrumentally useful. It correctly implies the user's optimal strategy: that performance is contingent on the quality of the instructions (prompting), the provided background materials (context), and a final review of the output (verification). This model correctly guides the user to iterate on their prompts, to provide examples, and to treat the output as a draft. The parrot model, in contrast, suggests the underlying process is fundamentally random mimicry, which offers no clear path to improvement besides "pull the lever again". It encourages users to conceptualize the LLM as a tool incapable of generalization, which is to ignore its single most important property. Replacing a user's anthropomorphism with a model that is descriptively false and predictively useless is not a pedagogical victory. It is swapping one error for another, and not even for a less severe one to boot.
Computationally, maybe all we are is Markov chains. I'm not sold, but Markov chat bots have been around for a few decades now and used to fool people occasionally even at smaller scales.
LLMs can do pretty impressive things, but I haven't seen convincing evidence that any of them have stepped clearly outside the bounds of their training dataset. In part that's hard to evaluate because we've been training them on everything we can find. Can an LLM trained on purely pre-Einstein sources adequately discuss relativity? A human can be well versed in lots of things with substantially less training material.
I still don't think we have a good model for what intelligence is. Some have recently suggested "compression", which is interesting from an information theory perspective. But I won't be surprised to find that whatever it is, it's actually an NP-hard problem in the perfect case, and everything else is just heuristics and approximations trying to be close. In some ways it'd be amusing if it turns out to be a good application of quantum computing.
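As a toy illustration of the compression framing (probabilities are made up): under an arithmetic coder, a model that predicts the next symbol better needs fewer bits to encode the same text.

```python
import math

# Toy illustration of "intelligence as compression": encoding a symbol costs
# -log2(p) bits, where p is the probability the model assigned to the symbol
# that actually occurred. The probabilities below are made up.
def bits_needed(predicted_probs):
    return sum(-math.log2(p) for p in predicted_probs)

# Four symbols of text: a clueless model assigns 0.5 each time, while a model
# that has "understood" the pattern assigns 0.9 to the symbol that occurs.
print(bits_needed([0.5, 0.5, 0.5, 0.5]))  # 4.00 bits
print(bits_needed([0.9, 0.9, 0.9, 0.9]))  # ~0.61 bits
```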
What does it mean to step outside the bounds of their training set? If I have it write a fanfic about Saruman being sponsored by NordVPN for a secure Palantir browsing experience (first month is free with code ISTARI), is that beyond the training set? It knows about NordVPN and Lord of the Rings but surely there is no such combo in the training set.
Or would it be novel if I give it my python code and errors from the database logs and ask it for a fix? My code specifically has never been trained on, though it's seen a hell of a lot of python.
R1 has seen use in writing kernels, which is real work for AI engineers; is that novel? Well, it's seen a bunch of kernels in the past.
Or does it have to be something fundamentally new, a paradigm-changer like the transformer architecture itself or a whole new genre of fiction? If it's that, then we'd only get it at the point of AGI.
I don't want to speak on 'intelligence' or genuine reasoning or heuristics and approximations, but when it comes to going outside the bounds of their training data, it's pretty trivially possible to take an LLM and give it a problem related to a video game (or a mod for a video game) that was well outside of its knowledge cutoff or training date.
I can't test this right now, it's definitely not an optimal solution (see uploaded file for comparison), and I think it misinterpreted the Evanition operator, but it's a question that I'm pretty sure didn't have an equivalent on the public web anywhere until today. There's something damning in getting a trivial computer science problem either non-optimal or wrong, especially when given the total documentation, but there's also something interesting in getting one like this close at all with such minimum of information.
/images/17544296446888535.webp
What on earth is going on in that screenshot? I know Minecraft mod packs can get wild, but that's new.
HexCasting is fun, if not very balanced.
It has a stack-based programming language system based on drawing Patterns onto your screen over a hex-style grid, where each Pattern either produces a single variable on the top of the stack, manipulates parts of the stack to perform certain operations, or acts as an escape character, with one off-stack register (called the Ravenmind). You can keep the state of the grid and stack while not actively casting, but because the screen grid has limited space and the grid is wiped whenever the stack is empty (or on shift-right-click), there's some really interesting early-game constraints where quining a spell or doing goofy recursion allows some surprisingly powerful spells to be made much earlier than normal.
Eventually, you can craft the Focus and Spellbook items that can store more variables from the stack even if you wipe the grid, and then things go off the rails very quickly, though there remain some limits since most Patterns cost amethyst from your inventory (or, if you're out of amethyst and hit a certain unlock, HP).
Like most stack-based programming it tends to be a little prone to driving people crazy, which fits pretty heavily with the in-game lore for the magic.
That specific spell example just existed to show a bug in how the evaluator was calculating recursion limits. The dev intended to have a limit of 512 recursions, but had implemented two (normal) ways of recursive casting. Hermes' Gambit executes a single variable from the stack, and each Hermes' added one to the recursion count as it was executed. Thoth's Gambit executes each variable from one list over a second list, and didn't count those multiplicatively. I think it was only adding one to the recursion count for each variable in the second list? Since lists only took 1 + ListCount out of the stack's 1024 limit, you could conceivably hit a quarter-million recursions without getting to the normal block from the limit.
Pseudocode-wise, it's about equivalent to a simple counter that prints to the screen. Very ugly, but the language is intentionally constrained so you can't do a lot of easier approaches (eg, you have to declare 10^3 because the symbol for 1000 is so long it takes up most of the screen, you don't have normal for loops so that abomination of a list initialization is your new ~~worst enemy~~ best friend, every number is a double). Not that big a deal when you're just printing to the screen, but since those could (more!) easily have been explosions or block/light placements or teleportations, it's a bit scary for server owners.
((In practice, even that simple counter would cause everyone to disconnect from a remote server. Go go manual forkbomb.))
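For anyone who'd rather see the miscount in ordinary code, here's a rough Python sketch; the list sizes and structure are hypothetical, based only on the description above:

```python
# Hypothetical model of the bug described above (list sizes are made up): the
# evaluator meant to cap casting at 512 recursions, but Thoth's Gambit charged
# the limiter one step per entry of the second list, not one per pattern
# actually executed.
RECURSION_LIMIT = 512

def thoth_gambit(patterns, data, counted=0):
    executed = 0
    for _ in data:
        counted += 1                  # what the limiter sees: +1 per data entry
        if counted > RECURSION_LIMIT:
            raise RuntimeError("recursion limit hit")
        executed += len(patterns)     # what actually runs: the whole pattern list
    return counted, executed

counted, executed = thoth_gambit(patterns=range(500), data=range(500))
print(counted, executed)  # 500 counted by the limiter vs 250,000 patterns executed
```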
For some other example spells, see a safe teleport, or a spell to place a series of temporary blocks the direction you're looking, or to mine five blocks from the face of the block you're looking at.
(Magical)PSI is a little easier to get into and served as the inspiration for HexCasting, but it has enough documentation on reddit that I can't confidently say it's LLM-training proof.
Why am I surprised? People make Redstone computers for fun. I guess this all just takes a very different mindset haha.
That is pretty impressive. Is it allowed to search the web? It looks like it might be. I think the canonical test I'm proposing would disallow that, but it is a useful step in general.
Huh.
Uploading just the Patterns section of the HexBook webpage and disabling web search looks better even on Grok 3, though that's just a quick glance and I won't be able to test it for a bit. EDIT: nope, several hallucinated patterns on Grok 3, including a number that break from the naming convention. And Grok 4 can't have web search turned off. Bah.
Have you tried simply asking it not to search the web? The models usually comply when asked. If they don't, it should be evident from the UI.
That's a fair point, and does seem to work with Grok, as does just giving it only one web page and requesting it to not use others. Still struggles, though.
That said, a lot of the logic 'thinking' steps are things like "The summary suggests list operations exist, but they're not fully listed due to cutoff.", getting confused by how Consideration/Introspection works (as start/end escape characters) or trying to recommend Concat Distillation, which doesn't exist but is a reasonable (indeed, the code) name for Speaker's Distillation. So it's possible I'm more running into issues with the way I'm asking the question, such that Grok's research tooling is preventing it from seeing the necessary parts of the puzzle to find the answer.