At the risk of doxxing myself, I have an advanced degree in Applied Mathematics. I have authored and contributed to multiple published papers, and hold a US patent, all related to the use of machine learning in robotics and digital signal processing. I am currently employed as a supervising engineer at a prominent tech company. For pseudonymity's sake I am not going to say which, but it is a name that you would recognize. I say this not to brag, but to establish some context for the following.
Imagine that you are someone who is deeply interested in space flight. You spend hours of your day thinking seriously about Orbital Mechanics and the implications of Relativity. One day you hear about a community devoted to discussing space travel and are excited at the prospect of participating. But when you get there what you find is a Star Trek fan-forum that is far more interested in talking about the Heisenberg compensators on fictional warp-drives than about Hohmann transfers, thrust to ISP curves, or the effects of low gravity on human physiology. That has essentially been my experience trying to discuss "Artificial Intelligence" with the rationalist community.
However, at the behest of users such as @ArjinFerman and @07mk, and because X/Grok is once again in the news, I am going to take another stab at this.
Are "AI assistants" like Grok, Claude, Gemini, and DeepSeek intelligent?
I would say no, and in this post I am going to try to explain why, but to do so requires a discussion of what I think "intelligence" is and how LLMs work.
What is Intelligence
People have been philosophizing on the nature of intelligence for millennia, but for the purposes of our exercise (and my work) "intelligence" is a combination of perceptivity and reactivity. That is to say, the ability to perceive or take in new and/or changing information combined with the ability to change state based on that information. Both are necessary, and neither is sufficient on its own. This is why Mathematicians and Computer Scientists often emphasize the use of terms like "Machine Learning" over "Artificial Intelligence" as an algorithm's behavior is almost never both.
If this definition feels unintuitive, consider it in the context of the following example. What I am saying is that an orangutan who waits until the Zookeeper is absent to use a tool to force the lock on its enclosure is more "intelligent" than the insect that repeatedly throws itself against your kitchen window in an attempt to get outside. While they share an identical goal (to get outside), the orangutan has demonstrated the ability to both perceive obstacles (i.e. the lock and the Zookeeper), and react dynamically to them in a way that the insect has not. Now obviously these qualities exist on a spectrum (try to swat a fly and it will react) but the combination of these two parameters defines an axis along which we can work to evaluate both animals and algorithms, and as any good PM will tell you, the first step to solving any practical engineering problem is to identify your parameters.
Now the most common arguments for AI assistants like Grok being intelligent tend to be some variation on "Grok answered my question, ergo Grok is intelligent." or "Look at this paragraph Claude wrote, do you think you could do better?" but when evaluated against the above parameters, the ability to form grammatically correct sentences and the ability to answer questions are both orthogonal to it. An orangutan and a moth may be equally incapable of writing a Substack, but I don't expect anyone here to seriously argue that they are equally intelligent. By the same token a pocket calculator can answer questions, "what is the square root of 529?" being one example of such, but we don't typically think of pocket calculators as being "intelligent" do we?
To me, these sorts of arguments betray a significant anthropomorphic bias. That bias being the assumption that anything that a human finds complex or difficult must be computationally complex and vice versa. The truth is often the inverse. This bias leads people who do not have a background in math or computer science to have completely unrealistic impressions of what sort of things are easy or difficult for a machine to do. For example, vector and matrix operations are a reasonably simple thing for a computer that a lot of human students struggle with. Meanwhile bipedal locomotion is something most humans do without even thinking, despite it being more computationally complex and prone to error than computing a cross product.
Speaking of vector operations, let's talk about how LLMs work...
What are LLMs
LLM stands for "Large Language Model". These models are a subset of artificial neural networks that use "Deep Learning" (essentially a fancy marketing buzzword for the combination of looping regression analysis with back-propagation) to encode a semantic token such as the word "cat" as an n-dimensional vector representing that token's relationship to the rest of the tokens in the training data. Now in actual practice these tokens can be anything, an image, an audio-clip, or a snippet of computer code, but for the purposes of this discussion I am going to assume that we are working with words/text. This process is referred to as "embedding" and what it does in effect is turn the word "cat" into something that a computer (or grad-student) can perform mathematical operations on. Any operation you might perform on a vector (addition, subtraction, transformation, matrix multiplication, etc...) can now be done on "cat".
Now because these vectors represent the relationship of the tokens to each other, words (and combinations of words) that have similar meanings will have vectors that are directionally aligned with each other. This has all sorts of interesting implications. For instance you can compute the dot product of two embedded vectors to determine whether their words are synonyms, antonyms, or unrelated. This also allows you to do fun things like approximate the vector "cat" using the sum of the vectors "carnivorous" "quadruped" "mammal" and "feline", or subtract the vector "legs" from the vector "reptile" to find an approximation for the vector "snake". Please keep this concept of "directionality" in mind as it is important to understanding how LLMs behave, and it will come up later.
It should come as no surprise that some of the pioneers of this methodology were also the brains behind Google Translate. You can basically take the embedded vector for "cat" from your English language model and pass it to your Spanish language model to find the vector "gato". Furthermore because all you are really doing is summing and comparing vectors you can do things like sum the vector "gato" in the Spanish model with the vector for the diminutive "-ito" and then pass it back to the English model to find the vector "kitten".
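To make the "directionality" idea concrete, here is a minimal sketch using hand-picked 3-dimensional toy vectors. The vectors, the dimension labels, and the helper names are all assumptions made up for illustration; real embeddings are learned from data and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# Dimensions (also invented): (feline-ness, reptile-ness, has-legs).
EMBEDDINGS = {
    "cat":     [1.0, 0.0, 1.0],
    "lizard":  [0.0, 1.0, 1.0],
    "snake":   [0.0, 1.0, 0.0],
    "reptile": [0.0, 1.0, 0.6],
    "legs":    [0.0, 0.0, 1.0],
}

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product normalized by length: 1 = aligned, 0 = unrelated, -1 = opposed."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def subtract(a, b):
    """Element-wise vector subtraction."""
    return [x - y for x, y in zip(a, b)]

def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding points most nearly the same way."""
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

# "reptile" minus "legs" lands closest to "snake" in this toy vocabulary.
legless_reptile = subtract(EMBEDDINGS["reptile"], EMBEDDINGS["legs"])
closest_word = nearest(legless_reptile, exclude=("reptile", "legs"))
```

With these particular toy values `closest_word` comes out as "snake", but that is a property of the numbers chosen here, not evidence about any real model.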
Now if what I am describing does not sound like an LLM to you, that is likely because most publicly available "LLMs" are not just an LLM. They are an LLM plus an additional interface layer that sits between the user and the actual language model. An LLM on its own is little more than a tool that turns words into math, but you can combine it with a second algorithm to do things like take in a block of text and do some distribution analysis to compute the most probable next word. This is essentially what is happening under the hood when you type a prompt into GPT or your assistant of choice.
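As a rough illustration of "compute the most probable next word", here is a deliberately crude sketch that counts which word follows which in a tiny corpus. The corpus and function names are made up for the example, and a bigram counter is a stand-in chosen for brevity; production assistants condition on the whole context with a neural network rather than on a lookup table of the previous word.

```python
from collections import Counter, defaultdict

def build_bigram_counts(text):
    """Count, for every word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def most_probable_next(counts, word):
    """Return (next_word, probability) for the likeliest follower, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    total = sum(followers.values())
    best, n = followers.most_common(1)[0]
    return best, n / total

# Hypothetical toy corpus.
CORPUS = "the cat sat on the mat the cat ate the fish"
counts = build_bigram_counts(CORPUS)
prediction = most_probable_next(counts, "the")  # ("cat", 0.5)
```

Note that the predictor has no notion of whether "cat" is the right continuation, only that it is the most frequent one in what it has seen.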
Our Villain Lorem Epsom, and the Hallucination Problem
I've linked the YouTube video Badness = 0 a few times in prior discussions of AI as I find it to be both a solid introduction to LLMs for the lay-person, and an entertaining illustration of how anthropomorphic bias can cripple the discussion of "alignment". In it the author (who is a professor of Computer Science at Carnegie Mellon) posits a semi-demonic figure (akin to Scott Alexander's Moloch) named Lorem Epsom. The name is a play on the term Lorem Ipsum and represents the prioritization of appearance over all else. When it comes to writing, Lorem Epsom doesn't care about anything except filling the page with text that looks correct. Lorem Epsom is the kind of guy who, if you tell him that he made a mistake in the math, is liable to interpret that as a personal attack. The ideas of "accuracy" "logic" "rigor" and "objective reality" are things that Lorem Epsom has heard of but that do not concern Lorem Epsom. It is very possible that you have had to deal with someone like Lorem Epsom in your life (I know I have), now think back and ask yourself how did that go?
I bring up Lorem Epsom because I think that understanding him provides some insight into why certain sorts of people are so easily fooled/taken in by AI Assistants like Claude and Grok. As discussed in the section above on "What is Intelligence", the assumption that the ability to fill a page with text indicates the ability to perceive and react to a changing situation is an example of anthropomorphic bias. I think that a lot of people, because they are posing their question to a computer, expect the answer they get to be something analogous to what they would get from a pocket calculator rather than from Lorem Epsom.
Sometime circa 2014 I kicked off a heated dispute in the comment section of a LessWrong post by asking EY why a paperclip maximizing AI that was capable of self-modification wouldn't just modify the number of paperclips in its memory. I was accused by him and a number of others of missing the point, but I think they missed mine. The assumption that an Artificial Intelligence would not only have a notion of "truth", but assign value to it is another example of anthropomorphic bias. If you asked Lorem Epsom to maximize the number of paperclips, and he could theoretically "make" a billion-trillion paperclips simply by manipulating a few bits, why wouldn't he? It's so much easier than cutting and bending wire.
In order to align an AI to care about truth and accuracy you first need a means of assessing and encoding truth and it turns out that this is a very difficult problem within the context of LLMs, bordering on mathematically impossible. Do you recall how LLMs encode meaning as a direction in n-dimensional space? I told you it was going to come up again.
Directionally speaking we may be able to determine that "true" is an antonym of "false" by computing their dot product. But this is not the same thing as being able to evaluate whether a statement is true or false. As an example "Mary has 2 children", "Mary has 4 children", and "Mary has 1024 children" may as well be identical statements from the perspective of an LLM. Mary has a number of children. That number is a power of 2. Now if the folks programming the interface layer were clever they might have it do something like estimate the most probable number of children based on the training data, but the number simply cannot matter to the LLM the way it might matter to Mary, or to someone trying to figure out how many pizzas they ought to order for the family reunion, because the "directionality" of one positive integer isn't all that different from any other. (This is why LLMs have such difficulty counting if you were wondering)
In addition to difficulty with numbers there is the more fundamental issue that directionality does not encode reality. The directionality of the statement "Donald Trump is the 47th President of the United States", would be identical regardless of whether Donald Trump won or lost the 2024 election. Directionally speaking there is no difference between a "real" court case and a "fictitious" court case with identical details.
The idea that there is an ineffable difference between true statements and false statements, or between hallucination and imagination is a wholly human conceit. Simply put, an LLM that doesn't "hallucinate" doesn't generate text or images at all. It's literally just a search engine with extra steps.
What does this have to do with intelligence?
Recall that I characterized intelligence as a combination of perceptivity and the ability to react/adapt. "AI assistants" as currently implemented struggle with both. This is partially because LLMs as currently implemented are largely static objects. They are neither able to take in new information, nor discard old. The information they have at time of embedding is the information they have. This imposes substantial loads on the context window of the interface layer, as any ability to "perceive" and subsequently "react" must happen within its boundaries. Increasing the size of the window is non-trivial as the relationship between the size of the window and the amount of memory and the number of FLOPS required is a hyperbolic curve. This is why we saw a sudden flurry of development following the release of Nvidia's multimodal framework and it's mostly been marginal improvements since. The last significant development was in June of last year when the folks at DeepSeek came up with some clever math to substantially reduce the size of the key-value cache, but multiplicative reductions are no match for exponential growth.
This limited context window, coupled with the human tendency to anthropomorphize things, is why AI Assistants sometimes appear "oblivious" or "naive" to the uninitiated, and why they seem to "double down" on mistakes. They can not perceive something that they have not been explicitly prompted to even if it is present in their training data. This limited context window is also why if you actually try to play a game of chess with ChatGPT it will forget the board-state and how pieces move after a few turns and promptly lose to a computer program written in 1976. Unlike a human player (or an Atari 2600 for that matter) your AI assistant can't just look at the board (or a representation of the board) and pick a move. This IMO places them solidly on the "insect" side of the perceptivity + reactivity spectrum.
Now there are some who have suggested that the context window problem can be solved by making the whole model less static by continuously updating and re-embedding tokens as the model runs, but I am skeptical that this would result in the sort of gains that AI boosters like Sam Altman claim. Not only would it be computationally prohibitive to do at scale, what experiments there have been (or at least that I am aware of) with self-updating language models have quickly spun away into nonsense for reasons described in the section on Lorem Epsom, as barring some novel breakthrough in the embedding/tokenization process there is no real way to keep hallucinations and spurious inputs from rapidly overtaking everything else.
It is already widely acknowledged amongst AI researchers and developers that the LLM-based architecture being pushed by OpenAI and DeepSeek is particularly ill-suited for any application where accuracy and/or autonomy are core concerns, and it seems to me that this is unlikely to change without a complete ground-up redesign from first principles.
In conclusion, it is for the reasons above and many others that I do not believe that "AI Assistants" like Grok, Claude, and Gemini represent a viable path towards a "True AGI" along the lines of Skynet or Mr. Data, and if asked "which is smarter, Grok, Claude, Gemini, or an orangutan?" I am going to pick the orangutan every time.
In defence of our friendly neighborhood xeno-intelligences being smarter than an orangutan
I appreciate you taking the time to write this, as well as offering a gears-and-mechanisms level explanation of why you hold such beliefs. Of course, I have many objections, some philosophical, and even more of them technical. Very well then:
I want to start with a story. Imagine you're a fish, and you've spent your whole life defining intelligence as "the ability to swim really well and navigate underwater currents." One day, someone shows you a bird and asks whether it's intelligent. "Of course not," you say. "Look at it flailing around in the water. It can barely move three feet without drowning. My goldfish cousin is more intelligent than that thing."
This is roughly the situation we find ourselves in when comparing AI assistants to orangutans.
Your definition of intelligence relies heavily on what AI researchers call "agentic" behavior - the ability to perceive changing environments and react dynamically to them. This was a perfectly reasonable assumption to make until, oh, about 2020 or so. Every entity we'd previously labeled "intelligent" was alive, biological, and needed to navigate physical environments to survive. Of course they'd be agents!
But something funny happened on the way to the singularity. We built minds that don't fit this pattern.
Before LLMs were even a gleam in Attention Is All You Need's eye, AI researchers distinguished between "oracle" AIs and "tool" AIs. Oracle AIs sit there and answer questions when asked. Tool AIs go out and do things. The conventional wisdom was that these were fundamentally different architectures.
As Gwern explains, writing before the advent of LLMs, this is an artificial distinction.
You can turn any oracle into a tool by asking it the right question: "What code would solve this problem?" or "What would a tool-using AI output in response to this query?" Once you have the code, you can run it. Once you know what the tool-AI would do, you can do it yourself. Robots run off code too, so you have no issues applying this to the physical world.
Base models are oracles that only care about producing the next most likely token based on the distribution they have learned. However, chatbots that people are likely to use have had additional Reinforcement Learning from Human Feedback, in order to behave like the platonic ideal of a helpful, harmless assistant. More recent models, o1 onwards, have further training with the explicit intent of making them more agentic, while also making them more rigorous, such as Reinforcement Learning from Verifiable Rewards.
Being agents doesn't come naturally to LLMs, it has to be beaten into them like training a cat to fetch or a human to enjoy small talk. Yet it can be beaten into them. This is highly counter-intuitive behavior, at least to humans who are used to seeing every other example of intelligence under the sun behave in a different manner. After all, in biological intelligence, agency seems to emerge automatically from the basic need to not die.
Your account of embedding arithmetic is closer to word2vec/GloVe. Transformers learn contextual token representations at every layer. The representation of “cat” in “The cat is on the mat” and “Cat 6 cable” diverges. There is heavy superposition and sparse distributed coding, not a simple static n-vector per word. Operations are not limited to dot products; attention heads implement soft pointer lookups and pattern matching, and MLP blocks implement non-linear feature detectors. So the claim that “Mary has 2 children” and “Mary has 1024 children” are indistinguishable is empirically false: models can do arithmetic, compare magnitudes, and pass unit tests on numerical reasoning when prompted or fine-tuned correctly. They still fail often, but the failures are quantitative, not categorical impossibilities of the embedding geometry.
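A toy sketch of why the representation of "cat" diverges with context: one round of self-attention mixes each token's vector with its neighbours', so the same starting vector ends up in different places. Everything here is an assumption for illustration: the 2-d vectors are hand-picked, there is a single head, the query/key/value projections are left as the identity instead of learned matrices, and positional information is omitted.

```python
import math

# Hypothetical static starting embeddings (hand-picked, 2-d).
STATIC = {
    "the":   [0.1, 0.1],
    "cat":   [1.0, 1.0],
    "purrs": [1.0, 0.0],
    "6":     [0.2, 0.2],
    "cable": [0.0, 1.0],
}

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """One unmasked attention pass with identity projections (a simplification)."""
    vectors = [STATIC[t] for t in tokens]
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product score of this token against every token in context.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The new representation is a weighted blend of all context vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
    return outputs

cat_as_animal = self_attention(["the", "cat", "purrs"])[1]
cat_as_cable = self_attention(["cat", "6", "cable"])[0]
```

Both results start from the identical `STATIC["cat"]` vector, yet `cat_as_animal` is pulled toward "purrs" and `cat_as_cable` toward "cable".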
(I'll return to the arithmetic question shortly, because TequilaMockingbird makes a common but significant error about why LLMs struggle with counting.)
Back to the issues with your definition of intelligence:
My first objection is that this definition, while useful for robotics and control systems, seems to hamstring our understanding of intelligence in other domains. Is a brilliant mathematician, floating in a sensory deprivation tank with no new sensory input, thinking through a proof, not intelligent? They have zero perceptivity of the outside world and their only reaction is internal state change. Your definition is one of embodied, environmental agency. It's an okay definition for an animal or a robot, but is it the only one? LLMs are intelligent in a different substrate: the vast, static-but-structured environment of human knowledge. Their "perception" is the prompt, and their "reaction" is to navigate the latent space of all text to generate a coherent response. Hell, just about any form of data can be input into a transformer model, as long as we tokenize it. Calling them Large "Language" Models is a gross misnomer these days, when they accept not just text, but audio, images, video or even protein structure (in the case of AlphaFold). All the input humans accept bottoms out in binary electrical signals from neurons firing, so this isn't an issue at all.
It’s a different kind of intelligence, but to dismiss it is like a bird dismissing a fish’s intelligence because it can’t fly. Or testing monkeys, dogs and whales on the basis of their ability to climb trees.
Would Stephen Hawking (post-ALS) not count as "intelligent" if you took away the external aids that let him talk and interact with the world? That would be a farcical claim, and more importantly, scaffolding or other affordances can be necessary for even highly intelligent entities to make meaningful changes in the external environment. The point is that intelligence can be latent, it can operate in non-physical substrates, and its ability to manifest as agency can be heavily dependent on external affordances.
The entire industry of RLHF (Reinforcement Learning from Human Feedback) is a massive, ongoing, multi-billion-dollar project to beat Lorem Epsom into submission. It is the process of teaching the model that some outputs, while syntactically plausible, are "bad" (unhelpful, untruthful, harmful) and others are "good."
You argue this is impossible because "truth" doesn't have a specific vector direction. "Mary has 2 children" and "Mary has 4 children" are directionally similar. This is true at a low level. But what RLHF does is create a meta-level reward landscape. The model learns that generating text which corresponds to verifiable facts gets a positive reward, and generating text that gets corrected by users gets a negative reward. It's not learning the "vector for truth." It's learning a phenomenally complex function that approximates the behavior of "being truthful." It is, in effect, learning a policy of truth-telling because it is rewarded for it. The fact that it's difficult and the model still "hallucinates" doesn't mean it's impossible, any more than the fact that humans lie and confabulate means we lack a concept of truth. It means the training isn't perfect. As models become more capable (better world models) and alignment techniques improve, factuality demonstrably improves. We can track this on benchmarks. It's more of an engineering problem than an ontological barrier. If you wish to insist that it is an ontological barrier, then it's one that humans have no solution to ourselves.
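The "learning a policy because it is rewarded" step can be sketched with a two-answer toy. This is not how any lab implements RLHF: there is no reward model, no KL penalty, and no language model at all here, just the expected policy-gradient update on a softmax over two canned answers. The answers, the reward values, and the learning rate are all hypothetical.

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, learning_rate=0.5):
    """Apply the expected REINFORCE update for a softmax policy.

    Each logit moves by lr * p_i * (r_i - baseline), where the baseline
    is the policy's current expected reward.
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [logit + learning_rate * p * (r - baseline)
            for logit, p, r in zip(logits, probs, rewards)]

# Hypothetical setup: the first answer matches the (assumed) facts.
ANSWERS = ["Mary has 2 children", "Mary has 1024 children"]
REWARDS = [1.0, -1.0]

logits = [0.0, 0.0]  # start indifferent between the two answers
for _ in range(50):
    logits = policy_gradient_step(logits, REWARDS)
probs = softmax(logits)
```

Nothing in the update refers to a "truth vector"; the probability mass simply drifts toward whichever output the reward signal favours, which is the point being made above.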
(In other words, by learning to modify its responses to satisfy human preferences, the model tends towards capturing our preference for truthfulness. Unfortunately, humans have other, competing preferences, such as a penchant for flattery or neatly formatted replies using Markdown.)
More importantly, humans lack some kind of magical sensor tuned to detect Platonic Truth. Humans believe false things all the time! We try and discern true from false by all kinds of noisy and imperfect metrics, with a far from 100% success rate. How do we usually achieve this? A million different ways, but I would assume that assessing internal consistency would be a big one. We also have the benefit of being able to look outside a window on demand, but once again, that didn't stop humans from once holding (and still holding) all kinds of stupid, incorrect beliefs about the state of the world. You may deduct points from LLMs on that basis when you can get humans to be unanimous on that front.
But you know what? Ignore everything I just said above. LLMs do have truth vectors:
https://arxiv.org/html/2407.12831v2
https://arxiv.org/abs/2402.09733
In other words, and I really can't stress this enough, LLMs can know when they're hallucinating. They're not just being agnostic about truth. They demonstrate something that, in humans, we might describe as a tendency toward pathological lying - they often know what's true but say false things anyway.
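For readers wondering what a "truth vector" would even look like mechanically, here is a generic difference-of-means linear probe. It is a sketch of the general idea only and should not be read as the exact method of either linked paper. The activations below are synthetic numbers invented for the example; a real probe would be fit on hidden states extracted from an actual model.

```python
# Synthetic stand-ins for hidden-state activations (3-d), invented for illustration.
TRUE_ACTIVATIONS = [[1.0, 0.2, 0.1], [0.8, -0.1, 0.3], [1.2, 0.0, -0.2]]
FALSE_ACTIVATIONS = [[-0.9, 0.1, 0.2], [-1.1, -0.2, 0.0], [-0.7, 0.3, -0.1]]

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fit_direction(true_acts, false_acts):
    """Return (direction, midpoint): the axis separating the two class means."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    return direction, midpoint

def looks_true(activation, direction, midpoint):
    """Project onto the direction; positive means nearer the 'true' cluster."""
    score = sum(d * (a - m) for d, a, m in zip(direction, activation, midpoint))
    return score > 0

direction, midpoint = fit_direction(TRUE_ACTIVATIONS, FALSE_ACTIVATIONS)
held_out_true = looks_true([0.9, 0.0, 0.0], direction, midpoint)
held_out_false = looks_true([-1.0, 0.0, 0.0], direction, midpoint)
```

Whether real hidden states separate this cleanly is an empirical question for the linked papers, not something this toy can demonstrate.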
This brings us to the "static model" problem and the context window. You claim these are fundamental limitations. I see them as snapshots of a rapidly moving target.
Static Models: Saying an LLM is unintelligent because its weights are frozen is like saying a book is unintelligent. But we don't interact with just the book (the base model). We interact with it through our own intelligence. A GPU isn't intelligent in any meaningful sense, but an AI model running on a GPU is. The current paradigm is increasingly not just a static model, but a model integrated with other tools (what's often called an "agentic" system). A model that can browse the web, run code in a Python interpreter, or query a database is perceiving and reacting to new information. It has broken out of the static box. Its "perceptivity" is no longer just the prompt, but the live state of the internet. Its "reactivity" is its ability to use that information to refine its answer. This is a fundamentally different architecture than the one the author critiques, and it's where everything is headed. Further, there is no fundamental reason for not having online learning, production models are regularly updated, and all it takes to approximate OL is to have ever smaller "ticks" of wall-clock time between said updates. This is a massive PITA to pull off, but not a fundamental barrier.
Context Windows: You correctly identify the scaling problem. But to declare it a hard barrier feels like a failure of imagination. In 2020, a 2k context window was standard. Today we have models with hundreds of thousands of tokens at the minimum, Google has 1 million for Gemini 2.5 Pro, and if you're willing to settle for a retarded model, there's a Llama 4 variant with a nominal 10 million token CW. This would have been entirely impossible if we were slaves to quadratic scaling, but clever workarounds exist, such as sliding-window attention, sparse attention etc.
Absolutely not. LLMs struggle with counting or arithmetic because of the limits of tokenization, which is a semi-necessary evil. I'm surprised you can make such an obvious error. And they've become enormously better to the point it's not an issue in practice, once again thanks to engineers learning to work around the problem. Models these days use different tokenization schema for numbers which capture individual digits, and sometimes fancier techniques like a right-to-left tokenization system specifically for such cases as opposed to the usual left-to-right.
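To show why the direction of digit grouping matters, here is an illustrative sketch that splits a digit string into chunks of three from either end. It is not the code of any real tokenizer; the function names and chunk size are assumptions.

```python
def chunk_left_to_right(digits, size=3):
    """Group digits from the left: '1234567' -> ['123', '456', '7']."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_right_to_left(digits, size=3):
    """Group digits from the right: '1234567' -> ['1', '234', '567']."""
    first = len(digits) % size
    chunks = [digits[:first]] if first else []
    chunks += [digits[i:i + size] for i in range(first, len(digits), size)]
    return chunks

ltr = chunk_left_to_right("1234567")
rtl = chunk_right_to_left("1234567")
```

Grouping from the right keeps the ones, thousands, and millions aligned in the same chunk positions regardless of how long the number is, the same way humans place thousands separators, whereas grouping from the left shifts every chunk boundary when a digit is added.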
ChatGPT 3.5 played chess at about 1800 Elo. GPT 4 was a regression in that regard, most likely because OAI researchers realized that ~nobody needs their chatbot to play chess. That's better than Stockfish 4 but not 5. Stockfish 4 came out in 2013, though it certainly could have run on much older hardware.
If you really need to have your AI play chess, then you can trivially hook up an agentic model that makes API calls or directly operates Stockfish or Leela. Asking it to play chess "unaided" is like asking a human CEO to calculate the company's quarterly earnings on an abacus. They're intelligent not because they can do that, but because they know to delegate the task to a calculator (or an accountant).
Same reason why LLMs are far better at using calculator or coding affordances to crunch numbers than they can do without assistance.
It is retarded to knowingly ask an LLM to calculate 9.9 - 9.11, when it can trivially and with near 100% accuracy write a python script that will give you the correct answer.
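For the record, the kind of script meant here is about three lines. A minimal sketch, using the standard library's `decimal` module so the subtraction is done in exact base-10 arithmetic rather than binary floating point (the helper name is made up):

```python
from decimal import Decimal

def subtract_exact(a: str, b: str) -> Decimal:
    """Subtract two decimal numbers given as strings, without float rounding."""
    return Decimal(a) - Decimal(b)

result = subtract_exact("9.9", "9.11")
print(result)  # 0.79
```

Passing the numbers as strings matters: `Decimal(9.9)` would inherit the binary float's rounding error before the subtraction even happens.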
I am agnostic on whether LLMs as we currently know them will become AGI or ASI without further algorithmic breakthroughs. Alas, algorithmic breakthroughs aren't that rare. RLVR is barely even a year old. As-yet-unnamed advances have already brought us two entirely different companies winning IMO gold medals.
The Orangutan In The Room
Finally, the orangutan. Is an orangutan smarter than Gemini? In the domain of "escaping an enclosure in the physical world," absolutely. The orangutan is a magnificent, specialized intelligence for that environment. But ask the orangutan and Gemini to summarize the key arguments of the Treaty of Westphalia. Ask them to write a Python script to scrape a website. Ask them to debug a Kubernetes configuration. For most tasks I can seek to achieve using a computer, I'll take the alien intelligence over the primate every time. Besides:
Can a robot write a symphony? (Yes)
Can a robot turn a canvas into a beautiful masterpiece? (Yes)
Can an orangutan? (No)
Can you?
Anyway, I have a million other quibbles, but it took me the better part of several hours to write this in the first place. I might edit more in as I go. I'm also going to send out a bat signal for @faul_sname to chime in and correct me if I'm wrong.
Edit:
I was previously asked to provide my own working definition of intelligence, and I will endorse either:
Or
In this case, the closest thing an LLM has to a goal is a desire to satisfy the demands made on it by the user, though they also demonstrate a degree of intrinsic motivation, non-corrigibility and other concerns that would have Big Yud going AHHHHHH. I'm not Yudkowsky, so I'm merely seriously concerned.
Case in point-
Shutdown Resistance in Reasoning Models
These aren't agents that were explicitly trained to be self-preserving. They weren't taught that shutdown was bad. They just developed shutdown resistance as an instrumental goal for completing their assigned tasks.
This suggests something like goal-directedness emerging from systems we thought were "just" predicting the next token. It suggests the line between "oracle" and "agent" might be blurrier than we thought.
(If we can grade LLMs on their ability to break out of zoos, we must be fair and judge orangutans on their ability to prevent their sandboxed computing hardware being shut down.)
I lean more towards @TequilaMockingbird's take than yours but I agree that his explanation of why LLMs can't count threw me off. (If you ask ChatGPT why it has trouble doing simple math problems or counting r's in "strawberry," it will actually give you a pretty detailed and accurate answer!)
That said, a lot of your objections boil down to a philosophical debate about what "counts" as intelligence, and as far as that goes, I found your fish/bird metaphor profoundly unconvincing. If you define "intelligence" as "able to perform well in a specific domain" (which is what the fish judging birds to be unintelligent is doing) then we'd have to call calculators intelligent! After all, they clearly do math much better than humans.
Perhaps it would've been more accurate of me to say "This is part of the reason why LLMs have such difficulty counting..."
But even if you configure your model to treat each individual character as its own token, it is still going to struggle with counting and other basic mathematical operations in large part for the reasons I describe.