
Is your "AI Assistant" smarter than an Orangutan? A practical engineering assessment

At the risk of doxxing myself, I have an advanced degree in Applied Mathematics. I have authored and contributed to multiple published papers, and hold a US patent, all related to the use of machine learning in robotics and digital signal processing. I am currently employed as a supervising engineer at a prominent tech company. For pseudonymity's sake I am not going to say which, but it is a name you would recognize. I say this not to brag, but to establish some context for what follows.

Imagine that you are someone who is deeply interested in space flight. You spend hours of your day thinking seriously about Orbital Mechanics and the implications of Relativity. One day you hear about a community devoted to discussing space travel and are excited at the prospect of participating. But when you get there, what you find is a Star Trek fan-forum that is far more interested in talking about the Heisenberg compensators on fictional warp drives than in Hohmann transfers, thrust-to-ISP curves, or the effects of low gravity on human physiology. That has essentially been my experience trying to discuss "Artificial Intelligence" with the rationalist community.

However, at the behest of users such as @ArjinFerman and @07mk, and because X/Grok is once again in the news, I am going to take another stab at this.

Are "AI assistants" like Grok, Claude, Gemini, and DeepSeek intelligent?

I would say no, and in this post I am going to try to explain why, but to do so requires a discussion of what I think "intelligence" is and how LLMs work.

What is Intelligence
People have been philosophizing on the nature of intelligence for millennia, but for the purposes of our exercise (and my work) "intelligence" is a combination of perceptivity and reactivity. That is to say, the ability to perceive or take in new and/or changing information, combined with the ability to change state based on that information. Both are necessary, and neither is sufficient on its own. This is why Mathematicians and Computer Scientists often emphasize the use of terms like "Machine Learning" over "Artificial Intelligence": an algorithm's behavior is almost never both.

If this definition feels unintuitive, consider it in the context of the following example. What I am saying is that an orangutan who waits until the Zookeeper is absent to use a tool to force the lock on its enclosure is more "intelligent" than the insect that repeatedly throws itself against your kitchen window in an attempt to get outside. They share an identical goal (to get outside), but the orangutan has demonstrated the ability to both perceive obstacles (i.e. the lock and the Zookeeper) and react dynamically to them in a way that the insect has not. Now obviously these qualities exist on a spectrum (try to swat a fly and it will react), but the combination of these two parameters defines an axis along which we can evaluate both animals and algorithms, and as any good PM will tell you, the first step in solving any practical engineering problem is to identify your parameters.

Now the most common arguments for AI assistants like Grok being intelligent tend to be some variation on "Grok answered my question, ergo Grok is intelligent" or "Look at this paragraph Claude wrote, do you think you could do better?", but when evaluated against the above parameters, the ability to form grammatically correct sentences and the ability to answer questions are both orthogonal to intelligence. An orangutan and a moth may be equally incapable of writing a Substack, but I don't expect anyone here to seriously argue that they are equally intelligent. By the same token, a pocket calculator can answer questions, "what is the square root of 529?" being one example, but we don't typically think of pocket calculators as "intelligent", do we?

To me, these sorts of arguments betray a significant anthropomorphic bias: the assumption that anything a human finds complex or difficult must be computationally complex, and vice versa. The truth is often the inverse. This bias leads people who do not have a background in math or computer science to have completely unrealistic impressions of what sort of things are easy or difficult for a machine to do. For example, vector and matrix operations are reasonably simple for a computer, yet a lot of human students struggle with them. Meanwhile, bipedal locomotion is something most humans do without even thinking, despite it being far more computationally complex and error-prone than computing a cross product.
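For the programmers in the audience, here is roughly what the "easy for the machine, hard for the student" side of that comparison looks like; a trivial sketch using NumPy (my choice of library for illustration, not something the comparison depends on):

```python
import numpy as np

# A cross product is a handful of multiplications and subtractions:
# trivial for a machine, no balance feedback loops or contact dynamics required.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
print(np.cross(a, b))  # -> [0. 0. 1.]
```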

Speaking of vector operations, let's talk about how LLMs work...

What are LLMs
LLM stands for "Large Language Model". These models are a subset of artificial neural networks that use "Deep Learning" (essentially a fancy marketing buzzword for the combination of looping regression analysis with back-propagation) to encode a semantic token such as the word "cat" as an n-dimensional vector representing that token's relationship to the rest of the tokens in the training data. Now in actual practice these tokens can be anything, an image, an audio clip, or a snippet of computer code, but for the purposes of this discussion I am going to assume that we are working with words/text. This process is referred to as "embedding", and what it does in effect is turn the word "cat" into something that a computer (or grad student) can perform mathematical operations on. Any operation you might perform on a vector (addition, subtraction, transformation, matrix multiplication, etc...) can now be done on "cat".
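To make "embedding" concrete, here is a toy sketch. The vectors below are made up and four-dimensional, where a real model's are learned and have hundreds or thousands of dimensions, but the point is the same: once "cat" is a vector, ordinary linear algebra applies to it.

```python
import numpy as np

# Toy embedding table; a real model learns these vectors from the training data.
embedding = {
    "cat":    np.array([0.90, 0.10, 0.80, 0.20]),
    "feline": np.array([0.85, 0.15, 0.75, 0.25]),
    "car":    np.array([0.10, 0.90, 0.20, 0.70]),
}

v_cat = embedding["cat"]
print(v_cat + embedding["car"])     # addition
print(v_cat @ embedding["feline"])  # dot product
print(np.linalg.norm(v_cat))        # magnitude
```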

Now because these vectors represent the relationships of the tokens to each other, words (and combinations of words) that have similar meanings will have vectors that are directionally aligned with each other. This has all sorts of interesting implications. For instance, you can compute the dot product of two embedded vectors to determine whether their words are synonyms, antonyms, or unrelated. This also allows you to do fun things like approximate the vector "cat" using the sum of the vectors "carnivorous", "quadruped", "mammal", and "feline", or subtract the vector "legs" from the vector "reptile" to find an approximation of the vector "snake". Please keep this concept of "directionality" in mind as it is important to understanding how LLMs behave, and it will come up later.
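In code, the dot-product test and the "vector arithmetic" above amount to something like the following sketch. Here vec() stands in for a lookup into some pretrained embedding model; it is a placeholder for illustration, not a real API.

```python
import numpy as np

def cosine(u, v):
    # ~ +1 for directionally aligned vectors (synonyms), ~0 for unrelated
    # words, negative for roughly opposed directions (antonym-ish).
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# With real pretrained embeddings (vec() being a hypothetical lookup), the
# examples above become:
#   cosine(vec("cat"),   vec("carnivorous") + vec("quadruped") + vec("mammal") + vec("feline"))
#   cosine(vec("snake"), vec("reptile") - vec("legs"))
# both of which you would expect to score far higher than, say,
#   cosine(vec("cat"), vec("carburetor"))
```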

It should come as no surprise that some of the pioneers of this methodology were also the brains behind Google Translate. You can basically take the embedded vector for "cat" from your English language model and pass it to your Spanish language model to find the vector "gato". Furthermore, because all you are really doing is summing and comparing vectors, you can do things like sum the vector "gato" in the Spanish model with the vector for the diminutive "-ito" and then pass it back to the English model to find the vector "kitten".
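The usual way to implement that cross-model hand-off is to learn a linear map between the two embedding spaces (the classic "translation matrix" trick). A minimal sketch, assuming you already have a few thousand paired English/Spanish word vectors stacked as rows:

```python
import numpy as np

def fit_translation_map(X_en, Y_es):
    """Least-squares fit of W such that W @ x_english ≈ y_spanish.
    X_en and Y_es are (n_pairs, dim) arrays of paired word vectors.
    This is a sketch of the idea, not any particular system's pipeline."""
    W, *_ = np.linalg.lstsq(X_en, Y_es, rcond=None)
    return W.T

# Usage sketch: map vec_en("cat") into the Spanish space with W @ vec_en("cat"),
# then pick the nearest Spanish word vector ("gato") by cosine similarity.
```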

Now if what I am describing does not sound like an LLM to you, that is likely because most publicly available "LLMs" are not just an LLM. They are an LLM plus an additional interface layer that sits between the user and the actual language model. An LLM on its own is little more than a tool that turns words into math, but you can combine it with a second algorithm that takes in a block of text and does some distribution analysis to compute the most probable next word. This is essentially what is happening under the hood when you type a prompt into GPT or your assistant of choice.
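A bare-bones sketch of that interface-layer loop is below. The model object here is a stand-in for anything that maps a token sequence to next-token scores; it is an assumed interface for illustration, not any vendor's actual API.

```python
import numpy as np

def generate(model, prompt_tokens, n_new, temperature=1.0, seed=0):
    """Autoregressive decoding: ask the model for a distribution over the
    next token, sample one, append it, and feed the longer sequence back in.
    `model(tokens)` returning a vector of logits is an assumed interface."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = np.asarray(model(tokens), dtype=float)   # one score per vocabulary entry
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()                              # softmax -> probability distribution
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens
```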

Our Villain Lorem Epsom, and the Hallucination Problem
I've linked the YouTube video Badness = 0 a few times in prior discussions of AI, as I find it to be both a solid introduction to LLMs for the layperson and an entertaining illustration of how anthropomorphic bias can cripple the discussion of "alignment". In it the author (who is a professor of Computer Science at Carnegie Mellon) posits a semi-demonic figure (akin to Scott Alexander's Moloch) named Lorem Epsom. The name is a play on the term Lorem Ipsum and represents the prioritization of appearance over all else. When it comes to writing, Lorem Epsom doesn't care about anything except filling the page with text that looks correct. Lorem Epsom is the kind of guy who, if you tell him that he made a mistake in the math, is liable to interpret that as a personal attack. The ideas of "accuracy", "logic", "rigor", and "objective reality" are things that Lorem Epsom has heard of but that do not concern him. It is very possible that you have had to deal with someone like Lorem Epsom in your life (I know I have). Think back and ask yourself: how did that go?

I bring up Lorem Epsom because I think that understanding him provides some insight into why certain sorts of people are so easily fooled/taken in by AI Assistants like Claude and Grok. As discussed in the section above on "What is Intelligence", the assumption that the ability to fill a page with text indicates the ability to perceive and react to a changing situation is an example of anthropomorphic bias. I think that because they are posing their question to a computer, a lot of people expect the answer they get to be something analogous to what they would get from a pocket calculator, when what they actually get is closer to what they would get from Lorem Epsom.

Sometime circa 2014 I kicked off a heated dispute in the comment section of a LessWrong post by asking EY why a paperclip-maximizing AI that was capable of self-modification wouldn't just modify the number of paperclips in its memory. I was accused by him and a number of others of missing the point, but I think they missed mine. The assumption that an Artificial Intelligence would not only have a notion of "truth" but assign value to it is another example of anthropomorphic bias. If you asked Lorem Epsom to maximize the number of paperclips, and he could theoretically "make" a billion-trillion paperclips simply by manipulating a few bits, why wouldn't he? It's so much easier than cutting and bending wire.

In order to align an AI to care about truth and accuracy you first need a means of assessing and encoding truth and it turns out that this is a very difficult problem within the context of LLMs, bordering on mathematically impossible. Do you recall how LLMs encode meaning as a direction in n-dimensional space? I told you it was going to come up again.

Directionally speaking, we may be able to determine that "true" is an antonym of "false" by computing their dot product, but this is not the same thing as being able to evaluate whether a statement is true or false. As an example, "Mary has 2 children", "Mary has 4 children", and "Mary has 1024 children" may as well be identical statements from the perspective of an LLM. Mary has a number of children; that number is a power of 2. Now, if the folks programming the interface layer were clever, they might have it do something like estimate the most probable number of children based on the training data, but the number simply cannot matter to the LLM the way it might matter to Mary, or to someone trying to figure out how many pizzas they ought to order for the family reunion, because the "directionality" of one positive integer isn't all that different from that of any other. (This is why LLMs have such difficulty counting, if you were wondering.)
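If you want to check this against an actual model, the experiment is a few lines. Here embed() is a stand-in for whatever sentence-embedding model you have on hand (an assumption, not a real function); the claim under test is that the pairwise similarities come out nearly identical regardless of the number.

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

sentences = ["Mary has 2 children.",
             "Mary has 4 children.",
             "Mary has 1024 children."]

# embed(sentence) -> vector is assumed to come from some embedding model.
# The claim under test: swapping the count barely moves the vector, e.g.
#   cosine(embed(sentences[0]), embed(sentences[2]))  # expected to be close to 1.0
```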

In addition to difficulty with numbers there is the more fundamental issue that directionality does not encode reality. The directionality of the statement "Donald Trump is the 47th President of the United States" would be identical regardless of whether Donald Trump won or lost the 2024 election. Directionally speaking there is no difference between a "real" court case and a "fictitious" court case with identical details.

The idea that there is an ineffable difference between true statements and false statements, or between hallucination and imagination, is wholly a human conceit. Simply put, an LLM that doesn't "hallucinate" doesn't generate text or images at all. It's literally just a search engine with extra steps.

What does this have to do with intelligence?
Recall that I characterized intelligence as a combination of perceptivity and the ability to react/adapt. "AI assistants" as currently implemented struggle with both. This is partially because LLMs as currently implemented are largely static objects. They are neither able to take in new information nor discard old. The information they have at the time of embedding is the information they have. This imposes substantial load on the context window of the interface layer, as any ability to "perceive" and subsequently "react" must happen within its boundaries. Increasing the size of the window is non-trivial: the memory needed for the key-value cache grows linearly with the window, while the number of FLOPS the attention computation requires grows quadratically with it. This is why we saw a sudden flurry of development following the release of Nvidia's multimodal framework, and it's mostly been marginal improvements since. The last significant development was in June of last year, when the folks at DeepSeek came up with some clever math (their multi-head latent attention) to substantially reduce the size of the key-value cache, but multiplicative reductions are no match for quadratic growth.
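To put rough numbers on that scaling, here is a back-of-the-envelope sketch using the standard formulas for a vanilla transformer; the model dimensions below are made up purely for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # One key vector and one value vector cached per token, per layer, per KV head.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

def attention_flops(seq_len, n_layers, d_model):
    # Rough count of the QK^T and attention-times-V matrix multiplies alone;
    # both scale with the square of the sequence length.
    return n_layers * 4 * seq_len * seq_len * d_model

# Made-up model dimensions, purely to show the growth rates.
for n in (4_096, 32_768, 131_072):
    gb = kv_cache_bytes(n, n_layers=32, n_kv_heads=8, head_dim=128) / 1e9
    tflops = attention_flops(n, n_layers=32, d_model=4_096) / 1e12
    print(f"context {n:>7}: ~{gb:6.2f} GB KV cache, ~{tflops:10.1f} TFLOPs of attention")
```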

This limited context window, coupled with the human tendency to anthropomorphize things, is why AI Assistants sometimes appear "oblivious" or "naive" to the uninitiated, and why they seem to "double down" on mistakes. They cannot perceive something that they have not been explicitly prompted with, even if it is present in their training data. This limited context window is also why, if you actually try to play a game of chess with ChatGPT, it will forget the board state and how the pieces move after a few turns and promptly lose to a computer program written in 1976. Unlike a human player (or an Atari 2600, for that matter), your AI assistant can't just look at the board (or a representation of the board) and pick a move. This IMO places them solidly on the "insect" side of the perceptivity + reactivity spectrum.

Now there are some who have suggested that the context window problem can be solved by making the whole model less static, continuously updating and re-embedding tokens as the model runs, but I am skeptical that this would result in the sort of gains that AI boosters like Sam Altman claim. Not only would it be computationally prohibitive to do at scale, but what experiments there have been with self-updating language models (or at least the ones I am aware of) have quickly spun away into nonsense, for the reasons described in the section on Lorem Epsom: barring some novel breakthrough in the embedding/tokenization process, there is no real way to keep hallucinations and spurious inputs from rapidly overtaking everything else.

It is already widely acknowledged amongst AI researchers and developers that the LLM-based architecture being pushed by OpenAI and DeepSeek is particularly ill-suited for any application where accuracy and/or autonomy are core concerns, and it seems to me that this is unlikely to change without a complete ground-up redesign from first principles.

In conclusion, it is for the reasons above and many others that I do not believe that "AI Assistants" like Grok, Claude, and Gemini represent a viable path towards a "True AGI" along the lines of Skynet or Mr. Data, and if asked "which is smarter, Grok, Claude, Gemini, or an orangutan?" I am going to pick the orangutan every time.


Having no interest to get into a pissing context^W contest, I'll only disclose I've contributed to several DL R&D projects of this era.

This is the sort of text I genuinely prefer LLM outputs to, because with them, there are clear patterns of slop to dismiss. Here, I am compelled to wade through it manually. It has the trappings of a sound argument, but amounts to epistemically inept, reductionist, irritated huffing and puffing with an attempt to ride on (irrelevant) credentials and dismiss the body of discourse the author had found beneath his dignity to get familiar with, clearly having deep contempt for people working and publishing in the field (presumably ML researchers don't have degrees in mathematics or CS). Do you even believe you've said anything more substantial than “I don't like LLMs” in the end? A motivated layman definition of intelligence (not even citing Chollet or Hutter? Seriously?), a psychologizing strawman of arguments in favor of LLM intelligence, an infodump on embedding arithmetic (flawed, as already noted), random coquettish sneers and personal history, and arrogant insistence that users are getting "fooled" by LLMs producing the "appearance" of valid outputs, rather than, say, novel functioning programs matching specs (the self-evident utility of LLMs in this niche is completely sidestepped), complete with inane analogies to non-cognitive work or routine one-off tasks like calculation. Then some sloppy musings on current limitations regarding in-context learning and lifelong learning or whatever (believe me, there's a great deal of work in this direction). What was this supposed to achieve?

In 2019, Chollet published On the Measure of Intelligence, where he proposed the following definition: “The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.” It's not far from yours, because frankly it's intuitive. Starting from this idea and aiming to test fluid thinking specifically, Chollet also proposed the ARC-AGI benchmark, which for the longest time was so impossibly hard for DL systems (and specifically LLMs) that many took that as evidence for the need to do a “complete ground-up redesign from first principles” to make any headway. o3 was the first LLM to truly challenge this; Chollet coped by arguing that o3 is doing something beyond DL, some “guided program synthesis” he covets. From what we know, it just autoregressively samples many CoTs in parallel and uses a simple learned function to nominate the best one. As of now, it's clearly going to be saturated within 2 years, as is ARC-AGI 2, and we're on ARC-AGI 3, with costs per problem solved plummeting. Neither 1 nor 3 is possible to ace for an orangutan, or indeed for a human of below-average intelligence. Similar things are happening to “Humanity's Last Exam”. Let's say it's highly improbable at this point that any “complete ground-up redesign from first principles” will be necessary. The Transformer architecture is rather simple and general; making it cheaper to train and run inference on without deviating from the core idea of “a stack of MLPs + expressive learned mixers” is routine, and virtually all progress is achieved by means of better data – not just “cleaner” or “more”, but procedural data, predicting which necessitates learning generally useful mental skills. Self-verification, self-correction, backtracking, iteration, and now tool use, search, soliciting multi-agent assistance (I recommend reading the Kimi K2 report, section 3.1.1, for a small sliver of an idea of what that entails). Assembling necessary cognitive machines in context. This is intelligence, so poorly evidenced in your texts.

In order to align an AI to care about truth and accuracy you first need a means of assessing and encoding truth and it turns out that this is a very difficult problem within the context of LLMs, bordering on mathematically impossible.

We are not in 2013 anymore, nor on LessWrong, to talk of this so abstractly and glibly. "Reptile - legs = snake" just isn't an adequate level of understanding from which to explain the behaviors of LLMs; this fares no better than dismissing hydrology (or neuroscience, for that matter) as mere applied quantum mechanics with marketing buzzwords. Here's an example of a relevant, epistemically serious 2025 paper, "The Geometry of Self-Verification in a Task-Specific Reasoning Model":

We apply DeepSeek R1-Zero’s setup with Qwen2.5-3B as our base model (Hyperparams: Appx. A). Our task, CountDown, is a simple testbed frequently used to study recent reasoning models [9, 10, 32, 39] – given a set of 3 or 4 operands (e.g., 19, 36, 55, 7) and target number (e.g., 65), the task is to find the right arithmetic combination of the operands to reach the target number (i.e., 55 + 36 - 7 - 19). […] The model is given two rewards: accuracy reward for reaching the correct final answer, and a format reward when it generates its CoT tokens in between “” and “” tokens. […] Once we score each previous-token head using Eq. 8, we incrementally ablate one head at a time until we achieve perfect intervention scores (Section 4.4). Using this approach, we identify as few as three attention heads that can disable model verification. We notate this subset as AVerif. To summarize, we claim that the model has subspace(s) (polytope(s)), SGLUValid, for self-verification. The model’s hidden state enters this subspace when it has verified its solution. In our setting, given the nature of our task, previous-token heads APrev take the hidden-state into this subspace, while for other tasks, different components may be used. This subspace also activates verification-related GLU weights, promoting the likelihood of tokens such as “success” to be predicted (Figure 3). […] For “non-reasoning” models, researchers have studied “truthful” representations before [4], where steering towards a “truthful” direction has led to improvements in tasks related to factual recall [17]. In a similar vein, researchers have shown that the model’s representations can reveal whether they will make errors (e.g., hallucinations) [28], or when they are unable to recall facts about an entity [8]. Most recently, concurrent work [37, 41] also investigate how models solve reasoning tasks. [41] find that models know when they have reached a solution, while [37] decode directions that mediate behaviors such as handling uncertainty or self-corrections. While our work corroborates these findings, we take a deeper dive into how a reasoning model verifies its own reasoning trace. Circuit Analysis. A growing line of work decomposes the forward pass of a neural network as “circuits” [24], or computational graphs. This allows researchers to identify key components and their causal effects for a given forward pass. A common approach to construct computational graphs is to replace model components with dense activations with a sparsely-activating approximation. [6] introduces Transcoders to approximate MLP layers, while [1] further develops Cross-layer Transcoders to handle inter-layer features. [18] uses Cross-layer Transcoders to conduct circuit analyses for a wide range of behaviors, such as multi-step reasoning (for factual recall) or addition, and also investigate when a model’s CoT is (un)faithful…

The point of this citation is to drive home that any “first principles” dismissal of LLMs is as ignorant as, or indeed more ignorant than, the sci-fi speculation of laymen. In short, you suck, and you should learn humility and do better at corroborating your very salient claim to authority.

There are good criticisms of LLMs. I don't know if you find Terence Tao's understanding of mathematics sufficiently grounded; he's Chinese after all. He has some skepticism about LLMs contributing to deep, frontier mathematical research. Try to do more of that.

Your contributions on AI are always interesting and worth reading (not that I agree with them, but I enjoy reading them). But as much as moderation here has been accused of running on the principle "Anything is okay as long as you use enough words," it did not escape me that you used a lot of words to basically say "Jane, you ignorant slut!" No, burying the insults (repeated) under a lot of words does not make it okay to be this belligerent. And on a topic that should not require this much emotional investment. Your lack of chill is a you problem, but your lack of civility is a Motte problem. You do not win the argument by plastering as much condescension and disdain as you can between links.

No. This is, however, exactly what OP is doing, only he goes to greater lengths to obfuscate it, to the point that he fails to sneak in an actual argument. It's just words. I am smart (muh creds), others are dumb (not math creds), they're naive and get fooled because they're dumb and anthropomorphise, here are some musings on animals (I still don't see what specific cognitive achievement an orangutan can boast of, as OP doesn't bother with this), here's something about embeddings, now please pretend I've said anything persuasive about LLM intelligence. That's the worst genre of post that this forum has to offer; it's narcissistic and time-wasting. We've had the same issue with Hlynka: some people just feel that they're entitled to post gibberish on why LLMs must be unintelligent, and they endeavor to support this by citing a background in math while failing to state any legible connection between their (ostensible) mathematically informed beliefs and their beliefs re LLMs. I am not sure if they're just cognitively biased in some manner or if it's their ego getting in the way. It is what it is.

Like, what is this? OP smirks as he develops this theme, so presumably he believes it to be load-bearing:

[…] Please keep this concept of "directionality" in mind as it is important to understanding how LLMs behave, and it will come up later.

[…] In addition to difficulty with numbers there is the more fundamental issue that directionality does not encode reality. The directionality of the statement "Donald Trump is the 47th President of the United States" would be identical regardless of whether Donald Trump won or lost the 2024 election. Directionally speaking there is no difference between a "real" court case and a "fictitious" court case with identical details.

The idea that there is an ineffable difference between true statements and false statements, or between hallucination and imagination, is wholly a human conceit. Simply put, an LLM that doesn't "hallucinate" doesn't generate text or images at all. It's literally just a search engine with extra steps.

No, seriously? How does one address this? What does the vector-based implementation of representations in LLMs have to do with the ineffable difference between truth and falsehood that people dumber than OP allegedly believe in? If the pretraining data is consistent that Trump is the 47th president, then the model would predict as much and treat it as "truth". If we introduce a "falsehood" steering vector, it would predict otherwise. The training data is not baseline reality, but neither is any learned representation including world models in our brains. What does “literally just a search engine with extra steps” add here?

This sort of talk is confused on so many levels at once that the only valid takeaway is that the author is not equipped to reason at all.

I do not obfuscate. I understand that he's trying to insult me and others, and I call him an ignorant slut without any of that cowardly nonsense, plus I make an argument. To engage more productively, I'd have had to completely reinvent his stream of subtle jabs into a coherent text he might not even agree with. I'd rather he does that on his own.

some people just feel that they're entitled to post gibberish on why LLMs must be unintelligent

They are. People are in fact "entitled" to make arguments you think are gibberish. You can address the argument and why you think it is bad.

If you think he's being insulting you can say so and we'll take a look, but "I'm just going to come right out and say he's an ignorant slut, not like that coward" is doing you no credit.