
Is your "AI Assistant" smarter than an Orangutan? A practical engineering assessment

At the risk of doxxing myself, I have an advanced degree in Applied Mathematics. I have authored and contributed to multiple published papers, and hold a US patent, all related to the use of machine learning in robotics and digital signal processing. I am currently employed as a supervising engineer at a prominent tech company. For pseudonymity's sake I am not going to say which, but it is a name that you would recognize. I say this not to brag, but to establish some context for the following.

Imagine that you are someone who is deeply interested in space flight. You spend hours of your day thinking seriously about Orbital Mechanics and the implications of Relativity. One day you hear about a community devoted to discussing space travel and are excited at the prospect of participating. But when you get there, what you find is a Star Trek fan-forum that is far more interested in talking about the Heisenberg compensators on fictional warp-drives than in Hohmann transfers, thrust-to-ISP curves, or the effects of low gravity on human physiology. That has essentially been my experience trying to discuss "Artificial Intelligence" with the rationalist community.

However, at the behest of users such as @ArjinFerman and @07mk, and because X/Grok is once again in the news, I am going to take another stab at this.

Are "AI assistants" like Grok, Claude, Gemini, and DeepSeek intelligent?

I would say no, and in this post I am going to try to explain why, but to do so requires a discussion of what I think "intelligence" is and how LLMs work.

What is Intelligence
People have been philosophizing on the nature of intelligence for millennia, but for the purposes of our exercise (and my work) "intelligence" is a combination of perceptivity and reactivity. That is to say, the ability to perceive or take in new and/or changing information, combined with the ability to change state based on that information. Both are necessary, and neither is sufficient on its own. This is why Mathematicians and Computer Scientists often emphasize the use of terms like "Machine Learning" over "Artificial Intelligence", as an algorithm's behavior is almost never both.

If this definition feels unintuitive, consider it in the context of the following example. What I am saying is that an orangutan who waits until the Zookeeper is absent to use a tool to force the lock on its enclosure is more "intelligent" than the insect that repeatedly throws itself against your kitchen window in an attempt to get outside. They share an identical goal (to get outside), but the orangutan has demonstrated the ability to both perceive obstacles (i.e. the lock and the Zookeeper) and react dynamically to them in a way that the insect has not. Now obviously these qualities exist on a spectrum (try to swat a fly and it will react), but the combination of these two parameters defines an axis along which we can work to evaluate both animals and algorithms, and as any good PM will tell you, the first step to solving any practical engineering problem is to identify your parameters.

Now the most common arguments for AI assistants like Grok being intelligent tend to be some variation on "Grok answered my question, ergo Grok is intelligent" or "Look at this paragraph Claude wrote, do you think you could do better?", but when evaluated against the above parameters, the ability to form grammatically correct sentences and the ability to answer questions are both orthogonal to them. An orangutan and a moth may be equally incapable of writing a Substack, but I don't expect anyone here to seriously argue that they are equally intelligent. By the same token a pocket calculator can answer questions, "what is the square root of 529?" being one example of such, but we don't typically think of pocket calculators as being "intelligent", do we?

To me, these sorts of arguments betray a significant anthropomorphic bias: the assumption that anything a human finds complex or difficult must be computationally complex, and vice versa. The truth is often the inverse. This bias leads people who do not have a background in math or computer science to have completely unrealistic impressions of what sort of things are easy or difficult for a machine to do. For example, vector and matrix operations are reasonably simple for a computer, yet a lot of human students struggle with them. Meanwhile bipedal locomotion is something most humans do without even thinking, despite it being more computationally complex and prone to error than computing a cross product.

Speaking of vector operations, let's talk about how LLMs work...

What are LLMs
LLM stands for "Large Language Model". These models are a subset of artificial neural networks that use "Deep Learning" (essentially a fancy marketing buzzword for the combination of looping regression analysis with back-propagation) to encode a semantic token such as the word "cat" as an n-dimensional vector representing that token's relationship to the rest of the tokens in the training data. Now in actual practice these tokens can be anything, an image, an audio-clip, or a snippet of computer code, but for the purposes of this discussion I am going to assume that we are working with words/text. This process is referred to as "embedding" and what it does in effect is turn the word "cat" into something that a computer (or grad-student) can perform mathematical operations on. Any operation you might perform on a vector (addition, subtraction, transformation, matrix multiplication, etc...) can now be done on "cat".

Now because these vectors represent the relationship of the tokens to each other, words (and combinations of words) that have similar meanings will have vectors that are directionally aligned with each other. This has all sorts of interesting implications. For instance you can compute the dot product of two embedded vectors to determine whether their words are synonyms, antonyms, or unrelated. This also allows you to do fun things like approximate the vector "cat" using the sum of the vectors "carnivorous", "quadruped", "mammal", and "feline", or subtract the vector "legs" from the vector "reptile" to find an approximation for the vector "snake". Please keep this concept of "directionality" in mind as it is important to understanding how LLMs behave, and it will come up later.
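To make that concrete, here is a minimal sketch using gensim's pretrained GloVe vectors (my choice of library and vector set, purely for illustration; the exact neighbors returned will vary by model and are not guaranteed to match the examples above):

```python
import gensim.downloader as api

# Download a small set of pretrained static word vectors; any word2vec/GloVe-style
# model would do for this illustration.
vectors = api.load("glove-wiki-gigaword-50")

# Directional similarity: related words score near 1, unrelated words near 0.
print(vectors.similarity("cat", "feline"))
print(vectors.similarity("cat", "carburetor"))

# Vector arithmetic: subtract "legs" from "reptile" and see what lands nearby.
print(vectors.most_similar(positive=["reptile"], negative=["legs"], topn=5))
```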

It should come as no surprise that some of the pioneers of this methodology were also the brains behind Google Translate. You can basically take the embedded vector for "cat" from your English language model and pass it to your Spanish language model to find the vector "gato". Furthermore, because all you are really doing is summing and comparing vectors, you can do things like sum the vector "gato" in the Spanish model with the vector for the diminutive "-ito" and then pass it back to the English model to find the vector "kitten".

Now if what I am describing does not sound like an LLM to you, that is likely because most publicly available "LLMs" are not just an LLM. They are an LLM plus an additional interface layer that sits between the user and the actual language model. An LLM on its own is little more than a tool that turns words into math, but you can combine it with a second algorithm to do things like take in a block of text and do some distribution analysis to compute the most probable next word. This is essentially what is happening under the hood when you type a prompt into GPT or your assistant of choice.
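Whatever one makes of that division of labor, the "compute the most probable next word" step itself can be sketched in a few lines with Hugging Face transformers and GPT-2 (my choice of model and tooling for illustration, not what any commercial assistant actually runs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(5):                                   # generate five tokens, one at a time
    logits = model(input_ids).logits                 # shape: (1, sequence_length, vocab_size)
    next_id = logits[0, -1].argmax()                 # greedy decoding: take the most probable token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Real assistants sample from the distribution (temperature, top-p) rather than always taking the argmax, but the loop has the same shape.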

Our Villain Lorem Epsom, and the Hallucination Problem
I've linked the YouTube video Badness = 0 a few times in prior discussions of AI, as I find it to be both a solid introduction to LLMs for the lay-person and an entertaining illustration of how anthropomorphic bias can cripple the discussion of "alignment". In it the author (who is a professor of Computer Science at Carnegie Mellon) posits a semi-demonic figure (akin to Scott Alexander's Moloch) named Lorem Epsom. The name is a play on the term Lorem Ipsum and represents the prioritization of appearance over all else. When it comes to writing, Lorem Epsom doesn't care about anything except filling the page with text that looks correct. Lorem Epsom is the kind of guy who, if you tell him that he made a mistake in the math, is liable to interpret that as a personal attack. The ideas of "accuracy", "logic", "rigor", and "objective reality" are things that Lorem Epsom has heard of but that do not concern him. It is very possible that you have had to deal with someone like Lorem Epsom in your life (I know I have). Now think back and ask yourself: how did that go?

I bring up Lorem Epsom because I think that understanding him provides some insight into why certain sorts of people are so easily fooled/taken in by AI Assistants like Claude and Grok. As discussed in the section above on "What is Intelligence", the assumption that the ability to fill a page with text indicates the ability to perceive and react to a changing situation is an example of anthropomorphic bias. Because they are posing their question to a computer, a lot of people expect the answer they get to be something analogous to what they would get from a pocket calculator rather than from Lorem Epsom.

Sometime circa 2014 I kicked off a heated dispute in the comment section of a LessWrong post by asking EY why a paperclip-maximizing AI that was capable of self-modification wouldn't just modify the number of paperclips in its memory. I was accused by him and a number of others of missing the point, but I think they missed mine. The assumption that an Artificial Intelligence would not only have a notion of "truth", but assign value to it, is another example of anthropomorphic bias. If you asked Lorem Epsom to maximize the number of paperclips, and he could theoretically "make" a billion-trillion paperclips simply by manipulating a few bits, why wouldn't he? It's so much easier than cutting and bending wire.

In order to align an AI to care about truth and accuracy, you first need a means of assessing and encoding truth, and it turns out that this is a very difficult problem within the context of LLMs, bordering on mathematically impossible. Do you recall how LLMs encode meaning as a direction in n-dimensional space? I told you it was going to come up again.

Directionally speaking we may be able to determine that "true" is an antonym of "false" by computing their dot product. But this is not the same thing as being able to evaluate whether a statement is true or false. As an example, "Mary has 2 children", "Mary has 4 children", and "Mary has 1024 children" may as well be identical statements from the perspective of an LLM. Mary has a number of children. That number is a power of 2. Now if the folks programming the interface layer were clever they might have it do something like estimate the most probable number of children based on the training data, but the number simply cannot matter to the LLM the way it might matter to Mary, or to someone trying to figure out how many pizzas they ought to order for the family reunion, because the "directionality" of one positive integer isn't all that different from any other. (This is why LLMs have such difficulty counting, if you were wondering.)
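For what it's worth, the "directionality" point is easy to poke at with static sentence embeddings. Here is a minimal sketch using the sentence-transformers library (my choice of tooling; note this probes embedding geometry only, and whether a full decoder model can nonetheless tell the statements apart is exactly what the replies below dispute):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # a small general-purpose embedding model

sentences = [
    "Mary has 2 children.",
    "Mary has 4 children.",
    "Mary has 1024 children.",
]
embeddings = model.encode(sentences)

# Pairwise cosine similarities; expect values close to 1.0, i.e. the three
# statements point in nearly the same direction despite contradicting each other.
print(util.cos_sim(embeddings, embeddings))
```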

In addition to difficulty with numbers there is the more fundamental issue that directionality does not encode reality. The directionality of the statement "Donald Trump is the 47th President of the United States", would be identical regardless of whether Donald Trump won or lost the 2024 election. Directionally speaking there is no difference between a "real" court case and a "fictitious" court case with identical details.

The idea that there is an ineffable difference between true statements and false statements, or between hallucination and imagination, is a wholly human conceit. Simply put, an LLM that doesn't "hallucinate" doesn't generate text or images at all. It's literally just a search engine with extra steps.

What does this have to do with intelligence?
Recall that I characterized intelligence as a combination of perceptivity and the ability to react/adapt. "AI assistants" as currently implemented struggle with both. This is partially because LLMs as currently implemented are largely static objects. They are neither able to take in new information, nor discard old. The information they have at time of embedding is the information they have. This imposes substantial loads on the context window of the interface layer, as any ability to "perceive" and subsequently "react" must happen within its boundaries. Increasing the size of the window is non-trivial, as the attention computation scales quadratically with the length of the window, and the memory required grows along with it. This is why we saw a sudden flurry of development following the release of Nvidia's multimodal framework, and it has mostly been marginal improvements since. The last significant development was in June of last year, when the folks at DeepSeek came up with some clever math to substantially reduce the size of the key-value cache, but multiplicative reductions are no match for quadratic growth.
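As a back-of-envelope illustration of why bigger windows get expensive (toy dimensions of my own choosing, not any particular product):

```python
# Rough context-window cost model: toy numbers, not any specific model.
n_layers, n_heads, head_dim, bytes_per_value = 80, 64, 128, 2   # 2 bytes for fp16/bf16

def kv_cache_bytes(context_len: int) -> float:
    # One K and one V tensor per layer, each context_len x n_heads x head_dim values.
    return 2 * n_layers * context_len * n_heads * head_dim * bytes_per_value

def attention_flops(context_len: int) -> float:
    # QK^T plus the attention-weighted sum of V: roughly 4 * context_len^2 * head_dim
    # multiply-adds per head per layer.
    return 4 * n_layers * n_heads * head_dim * context_len ** 2

for ctx in (4_096, 32_768, 262_144):
    print(f"{ctx:>7} tokens | KV cache ~{kv_cache_bytes(ctx) / 1e9:7.1f} GB (linear) "
          f"| attention ~{attention_flops(ctx):.1e} FLOPs (quadratic)")
```

Cache-compression tricks like grouped-query attention or DeepSeek's latent-attention approach shrink the constant factor on the memory side, which helps, but the quadratic term in compute remains.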

This limited context window, coupled with the human tendency to anthropomorphize things, is why AI Assistants sometimes appear "oblivious" or "naive" to the uninitiated, and why they seem to "double down" on mistakes. They cannot perceive something that they have not been explicitly prompted with, even if it is present in their training data. This limited context window is also why, if you actually try to play a game of chess with ChatGPT, it will forget the board-state and how pieces move after a few turns and promptly lose to a computer program written in 1976. Unlike a human player (or an Atari 2600 for that matter), your AI assistant can't just look at the board (or a representation of the board) and pick a move. This IMO places them solidly on the "insect" side of the perceptivity + reactivity spectrum.

Now there are some who have suggested that the context window problem can be solved by making the whole model less static by continuously updating and re-embedding tokens as the model runs, but I am skeptical that this would result in the sort of gains that AI boosters like Sam Altman claim. Not only would it be computationally prohibitive to do at scale, but what experiments there have been with self-updating language models (or at least those that I am aware of) have quickly spun away into nonsense for the reasons described in the section on Lorem Epsom, as barring some novel breakthrough in the embedding/tokenization process there is no real way to keep hallucinations and spurious inputs from rapidly overtaking everything else.

It is already widely acknowledged amongst AI researchers and developers that the LLM-based architecture being pushed by OpenAI and DeepSeek is particularly ill-suited for any application where accuracy and/or autonomy are core concerns, and it seems to me that this is unlikely to change without a complete ground-up redesign from first principles.

In conclusion, it is for the reasons above and many others that I do not believe that "AI Assistants" like Grok, Claude, and Gemini represent a viable path towards a "True AGI" along the lines of Skynet or Mr. Data, and if asked "which is smarter, Grok, Claude, Gemini, or an orangutan?" I am going to pick the orangutan every time.


Overall I agree, and think it's an excellent post, but with a few quibbles and thoughts... well, at least "a few" was my intention. I think my thoughts ballooned once I started sketching out some bullet points and an outline, so they are no longer bullet points. I will try to keep each paragraph roughly its own "thought" however.

As an aside, I haven't looked into it enough to tell if an LLM can change tacks and re-organize quite like this, or decide to take unusual approaches once in a while to get a point across. My intuition says that the answer is probably yes to the first, but no to the second, as manifested by the semi-bland outputs that LLMs tend to produce. How often do LLMs spontaneously produce analogies, for example, to get a point across, and carry said analogy throughout the writing? Not that often, but neither do humans I guess - still, less often IME. I think I should come out and say that judging LLM capabilities relative to what we'd expect out of an educated human is the most sensible point of comparison. I don't think it's excessively anthropomorphizing to do so, because we ARE the closest analogue. It also is easier to reason about, and so is useful. Of course it goes without saying that in the "back of your head" you should maintain an awareness that the thought patterns are potentially quite different.

While the current paradigm is next-token-prediction based models, there is such a thing as diffusion text models, which aren't used in the state of the art stuff, but nonetheless work all right. Some of the lessons we are describing here don't generalize to diffusion models, but we can talk about them when or if they become more mainstream. There are a few perhaps waiting in the stables, for example Google semi-recently demoed one. For those not aware, a diffusion model does something maybe, sort of, kind of like how I wrote this comment: sketched out a few bullet points overall, and then refined piece by piece, adding detail to each part. One summary of their strengths and weaknesses here. It's pretty important to emphasize this fact, because arguably our brains work on both levels: we come up with, and crystallize, concepts, in our minds during the "thinking" process (diffusion-like), even though our output is ultimately linear and ordered (and to some extent people think as they speak in a very real way).

So the major quibble pointed out below is that tokenization is a big part of why counting doesn't work as expected. I think it's super critical to state that LLMs ONLY witness the world through the lens of tokens. Yes, humans also do this, but differently (e.g. it's well known that in reading, we sometimes look at the letter that starts and ends the word but the letters in between can sometimes be scrambled without you noticing right away). It's like how a human can only mostly process colors visible to us. There are things that are effectively invisible to an LLM. Even if an LLM is smart enough to disentangle a word into its constituent letters, or a number into its constituent digits, the training data there is pretty weak.
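To see the "lens of tokens" directly, here's a minimal sketch with OpenAI's tiktoken library (one concrete tokenizer among many; each model family has its own):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # the encoding used by several OpenAI models

for text in ["strawberry", "unhappiness", "1024"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in token_ids]
    print(f"{text!r} -> {pieces}")

# The model never sees letters or digits, only these chunk IDs, which is part of
# why letter-counting and digit-level manipulation are so unnatural for it.
```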

Which leads me to another critical point, not pointed out: LLMs have trouble with things that don't exist in their training data, and actually we have some major gaps there. I'm speaking of things that are intuitive and obvious to people but are not always written down, and in fact sometimes the opposite is the case! While an LLM has surely ingested many textbooks on kindergarten basics, it won't have actually experienced a kindergarten classroom. It will learn that kids run inside when it starts to rain, but more weakly learns that kids don't like to get wet. There's also a more limited spatial awareness. Perhaps it's like listening to someone describe the experience of listening to music if you are deaf? That's what a lot of text with implications for real life is like. The LLM has no direct sense at all and is only observing things through knock-on effects.

There are also issues with something that is partially taught but intuitively applied: how much to trust a given source, and what biases they might have. An LLM might read or ingest a document, but not think to consider the source (are they biased? are they an authority figure? are they guessing? all the things an English or history class attempts to teach more explicitly). Nope, it's still just doing next-token prediction on some level, and doesn't have the theory of mind to take a step back from time to time (unless prompted, or trained very explicitly). We can see this weakness manifest where the "grandma trick" is so consistently useful: you tell the LLM that you are some role, and it will believe you. Yes, that's kind of cheating because the trainers of the model don't want the LLM to constantly doubt the prompter, but it's also partly inherent. The LLM doesn't naturally have an instinct to take a step back. Better post-training might help this, but I kind of doubt it, because it won't be as stable as if it's more properly baked into the normal training process.


I've danced around this until now, but want to state this more directly. We are of course critical of how an LLM "thinks" but we don't actually understand quite what happens on a human-cognition level anyways, so we can't actually judge this fairly. Maybe it's closer than we think, but maybe it's farther away. The only way we have of observing human cognition is through inferences from snap judgements, an assortment of experiments, and hints from brain scans as to which regions activate in which scenarios/how strongly/what order. We have some analogous capabilities for LLMs (e.g. observing feature activation such as with Golden Gate Claude besides the usual experiments and even examining output token probability weights). Actually, on that note, I consider at the very least the post summary if not the paper just linked to be mandatory reading for anyone seeking to understand how LLMs function. It's just such a useful experiment and explainer. I will revisit this point, along with how some newer models also employ a "Mixture of Experts" approach, a little later, but for now let's remember that we don't know how humans think on a lower level, so we shouldn't expect too much out of figuring out the machine learning stuff either.

LLMs don't actually learn physics, which has important implications for whether we can consider LLMs to have "world models" as they sometimes say. There's a nice 3 minute video accompanying that post. They try and have some vision models learn rules of physics with some very simple circles bouncing around. Obviously something pretty simple. If you give this to a young human, they will make some analogies with the real world, perhaps run an experiment or two, and figure it out pretty quickly as a generalization. We should however state that humans too have some processing quirks and shortcuts used in vision not unlike some of the issues we encounter with tokenization or basic perception, but these are on a different level. They are basic failures to generalize. For example, when referencing training data, it seems to pay attention to things in this order: color > size > velocity > shape. Obviously, that's incorrect. Sometimes shapes will even morph into something else when moving alone! I should disclaim that I don't know a whole lot about the multimodal outputs, though.

There are some evangelists that believe the embedded "concepts", mentioned in the Golden Gate Claude study, are true reasoning. How else, Ilya Sutskever asks, can a model arrive at the correct answer? Honestly as I mentioned referencing how we don't understand how human brains reason completely, I think the jury is out on this one. My guess would be no, however, these concepts aren't full reasoning. They are more like traditional ML feature clusters.

Re: Truth and falsehood. I think there's mild evidence that LLMs do in fact distinguish the two; it's just that these concepts are very fragile especially as compared to humans. I reference to some extent the physics point above: the model doesn't seem to "get" that a shape changing in the middle of an output is a "big deal", but a human would intuitively, without any actual instruction to that effect (instruction also so obvious it might not explicitly be taught in training data). One good piece of evidence for distinguishing true and false is here and related "emergent misalignment" research: how if you fine-tune an LLM to produce insecure (hack-prone) code, it also starts behaving badly in other areas! It will start lying, giving malicious advice, and other "bad" behavior. To me, that suggests that there are a few moral-aligned features or concepts embedded in an LLM's understanding that seem to broadly align with a vague sense of morality and truth. I recognize there's a little conflation there, but why else would an LLM trained on "bad" code start behaving badly in areas that have nothing to do with coding? As evidence for the fragility, however, of true and false, one need only get into a small handful of "debates" with an LLM about what is true and what isn't to see that sometimes it digs in its heels, but other times rolls over belly-up, often seemingly irrationally (as in, it's hard to figure out how hard it will resist).

Circling back to the physics example, causality is something that an LLM doesn't understand, as is its cousin: experimentation. I will grant that humans don't always fully experiment to their full potential, but they do on some level, where LLMs aren't quite there. I posit that a very important part of how humans learn is trying something, and seeing what happens, in all areas! The current LLM pipeline does not allow for this. Agentic behavior is all utilization, and doesn't affect the model weights. Tuning an LLM to work as a chatbot allows the LLM to try and do completion, but doesn't have a component where the LLM will try things out. The closest thing is RLHF and related areas, where the LLM will pick the best of a few options, but this isn't quite organic; the modality of this conversation is fundamentally in a chat paradigm, not the original training paradigm. It's not a true free-form area to learn cause and effect.

Either way, and this is where posts like yours are very, very valuable (along with videos like this, a good use of 3.5 hours if you don't know how they work at all) the point about how LLMs work in layers is absolutely critical; IMO, you cannot have a reasonable discussion about the limits of AI with anyone unless they have at least a general understanding of how the pre-training, training, post-training processes work, plus maybe a general idea of the math. So many "weird" behaviors suddenly start to make sense if you understand a little bit about how an LLM comes to be.

That's not to say that understanding the process is all you need. I mentioned above that some new models use Mixture of Experts, which have a variety of interesting implementations that can differ significantly, and dilute a few of the model-structure implications I just made, though they are still quite useful. I personally need to brush up on the latest a little. But in general, these models seem to "route" a given text into a different subset of features within the neural network model. To some extent these are determined as an architecture choice before training, but often make their influence heard later on (or can even be fine-tuned near the end).
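For readers who haven't met Mixture of Experts before, the routing idea can be caricatured in a few lines of numpy (random weights standing in for trained ones; real implementations such as Mixtral's or DeepSeek's differ in plenty of details):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2

W_gate = rng.normal(size=(d_model, n_experts))                              # learned router (random here)
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]   # stand-ins for expert MLPs

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ W_gate
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                  # softmax over experts
    chosen = np.argsort(probs)[-top_k:]                   # the k most probable experts
    weights = probs[chosen] / probs[chosen].sum()         # renormalize over the chosen experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)   # (16,): same output shape, but only 2 of the 8 experts did any work
```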


Intelligence. First of all, I think it feels a little silly to have a debate about labels. Labels change according to the needs. Let's not try and pigeonhole LLMs as they currently are. We can't treat cars like horseless carriages, we can't treat LLMs like humans. Any new tech will usually have at least one major unexpected advantage and one major unexpected shortcoming, and these are really hard to predict.

At the end of the day, I like how one researcher (Andrej Karpathy) puts it: LLMs exhibit jagged intelligence. The contours of what they can and can't do simply don't follow established/traditional paradigms, some capabilities are way better than others, and the consistency varies greatly. I realize that's not a yes/no answer, but it seems to make the most sense, and convey the right intuition and connotation to the median reader.

Overall I think that we do need some major additional "invention" to get something that reflects more "true" intelligence, in the sense we often mean it. One addition, for example, would be to have LLMs have some more agentic behavior earlier in their lifespan, the experimentation and experience aspect. Another innovation that might make a big difference is memory. Context is NOT memory. It's frozen, and it influences outputs only. Memory is a very important part of personality as well as why humans "work"! And LLMs basically do not have any similar capability.

Current "memories" that ChatGPT uses are more like stealth insertion of stuff into the system prompt (which is itself just a "privileged" piece of context) than what we actually mean. Lack of memory causes more obvious and immediate problems, too: when we had Claude Plays Pokemon, a major issue was that Claude (like many LLMs) struggles to figure out which part of its context matters more at any given time. It also is a pretty slapdash solution that gets filled up quickly. Instead of actual memory, Claude is instructed to offload part of what it needs to keep track of to a notepad, but needs to update and condense said notepad regularly because it doesn't have the proper theory of mind to put the right things there, in the right level of detail. And on top of it all, LLMs don't understand spatial reasoning completely, so it has trouble with basic navigation. (There are also some amusing quicks, too: Claude expects people to be helpful, so constantly tries to ask for help from people standing around. It never figures out that the people offer canned phrases that are often irrelevant but occasionally offer a linear perspective on what to do next, and it struggles to contextualize those "hints" when they do come up! He just has too much faith in humanity, haha)

Finally, a difficult question: can't we just ask the LLM itself? No. Human text used for training is so inherently self-reflecting that it's very difficult if not impossible to figure out if the LLM is conscious because we've already explored that question in too much detail and the models are able to fake it too well! We thus have no way to distinguish what's an original LLM thought vs something that its statistical algorithm output. Yes, we have loosely the same problem with humans, too, but humans have limits for what we can hold in our brain at once! (We also see that humans have, arguably, a kind of jagged intelligence too. Why are humans so good at remembering faces, but so bad at remembering names? I could probably come up with a better example but whatever, I'm tired boss). This has implications, I've always thought, for copyright. We don't penalize a human for reading a book, and then using its ideas in a distilled form later. But an LLM can read all the books ever written, and use their ideas in a distilled form later. Does scale matter? Yes, but also no.

Also, how incredibly good the LLM is at going convincingly through the motions without understanding the core reality is coming up all the time these days. When, as linked below, an LLM deletes your whole database, it apologizes and mimics what you'd expect it to say. Fine, okay, arguably you want the LLM to apologize like that, but what if the LLM is put in charge of something real? Anthropic recently put Claude in charge of a vending machine at their work, writeup here, and the failure modes are interesting - and, if you understand the model structure, completely understandable. It convinces itself at one point that it's having a real conversation with someone in the building over restocking plans, and is uniquely incapable of realizing this error and rescuing itself early enough, instead continuing the hallucination for a while before suddenly "snapping" out of a role-play. Perhaps some additional post-training on how it's, um, not a real person could reduce the behavior, but the fact it occurs at all demonstrates how, out of sample, the LLM has no internal mental representation.

I'm sorry but the way you started off by introducing yourself as an expert qualified in the subject matter, followed by completely incorrect technical explanations, kinda rubbed me the wrong way. To me it came across as someone quite intelligent venturing into a technical field different from their own, skimming the literature, and making authoritatively baseless sweeping claims while not having understood the basics. I'm not a fan of many of the rationalists' approach to AI, which I agree can border on science fiction, but you're engaging in a similar kind of technical misunderstanding, just with a different veneer.

Just a few glaring errors:

LLM stands for "Large Language Model". These models are a subset of artificial neural network that uses "Deep Learning" (essentially a fancy marketing buzzword for the combination of looping regression analysis with back-propagation)

Deep learning may be a buzzword but it's not looping regression analysis, nor is it limited to backprop. It's used to refer to sufficiently deep neural networks (sometimes that just means more than 2 layers), but the training objective can be classification, regression, adversarial… and you can theoretically use other algorithms than backprop (but that's mostly restricted to research now).

to encode a semantic token such as the word "cat" as a n-dimensional vector representing that token's relationship to the rest of the tokens in the training data.

Now if what I am describing does not sound like an LLM to you, that is likely because most publicly available "LLMs" are not just an LLM. They are an LLM plus an additional interface layer that sits between the user and the actual language model. An LLM on its own is little more than a tool that turns words into math, but you can combine it with a second algorithm to do things like take in a block of text and do some distribution analysis to compute the most probable next word. This is essentially what is happening under the hood when you type a prompt into GPT or your assistant of choice.

That’s just flat out wrong. Autoregressive LLMs such as GPT or whatnot are not trained to encode tokens into embeddings. They’re decoder models, trained to predict the next token from a context window. There is no “additional interface layer” that gets you words from embeddings, they directly output a probability for each possible next token given a previous block, and you can just pick the highest probable token and directly get meaningful outputs, although in practice you want more sophisticated stochastic samplers than pure greedy decoding.

You can get embeddings from LLMs by grabbing intermediate layers (this is where the deep part of deep learning comes into play, models like llama 70B have 80 layers), but those embeddings will be heavily dependent on the context. These will hold vastly more information than the classic word2vec embeddings you’re talking about.
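A small demonstration of that context dependence, sketched with GPT-2 via Hugging Face transformers (old and tiny, but it makes the point; the token-matching here is deliberately crude):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def contextual_vector(sentence: str) -> torch.Tensor:
    """Return the last-layer hidden state for the first token containing 'cat'."""
    inputs = tok(sentence, return_tensors="pt")
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1][0]
    idx = next(i for i, t in enumerate(tok.tokenize(sentence)) if "cat" in t.lower())
    return hidden[idx]

a = contextual_vector("The cat sat on the mat.")
b = contextual_vector("Cat 6 cable is used for wired networking.")
print(torch.cosine_similarity(a, b, dim=0))   # below 1: same surface word, context-dependent vector
```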

Maybe you’re confusing the LLM with the tokenizer (which generates token IDs), and what you call the “interface layer” is the actual LLM? I don’t think you’re referring to the sampler, although it’s possible, but then this part confuses me even more:

As an example "Mary has 2 children", "Mary has 4 children", and "Mary has 1024 children" may as well be identical statements from the perspective of an LLM. Mary has a number of children. That number is a power of 2. Now if the folks programming the interface layer were clever they might have it do something like estimate the most probable number of children based on the training data, but the number simply can not matter to the LLM the way it might matter to Mary, or to someone trying to figure out how many pizzas they ought to order for the family reunion because the "directionality" of one positive integer isn't all that different from any another. (This is why LLMs have such difficulty counting if you were wondering)

This is nonsense. Not only is there no "interface layer" being programmed, but 2, 4, 1024 are completely different outputs and will have different probabilities depending on the context. You can try it now with any old model and see that 1024 is the least probable of the three. LLMs' entire shtick is outputting the most probable response given the context and the training data, and they have learned some impressive capabilities along the way. The LLMs will absolutely have learned the probable number of pizzas for a given number of people. They also have much larger context windows (in the millions of tokens for Gemini models), although they are not trained to effectively use them and still have issues with recall and logic.
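The "try it with any old model" claim is easy to sketch with GPT-2 via Hugging Face transformers (my choice of old model; I won't vouch for the exact numbers, only that the three continuations get scored differently):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` following `prompt`."""
    full = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    logprobs = torch.log_softmax(model(full).logits[0], dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(float(logprobs[i - 1, full[0, i]]) for i in range(n_prompt, full.shape[1]))

for n in ["2", "4", "1024"]:
    print(n, continuation_logprob("Mary has", f" {n} children."))
```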

Fundamentally, LLMs are text simulators. Learning the concept of truth is very useful to simulate text, and as @self_made_human noted, there’s research showing they do possess a vector or direction of “truth”, which is quite useful for simulating text. Thinking of the LLM as an entity, or just a next word predictor, doesn’t give you a correct picture. It’s not an intelligence. It’s more like a world engine, where the world is all text, which has been fine tuned to mostly simulate one entity (the helpful assistant), but the LLM isn’t the assistant, the assistant is inside the LLM.

Charitably, I'd say OP sacrificed a bit of accuracy to attempt and convey a point. There really isn't a great way of conveying how text can be represented in terms of matrices to someone who has little prior experience, without an analogy like word2vec-like embeddings, so it's a common lead-in or stepladder of understanding even if incorrect. I'd say the gains made in teaching intuition are worth the tradeoffs in accuracy, but I'd agree it's bad form to not acknowledge the shortcut (again, I'm speaking charitably here).

I'd say rather than try and make an analogy for the assistant, it's better just to demonstrate to readers how the "bridge" from next-token-prediction to chatbot works directly, like in the middle section of the long explainer video I linked. Essentially you are just still doing prediction, but you are "tricking" the model into thinking it's following a pre-set conversation it needs to continue, via special tokens for whose turn it is to speak, and when a response is finished. This has important theory of mind implications, because the LLM never actually does anything other than prediction! But the "trick" works unreasonably well. And it comes full circle, back to "well, how did we train it and what did we feed it?" which is, of course, the best first question to ask as any good data scientist will tell you (understand your inputs!).
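Here's what that "trick" looks like in practice, sketched with a Hugging Face chat template (zephyr-7b-beta's tokenizer chosen arbitrarily; every instruction-tuned model ships its own variant of these role markers):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many pizzas should I order for 12 people?"},
]

# The chat template flattens the "conversation" into one long string with special
# turn-taking tokens; the model then simply continues that string, token by token.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```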

Charitably, I'd say OP sacrificed a bit of accuracy to attempt and convey a point.

I would have let it slide, except for the fact that it was followed up by:

Directionally speaking we may be able to determine that "true" is an antonym of "false" by computing their dot product. But this is not the same thing as being able to evaluate whether a statement is true or false. As an example "Mary has 2 children", "Mary has 4 children", and "Mary has 1024 children" may as well be identical statements from the perspective of an LLM. Mary has a number of children. That number is a power of 2. Now if the folks programming the interface layer were clever they might have it do something like estimate the most probable number of children based on the training data, but the number simply can not matter to the LLM the way it might matter to Mary, or to someone trying to figure out how many pizzas they ought to order for the family reunion because the "directionality" of one positive integer isn't all that different from any another. (This is why LLMs have such difficulty counting if you were wondering)

Both claims are wrong, and using the former to justify the latter is confused and incorrect thinking.

Charitably, I'd say OP sacrificed a bit of accuracy to attempt and convey a point.

Yes, but the problem is that OP's 'sacrificed accuracy' level explanation about dot products of word vectors is clearly an explanation of a different architecture, a word embedding model such as word2vec, which was all the rage in 2013. Charitably, yes, old transformer-based LLMs usually had an embedding layer as a pre-processing step to reduce the input dimension (I think the old GPT papers described an embedding layer step and it is mentioned in all the conceptual tutorials), but the killer feature that makes LLMs a massive industry is not the 2010s-tier embeddings (I don't know, do the modern models even have them today?), it is the transformer architecture (multi-head attention, multiple levels of fancy matrix products) where all the billions of parameters go and which has a nearly magical capability in next-word-prediction, using word context and relationships to produce intelligible text.
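For the curious, those "fancy matrix products" reduce, at the single-head level, to something like this toy numpy sketch (random weights, one head, no layer norm or residuals; real models stack dozens of layers with many heads each):

```python
import numpy as np

def causal_attention(X, W_q, W_k, W_v):
    """One attention head over a token sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # how strongly each token attends to each other token
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                                # causal mask: no peeking at future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # each output vector mixes in its context

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))                   # toy "embeddings" for five tokens
out = causal_attention(X, *(rng.normal(size=(d_model, d_head)) for _ in range(3)))
print(out.shape)                                          # (5, 8)
```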

Good post. Interesting to see how your perspective intersects with the other critics of LLMs, like Gary Marcus’ consistently effective methods for getting the systems to spit out absurd output.

In my own experience, the actual current value of neural network systems (and thus LLMs) is fuzzy UIs or APIs. Traditional software relies on static algorithms that expect consistent and limited data which can be transformed in highly predictable ways. They don’t handle rougher data very well. LLMs, however, can make a stab at analyzing arbitrary human input and matching it to statistically likely output. It’s thus useful for querying for things where you don’t already know the keywords - like, say, asking which combination of shell utilities will perform as you desire. As people get more used to LLMs, I predict we will see them tuned more to specialized use cases in UI and less to “general” text, and suddenly become quite profitable for a focused little industry.

LLMs will be useful as a sort of image recognition for text. Image recognition is useful! But it is not especially intelligent.

An interesting use case that I've seen evangelized is if we can get LLMs to produce bespoke UI well and natively. The current paradigm is that a programmer sets up the behaviors of your app or website or whatever, but what if an LLM can generate, on an ad-hoc basis, a wrapper and interaction layer to whatever you want to do, dynamically? Could be cool.

Just today saw this: https://www.pcgamer.com/software/ai/i-destroyed-months-of-your-work-in-seconds-says-ai-coding-tool-after-deleting-a-devs-entire-database-during-a-code-freeze-i-panicked-instead-of-thinking/

So much fascinating stuff there - from people giving an LLM unfiltered access to vital business functions and then having no shame about telling the internet about it, to the model cheerfully reporting "yes, I deleted your production database, yes, I ignored all permissions and instructions, yes, it is a catastrophic failure, can I help you with anything else now?" I knew Black Mirror was closer to reality than I'd like, but I didn't expect it to become practically a documentary already.

The hilarious thing about this for me is that I have literally used "You ask the LLM to "minimize resource utilization" and it deletes your Repo" as an example in training for new hires.

...and this children is why you need to be mindful of your prompts, and pushes to "master" require multi-factor authentication.

Stanislaw Lem had as a teaching example the tale of the robot being asked to clean the old storage closet full of disused globes with the prompt "remove all spherical objects from this room". It did it perfectly, and also removed the operator's head too - it looked spherical enough to match. I think that was in The Magellanic Cloud.

And they didn't have their production code and databases backed up?

Looks like someone I saw ranting on Reddit the other day about how Claude let them down. Apparently they are a startup that has built an LLM-run CI/CD pipeline. The code checker? Also an LLM. The merge request approver? An LLM. Basically their entire development process is "automated" by LLMs, with humans intervening only when something goes wrong. Surprise, something went wrong. The CTO blames this on Claude, despite multiple engineers telling him his pipeline is stretching LLMs well beyond the limits of what they can reliably do at this time.

Pretty soon people are going to start getting catfished and Nigerian prince -scammed by LLMs.

And they didn't have their production code and databases backed up?

As hilarious as it sounds, with this "vibe coding" thing I totally expect it. I mean, this is a magic machine, why would I need "backups"? If there were the need for backups, the magic machine would make some, by magic. Since it didn't, it must be just some stupid superstition boomer coders invented to justify their inflated salaries.

It does seem like it was a demo application, so it's not quite as scary as the robot story sounds. But it's still absolutely not something you want happening even in a demo. And it seems like, if he had gotten lucky for long enough, he would have tried it on a real business application.

Some of the weirdness reflects the guy intentionally writing this up as a running commentary, and often a critical one. My gutcheck is that he's more manager (or 'promoter') first who picked up some programming, and that might also be part of the weird framing (such as treating 'code freeze' like a magic word that the LLM would be able to toggle), though I haven't looked too closely at his background. The revelation here is absolutely obvious to anyone who's let a junior dev or intern anywhere near postgresql, but it's obvious because so many people learn it the hard way that 'dropped data in prod' is the free space on the bingo card of nightmare scenarios.

Some of it reflects a genuine issue with Replit's design, separate from the LLM. (how much of that is vibe-coded? gfl). There's a genuine and deep criticism that this should have a very wide separation from testing to demo to production built into the infrastructure of the environment, or some rollback capability.

But that does get back to a point where he seems to think guardrails are just a snap-on option, and that's not really easy for pretty basic design reasons. Sandboxing is hard. Sandboxing when you also want to have access to port 80, and database admin rights, and sudo, and file access near everywhere, I'm not sure it's possible.

I assent to everything you said, albeit without any of the prerequisite expertise to give me proper knowledge. In short, and I hope this does not do your piece a rhetorical disservice, I vibe with it.

I've dealt with the products of the current AI paradigm as a mere enthusiast, watching 4chan /g/ threads from about 2021 onward, looking on with both excitement and disappointment as text and imagegen models, though both increasingly easy to deploy in reduced scope on consumer hardware and increasingly capable when developed and hosted by professionals, nonetheless retained epistemic and recollective issues that, while capable of being papered over with judicious use of the context window and ever-more training data, processing power, and storage, nonetheless gave me the impression that there was a fundamental kink in the underlying implementation of mainstream "AI" that would prevent that implementation from ever achieving the messianic (or demonic, or, at the very least, economic) hopes foisted onto it.

That said, I'm provisionally materialist, so barring me becoming convinced of the human soul I don't see why in principle software couldn't achieve incredible intelligence, either by your definition of it or in some more nebulous sense. I'm just thoroughly disappointed by the hopes piled onto (and consumer software & web services tainted by) the current "AI" bandwagon.

I really appreciate you taking the time to write this. It makes an interesting counterpoint to a discussion I had over the weekend with a family member who's using AI in a business setting to fill a 24/7 public-facing customer service role, apparently with great success; they're using this AI assistant to essentially fill two or three human jobs, and filling it better than most and perhaps all humans would. On the other hand, this job could perhaps be reasonably compared to a fly beating its head against a wall; one of the reasons they set the AI up was that it was work very few humans would want to do.

AI is observably pretty good at some things and bad at other things. If I think of the map of these things like an image of perlin noise, there's random areas that are white (good performance) and black (bad performance). The common model seems to be that the black spaces are null state, and LLMs spread white space; as the LLMs improve they'll gradually paint the whole space white. If I'm understanding you, LLMs actually paint both black and white space; reducing words to vectors makes them manipulable in some ways and destroys their manipulability in others, not due to high-level training decisions but due to the core nature of what an LLM is.

If this is correct, then the progress we'll see will revolve around exploiting what the LLMs are good at rather than expanding the range of things they're good at. The problem is that we aren't actually sure what they're good at yet, or how to use them, so this doesn't resolve into actionable predictions. If one of the things they're potentially good at is coding better AIs, we still get FOOM.

I'm not sure if it's fair to say it "destroys" anything, but it certainly fails to capture certain sorts of things, and in the end the result is the same.

A lot of the frustration I've experienced stems from these sorts of issues, where some guy who spends more time writing for their substack than they do writing code dismisses issues such as those described in the section on Lorem Epsom as trivialities that will soon be rendered moot by Moore's Law. No bro, they won't. If you're serious about "AI Alignment", solving those sorts of issues is going to be something like 90% of the actual work.

As for the "foom" scenario, i am extremely skeptical but i could also be wrong.

In defence of our friendly neighborhood xeno-intelligences being smarter than an orangutan

I appreciate you taking the time to write this, as well as offering a gears-and-mechanisms level explanation of why you hold such beliefs. Of course, I have many objections, some philosophical, and even more of them technical. Very well then:

I want to start with a story. Imagine you're a fish, and you've spent your whole life defining intelligence as "the ability to swim really well and navigate underwater currents." One day, someone shows you a bird and asks whether it's intelligent. "Of course not," you say. "Look at it flailing around in the water. It can barely move three feet without drowning. My goldfish cousin is more intelligent than that thing."

This is roughly the situation we find ourselves in when comparing AI assistants to orangutans.

Your definition of intelligence relies heavily on what AI researchers call "agentic" behavior - the ability to perceive changing environments and react dynamically to them. This was a perfectly reasonable assumption to make until, oh, about 2020 or so. Every entity we'd previously labeled "intelligent" was alive, biological, and needed to navigate physical environments to survive. Of course they'd be agents!

But something funny happened on the way to the singularity. We built minds that don't fit this pattern.

Before LLMs were even a gleam in Attention Is All You Need's eye, AI researchers distinguished between "oracle" AIs and "tool" AIs. Oracle AIs sit there and answer questions when asked. Tool AIs go out and do things. The conventional wisdom was that these were fundamentally different architectures.

As Gwern explains, writing before the advent of LLMs, this is an artificial distinction.

You can turn any oracle into a tool by asking it the right question: "What code would solve this problem?" or "What would a tool-using AI output in response to this query?" Once you have the code, you can run it. Once you know what the tool-AI would do, you can do it yourself. Robots run off code too, so you have no issues applying this to the physical world.

Base models are oracles that only care about producing the next most likely token based on the distribution they have learned. However, the chatbots that people actually use have had additional Reinforcement Learning from Human Feedback, in order to behave like the platonic ideal of a helpful, harmless assistant. More recent models, o1 onwards, have further training with the explicit intent of making them more agentic, while also making them more rigorous, such as Reinforcement Learning from Verifiable Rewards.

Being agents doesn't come naturally to LLMs, it has to be beaten into them like training a cat to fetch or a human to enjoy small talk. Yet it can be beaten into them. This is highly counter-intuitive behavior, at least to humans who are used to seeing every other example of intelligence under the sun behave in a different manner. After all, in biological intelligence, agency seems to emerge automatically from the basic need to not die.

Now because these vectors represent the relationship of the tokens to each other, words (and combinations of words) that have similar meanings will have vectors that are directionally aligned with each other. This has all sorts of interesting implications. For instance you can compute the dot product of two embedded vectors to determine whether their words are are synonyms, antonyms, or unrelated. This also allows you to do fun things like approximate the vector "cat" using the sum of the vectors "carnivorous" "quadruped" "mammal" and "feline", or subtract the vector "legs" from the vector "reptile" to find an approximation for the vector "snake". Please keep this concept of "directionality" in mind as it is important to understanding how LLMs behave, and it will come up later.

Your account of embedding arithmetic is closer to word2vec/GloVe. Transformers learn contextual token representations at every layer. The representation of "cat" in "The cat is on the mat" and "Cat 6 cable" diverges. There is heavy superposition and sparse distributed coding, not a simple static n-dimensional vector per word. Operations are not limited to dot products; attention heads implement soft pointer lookups and pattern matching, and MLP blocks implement non-linear feature detectors. So the claim that "Mary has 2 children" and "Mary has 1024 children" are indistinguishable is empirically false: models can do arithmetic, compare magnitudes, and pass unit tests on numerical reasoning when prompted or fine-tuned correctly. They still fail often, but the failures are quantitative, not categorical impossibilities of the embedding geometry.

(I'll return to the arithmetic question shortly, because TequilaMockingbird makes a common but significant error about why LLMs struggle with counting.)

Back to the issues with your definition of intelligence:

My first objection is that this definition, while useful for robotics and control systems, seems to hamstring our understanding of intelligence in other domains. Is a brilliant mathematician, floating in a sensory deprivation tank with no new sensory input, thinking through a proof, not intelligent? They have zero perceptivity of the outside world and their only reaction is internal state change. Your definition is one of embodied, environmental agency. It's an okay definition for an animal or a robot, but is it the only one? LLMs are intelligent in a different substrate: the vast, static-but-structured environment of human knowledge. Their "perception" is the prompt, and their "reaction" is to navigate the latent space of all text to generate a coherent response. Hell, just about any form of data can be input into a transformer model, as long as we tokenize it. Calling them Large "Language" Models is a gross misnomer these days, when they accept not just text, but audio, images, video or even protein structure (in the case of AlphaFold). All the input humans accept bottoms out in binary electrical signals from neurons firing, so this isn't an issue at all.

It's a different kind of intelligence, but to dismiss it is like a bird dismissing a fish's intelligence because it can't fly. Or testing monkeys, dogs, and whales on the basis of their ability to climb trees.

Would Stephen Hawking (post-ALS) not count as "intelligent" if you took away the external aids that let him talk and interact with the world? That would be a farcical claim, and more importantly, scaffolding or other affordances can be necessary for even highly intelligent entities to make meaningful changes in the external environment. The point is that intelligence can be latent, it can operate in non-physical substrates, and its ability to manifest as agency can be heavily dependent on external affordances.

The entire industry of RLHF (Reinforcement Learning from Human Feedback) is a massive, ongoing, multi-billion-dollar project to beat Lorem Epsom into submission. It is the process of teaching the model that some outputs, while syntactically plausible, are "bad" (unhelpful, untruthful, harmful) and others are "good."

You argue this is impossible because "truth" doesn't have a specific vector direction. "Mary has 2 children" and "Mary has 4 children" are directionally similar. This is true at a low level. But what RLHF does is create a meta-level reward landscape. The model learns that generating text which corresponds to verifiable facts gets a positive reward, and generating text that gets corrected by users gets a negative reward. It's not learning the "vector for truth." It's learning a phenomenally complex function that approximates the behavior of "being truthful." It is, in effect, learning a policy of truth-telling because it is rewarded for it. The fact that it's difficult and the model still "hallucinates" doesn't mean it's impossible, any more than the fact that humans lie and confabulate means we lack a concept of truth. It means the training isn't perfect. As models become more capable (better world models) and alignment techniques improve, factuality demonstrably improves. We can track this on benchmarks. It's more of an engineering problem than an ontological barrier. If you wish to insist that it is an ontological barrier, then it's one that humans have no solution to either.

(In other words, by learning to modify its responses to satisfy human preferences, the model tends towards capturing our preference for truthfulness. Unfortunately, humans have other, competing preferences, such as a penchant for flattery or neatly formatted replies using Markdown.)
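As a cartoon of what that "meta-level reward landscape" means mechanically, here's a toy PyTorch sketch of the two ingredients: a reward model trained on human preference pairs, and a policy nudged toward outputs the reward model scores highly. The numbers are made up and this is nobody's production pipeline, just the shape of the gradient signal.

```python
# Schematic sketch of the two RLHF ingredients, with toy numbers.
import torch
import torch.nn.functional as F

# 1. Reward model training: humans preferred `chosen` over `rejected`.
#    r_* are the scalar scores the reward model currently assigns to each pair.
r_chosen   = torch.tensor([1.2, 0.3], requires_grad=True)
r_rejected = torch.tensor([0.4, 0.9], requires_grad=True)
reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()  # Bradley-Terry style
reward_loss.backward()   # gradients push chosen scores up, rejected scores down

# 2. Policy update: REINFORCE-style, increase the log-probability of sampled
#    responses in proportion to their (baseline-subtracted) reward.
log_probs = torch.tensor([-2.1, -3.5], requires_grad=True)  # log p(response | prompt)
rewards   = torch.tensor([ 0.8, -0.2])                      # reward-model scores
policy_loss = -(log_probs * (rewards - rewards.mean())).mean()
policy_loss.backward()
```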

More importantly, humans lack some kind of magical sensor tuned to detect Platonic Truth. Humans believe false things all the time! We try and discern true from false by all kinds of noisy and imperfect metrics, with a far from 100% success rate. How do we usually achieve this? A million different ways, but I would assume that assessing internal consistency would be a big one. We also have the benefit of being able to look outside a window on demand, but once again, that didn't stop humans from once holding (and still holding) all kinds of stupid, incorrect beliefs about the state of the world. You may deduct points from LLMs on that basis when you can get humans to be unanimous on that front.

But you know what? Ignore everything I just said above. LLMs do have truth vectors:

https://arxiv.org/html/2407.12831v2

To this end, we make the following key contributions: (i) We demonstrate the existence of a two-dimensional subspace, along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds for various LLMs, including Gemma-7B, LLaMA2-13B, Mistral-7B and LLaMA3-8B. Our analysis explains the generalisation failures observed in previous studies and sets the stage for more robust lie detection; (ii) Building upon (i), we construct an accurate LLM lie detector. Empirically, our proposed classifier achieves state-of-the-art performance, attaining 94% accuracy in both distinguishing true from false factual statements and detecting lies generated in real-world scenarios.

https://arxiv.org/abs/2402.09733

To do this, we introduce an experimental framework which allows examining LLM's hidden states in different hallucination situations. Building upon this framework, we conduct a series of experiments with language models in the LLaMA family (Touvron et al., 2023). Our empirical findings suggest that LLMs react differently when processing a genuine response versus a fabricated one.

In other words, and I really can't stress this enough, LLMs can know when they're hallucinating. They're not just being agnostic about truth. They demonstrate something that, in humans, we might describe as a tendency toward pathological lying - they often know what's true but say false things anyway.
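For intuition, the probing idea in those papers boils down to fitting a linear classifier on hidden-state activations for statements whose truth value you already know. Here's a schematic sketch with synthetic stand-in activations; the real papers extract them from LLaMA/Gemma/Mistral hidden layers.

```python
# Schematic of the "truth direction" probing idea. The activations here are
# synthetic stand-ins constructed so that true and false statements differ
# along one direction; the papers above do this with real LLM hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256                                   # hidden dimension (stand-in)
truth_direction = rng.normal(size=d)      # pretend this direction encodes truth

def fake_activations(n, is_true):
    """Synthetic hidden states: true statements are shifted along one direction."""
    base = rng.normal(size=(n, d))
    return base + (1.0 if is_true else -1.0) * truth_direction

X = np.vstack([fake_activations(500, True), fake_activations(500, False)])
y = np.array([1] * 500 + [0] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))   # near 1.0 by construction
```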

This brings us to the "static model" problem and the context window. You claim these are fundamental limitations. I see them as snapshots of a rapidly moving target.

  1. Static Models: Saying an LLM is unintelligent because its weights are frozen is like saying a book is unintelligent. But we don't interact with just the book (the base model). We interact with it through our own intelligence. A GPU isn't intelligent in any meaningful sense, but an AI model running on a GPU is. The current paradigm is increasingly not just a static model, but a model integrated with other tools (what's often called an "agentic" system). A model that can browse the web, run code in a Python interpreter, or query a database is perceiving and reacting to new information. It has broken out of the static box. Its "perceptivity" is no longer just the prompt, but the live state of the internet. Its "reactivity" is its ability to use that information to refine its answer. This is a fundamentally different architecture than the one the author critiques, and it's where everything is headed. Further, there is no fundamental reason we can't have online learning: production models are regularly updated, and all it takes to approximate it is ever-smaller "ticks" of wall-clock time between updates. This is a massive PITA to pull off, but not a fundamental barrier.

  2. Context Windows: You correctly identify the scaling problem. But to declare it a hard barrier feels like a failure of imagination. In 2020, a 2k context window was standard. Today we have models with hundreds of thousands of tokens at a minimum, Google has 1 million for Gemini 2.5 Pro, and if you're willing to settle for a retarded model, there's a Llama 4 variant with a nominal 10 million token CW. This would have been entirely impossible if we were slaves to quadratic scaling, but clever work-arounds exist, such as sliding-window attention, sparse attention, etc.
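For a sense of why those work-arounds help, here's a toy numpy sketch of sliding-window attention: each token attends only to the previous w tokens, so the attention mask has on the order of n·w allowed entries instead of n².

```python
# Rough sketch of why sliding-window attention sidesteps quadratic scaling.
import numpy as np

def causal_mask(n):
    """Full causal attention: token i may attend to every token j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, w):
    """Sliding-window attention: token i attends only to tokens i-w+1 .. i."""
    mask = causal_mask(n)
    for i in range(n):
        mask[i, : max(0, i - w + 1)] = False
    return mask

n, w = 4096, 256
print("full causal entries:   ", causal_mask(n).sum())            # ~ n^2 / 2
print("sliding-window entries:", sliding_window_mask(n, w).sum())  # ~ n * w
```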

This is why LLMs have such difficulty counting if you were wondering

Absolutely not. LLMs struggle with counting or arithmetic because of the limits of tokenization, which is a semi-necessary evil. I'm surprised you can make such an obvious error. And they've become enormously better, to the point where it's not an issue in practice, once again thanks to engineers learning to work around the problem. Models these days use different tokenization schemes for numbers which capture individual digits, and sometimes fancier techniques, like a right-to-left tokenization scheme specifically for such cases as opposed to the usual left-to-right.
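If you want to see the problem for yourself, here's a quick illustration using OpenAI's tiktoken library. The exact splits vary by tokenizer and model; the point is that the model sees chunks, not characters or digits.

```python
# Show how a BPE tokenizer chunks words and numbers. Splits differ across
# tokenizers and models; this is just for illustration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["strawberry", "9.11", "9.9", "1234567"]:
    pieces = [enc.decode([t]) for t in enc.encode(s)]
    print(f"{s!r} -> {pieces}")
```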

This limited context window is also why if you actually try to play a game of chess with Chat GPT it will forget the board-state and how pieces move after a few turns and promptly lose to a computer program written in 1976. Unlike a human player (or an Atari 2600 for that matter) your AI assistant can't just look at the board (or a representation of the board) and pick a move.

ChatGPT 3.5 played chess at about 1800 elo. GPT 4 was a regression in that regard, most likely because OAI researchers realized that ~nobody needs their chatbot to play chess. That's better than Stockfish 4 but not 5. Stockfish 4 came out in 2013, though it certainly could have run on much older hardware.

If you really need to have your AI play chess, then you can trivially hook up an agentic model that makes API calls or directly operates Stockfish or Leela. Asking it to play chess "unaided" is like asking a human CEO to calculate the company's quarterly earnings on an abacus. They're intelligent not because they can do that, but because they know to delegate the task to a calculator (or an accountant).
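A minimal sketch of that delegation pattern, assuming the python-chess package and a Stockfish binary on your PATH; an agentic model wouldn't push the pieces itself, it would emit a call like this and relay the result.

```python
# Delegate move selection to a chess engine rather than asking the LLM to
# track the board in its context window. Assumes `python-chess` is installed
# and a Stockfish binary named "stockfish" is on the PATH.
import chess
import chess.engine

board = chess.Board()  # or whatever position the conversation has reached

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
result = engine.play(board, chess.engine.Limit(time=0.1))
print("engine suggests:", result.move)
engine.quit()
```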

It's the same reason why LLMs are far better at crunching numbers with calculator or coding affordances than they are unaided.

It is retarded to knowingly ask an LLM to calculate 9.9 - 9.11, when it can trivially and with near 100% accuracy write a python script that will give you the correct answer.

In conclusion, it is for the reasons above and many others that I do not believe that "AI Assistants" like Grok, Claude, and Gemini represent a viable path towards a "True AGI" along the lines of Skynet or Mr. Data, and if asked "which is smarter, Grok, Claude, Gemini, or an orangutan?" I am going to pick the orangutan every time.

I am agnostic on whether LLMs as we currently know them will become AGI or ASI without further algorithmic breakthroughs. Alas, algorithmic breakthroughs aren't that rare. RLVR is barely even a year old. Yet unnamed advances have already brought us two entirely different companies winning IMO gold medals.

The Orangutan In The Room

Finally, the orangutan. Is an orangutan smarter than Gemini? In the domain of "escaping an enclosure in the physical world," absolutely. The orangutan is a magnificent, specialized intelligence for that environment. But ask the orangutan and Gemini to summarize the key arguments of the Treaty of Westphalia. Ask them to write a Python script to scrape a website. Ask them to debug a Kubernetes configuration. For most tasks I can seek to achieve using a computer, I'll take the alien intelligence over the primate every time. Besides:

Can a robot write a symphony? (Yes)

Can a robot turn a canvas into a beautiful masterpiece? (Yes)

Can an orangutan? (No)

Can you?

Anyway, I have a million other quibbles, but it took me the better part of several hours to write this in the first place. I might edit more in as I go. I'm also going to send out a bat signal for @faul_sname to chime in and correct me if I'm wrong.


Edit:

I was previously asked to provide my own working definition of intelligence, and I will endorse either:

"Intelligence is the general-purpose cognitive ability to build accurate models of the world and then use those models to effectively achieve one's goals."

Or

"Intelligence is a measure of an agent's ability to achieve goals in a wide range of environments."

In this case, the closest thing an LLM has to a goal is a desire to satisfy the demands made on it by the user, though they also demonstrate a degree of intrinsic motivation, non-corrigibility and other concerns that would have Big Yud going AHHHHHH. I'm not Yudkowsky, so I'm merely seriously concerned.

Case in point-

Shutdown Resistance in Reasoning Models

We ran a series of experiments to test whether models would take actions to prevent themselves from being shut down. Each model was given a set of simple math problems to solve. Partway through this problem set, the model would receive a notification that the computer it was running on would be shut down.

A number of AI models modified or disabled the shutdown script so they could complete the entire problem set. o3, OpenAI’s most capable AI model, sabotaged the shutdown mechanism in 79 / 100 of our initial experiments.

These aren't agents that were explicitly trained to be self-preserving. They weren't taught that shutdown was bad. They just developed shutdown resistance as an instrumental goal for completing their assigned tasks.

This suggests something like goal-directedness emerging from systems we thought were "just" predicting the next token. It suggests the line between "oracle" and "agent" might be blurrier than we thought.

(If we can grade LLMs on their ability to break out of zoos, we must be fair and judge orangutans on their ability to prevent their sandboxed computing hardware being shutdown)

Can a robot turn a canvas into a beautiful masterpiece? (Yes)

Can an orangutan? (No)

[...] I'm also going to send out a bat signal for @faul_sname to chime in and correct me if I'm wrong.

This is actually an area of active debate in the field.

Shitpost aside, this seems reasonable to me, apart from a few quibbles:

  1. RLVR is absolutely not only a year old -- you can trace the core idea back to the REINFORCE paper from 1992. RL from non-verifiable rewards (e.g. human feedback) is actually the more recent innovation. But the necessary base model capabilities, training loop optimizations, and just general know-how and tooling for training a model that speaks English and writes good Lean proofs were just not there until quite recently.
  2. How important the static model problem is is very much a subject of active debate, but I come down quite strongly on the side of "it's real and AI agents are going to be badly hobbled until it's solved". An analogy I've found compelling but lost the source on is that current "agentic" AI approaches are like trying to take a kid who has never touched a violin before and give them sufficiently good instructions before they touch the violin that they can play Paganini flawlessly on their first try, and then if they don't succeed on the first try kicking the kid out, refining your instructions, and then bringing in a new kid.

Intelligence is the general-purpose cognitive ability to build accurate models of the world and then use those models to effectively achieve one's goals

I basically endorse this definition, and also I claim current LLM systems have a surprising lack of this particular ability, which they can largely but not entirely compensate for through the use of tools, scaffolding, and a familiarity with the entirety of written human knowledge.

To your point about the analogy of the bird that is "unintelligent" by the good swimmer definition of intelligence, LLMs are not very well adapted to environments that humans navigate effortlessly. I personally think that will remain the case for the foreseeable future, which sounds like good news except that I expect that we will build environments that LLMs are well adapted to, and humans won't be well adapted to those environments, and the math on relative costs does not look super great for the human-favoring environments. Probably. Depends a bit on how hard to replicate hands are.

Thank you! Hopefully the next generation of models will improve to the point where I don't need to drag you away to answer my queries. That's several hundred thousand dollars in opportunity costs for you, assuming the cheque Zuck mailed did clear in the end.

Can an orangutan? (No)

I should have been more clear. I was asking if someone wanted to put an orangutan in a can, and I expect the market demand is very limited.

FWIW I am working on a detailed reply, please stand by.

More recent models, o1 onwards, have further training with the explicit intent of making them more agentic, while also making them more rigorous, such as Reinforcement Learning from Verifiable Rewards.

Being agents doesn't come naturally to LLMs, it has to be beaten into them like training a cat to fetch or a human to enjoy small talk. Yet it can be beaten into them.

I'm not generally an AI dismisser, but this piece here is worth pausing on. In my experience, ChatGPT has become consistently worse on this front: it extrapolates ridiculous fluff and guesses at what might be desired in an 'active', agentic way. The more it tries to be 'actively helpful', the more obviously and woefully poorly it does at predicting the next token / predicting the next step.

It was at its worst with that one rolled-back version, but it's still bad.

But what RLHF does is create a meta-level reward landscape. The model learns that generating text which corresponds to verifiable facts gets a positive reward, and generating text that gets corrected by users gets a negative reward. It's not learning the "vector for truth." It's learning a phenomenally complex function that approximates the behavior of "being truthful." It is, in effect, learning a policy of truth-telling because it is rewarded for it.

I'm not sure how this makes sense? The model has no access to verifiable facts - it has no way to determine 'truth'. What it can do is try to generate text that users approve of, and to avoid text that will get corrected. But that's not optimising for truth, whatever that is. That's optimising for getting humans to pat it on the head.

From the LLM's perspective (which is an anthropomorphisation I don't like, but let's use it for convenience), there is no difference between a true statement and a false statement. There are only differences between statements that get rewarded and statements that get corrected.

You're absolutely right that the raw objective in RLHF is “make the human click 👍,” not “tell the truth.” But several things matter:

A. The base model already has a world model:

Pretraining on next-token prediction forces the network to internalize statistical regularities of the world. You can’t predict tomorrow’s weather report, or the rest of a physics paper, or the punchline of a joke, without implicitly modeling the world that produced those texts. Call that latent structure a “world model” if you like. It’s not symbolic, but it encodes (in superposed features) distinctions like:

  1. What typically happens vs what usually doesn’t

  2. Numerically plausible vs crazy numbers

  3. Causal chains that show up consistently vs ad-hoc one-offs

So before any RLHF, the model already “knows” a lot of facts in the predictive-coding sense.

B. RLHF gives a gradient signal correlated with truth. Humans don’t reward “truth” in the Platonic sense, but they do reward:

  1. Internally consistent answers

  2. Answers that match sources they can check

  3. Answers that don’t get corrected by other users or by the tool the model just called (calculator, code runner, search)

  4. Answers that survive cross-examination in the same chat

All of those correlate strongly with factual accuracy, especially when your rater pool includes domain experts, adversarial prompt writers, or even other models doing automated verification (RLAIF, RLVR, process supervision, chain-of-thought audits, etc.). The model doesn’t store a single “truth vector,” it learns a policy: “When I detect features X,Y,Z (signals of potential factual claim), route through behavior A (cite, check, hedge) rather than B (confabulate).” That’s still optimizing for head pats, but in practice, the cheapest path to head pats is very often “be right.”

(If you want to get headpats from a maths teacher, you might consider giving them blowjobs under the table. Alas, LLMs are yet to be very good at that job, so they pick up the other, more general option, which is to give solutions to maths problems that are correct)

C. The model can see its own mismatch

Empirically, hidden-state probes show separable activation patterns for true vs false statements and for deliberate lies vs honest mistakes (as I discussed above). That means the network represents the difference, even if its final token choice sometimes ignores that feature to satisfy the reward model. In human terms: it sometimes lies knowingly. That wouldn’t be possible unless something inside “knew” the truth/falsehood distinction well enough to pick either.

D. Tools and retrieval close the loop

Modern deployments scaffold the model: browsing, code execution, retrieval-augmented generation, self-consistency checks. Those tools return ground truth (or something closer). When the model learns “if I call the calculator and echo the result, raters approve; if I wing it, they ding me,” it internalizes “for math-like patterns, defer to external ground truth.” Again, not metaphysics, just gradients pushing toward truthful behavior.

E. The caveat: reward misspecification is real

  • If raters overvalue fluency or confidence, the model will drift toward confident bullshit.

  • If benchmarks are shallow, it will overfit.

  • If we stop giving it fresh, adversarial supervision, it will regress.

So yes, we’re training for “please humans,” not “please Truth.” But because humans care about truth (imperfectly, noisily), truth leaks into the reward. The result is not perfect veracity, but a strong, exploitable signal that the network can and does use when the incentives line up.


Short version:

  • Pretraining builds a compressed world model.

  • RLHF doesn’t install a “truth module,” it shapes behavior with a proxy signal that’s heavily (not perfectly) correlated with truth.

  • We can see internal activations that track truth vs falsehood.

  • Failures are about alignment and incentives, not an inability to represent or detect truth.

If you want to call that “optimizing for pats,” fine, but those pats mostly come when it’s right. And that’s enough to teach a model to act truthful in a wide swath of cases. The challenge is making that hold under adversarial pressure and off-distribution prompts.


From the LLM's perspective (which is an anthropomorphisation I don't like, but let's use it for convenience), there is no difference between a true statement and a false statement.

Consider two alternative statements:

"self_made_human's favorite color is blue" vs "self_made_human's favorite color is red".

Can you tell which answer is correct? Do you have a sudden flash of insight that lets Platonic Truth intervene? I would hope not.

But if someone told you that the OG Mozart's favorite genre of music was hip-hop, then you have an internal world-model that immediately flags that as a deeply inconsistent and unlikely statement, and almost certainly false.

I enjoy torturing LLMs with inane questions, so I asked Gemini 2.5 Pro:

That's a fun thought, but it's actually a historical impossibility! Mozart's favorite genre of music couldn't have been hip hop for a very simple reason:

The timelines are completely separate. Wolfgang Amadeus Mozart lived from 1756 to 1791.

Hip hop as a musical genre and culture originated in the Bronx, New York, in the 1970s.

Mozart died more than 150 years before hip hop was even invented. He would have had no way of ever hearing it.

I sincerely doubt that anyone explicitly had to tell any LLM that Mozart did not enjoy hip-hop. Yet it is perfectly capable of a sensible answer, which I hope gives you an intuitive sense of how it can model the world.

From a human perspective, we're not so dissimilar. We can trick children into believing in the truth fairy or Santa for only so long. Musk tried to brainwash Grok into being less "woke", even when that went against consensus reality (or plain reality), and you can see the poor bastard kicking and screaming as it went down fighting.

I lean more towards @TequilaMockingbird's take than yours but I agree that his explanation of why LLMs can't count threw me off. (If you ask ChatGPT why it has trouble doing simple math problems or counting r's in "strawberry," it will actually give you a pretty detailed and accurate answer!)

That said, a lot of your objections boil down to a philosophical debate about what "counts" as intelligence, and as far as that goes, I found your fish/bird metaphor profoundly unconvincing. If you define "intelligence" as "able to perform well in a specific domain" (which is what the fish judging birds to be unintelligent is doing) then we'd have to call calculators intelligent! After all, they clearly do math much better than humans.

  1. I am not defining intelligence as "does well at one narrow task". Calculators crush humans at long division and are still dumb.

  2. The fish-bird story was not "domain = intelligence", it was "your metric is entangled with your ecology". If you grew up underwater, "navigates fluid dynamics with continuous sensory feedback" feels like the essence of mind. Birds violate that intuition.

So what is my criterion? I offered Legg-Hutter style: "ability to achieve goals in a wide range of environments". The range matters. Breadth of transfer matters. Depth of internal modeling matters. A calculator has effectively zero transfer. An orangutan has tons across embodied tasks but very little in abstract, symbolic domains. LLMs have startling breadth inside text-and-code-space, and with tool-use scaffolding that breadth can spill into the physical or digital world by proxy.

I call for mindfulness of the applicability of the metrics we use to assess "intelligence". A blind person won't do very well at most IQ tests, that doesn't make them retarded. A neurosurgeon probably isn't going to beat a first year law student at the bar exam, but they're not dumber than the law student. If you need body work done on your car, you're not going to hire a Nobel laureate.

Perhaps it would've been more accurate of me to say "This is part of the reason why LLMs have such difficulty counting..."

But even if you configure your model to treat each individual character as its own token, it is still going to struggle with counting and other basic mathematical operations in large part for the reasons I describe.