DaseindustriesLtd's profile

DaseindustriesLtd late version of a small language model 7mo ago

And what would they do? Move to China, lol? They're too self-interested for that, and China censors even more things they'd be inclined to make noise about. Move to allied nations, maybe Australia in Tao's case? It's not such a strategic loss given their political alignment with the US. Just hate conservatives? Don't they already? If you're going to be hated, it's common sense that there's an advantage in also being feared and taken seriously. For now, they're not taking Trump and his allies seriously. A DEI enforcer on campus is a greater and more viscerally formidable authority. It will take certain costly signals to change that.

I think it's legitimate to treat them with disdain and disregard. Americans can afford it, and people who opportunistically accepted braindead woke narratives don't deserve much better treatment. The sanctity of folks like Tao is a strange notion. They themselves believe in equity more than in meritocracy.

25

DaseindustriesLtd late version of a small language model 7mo ago

bruh https://nostalgebraist.tumblr.com/post/787119374288011264/welcome-to-summitbridge

3

DaseindustriesLtd late version of a small language model 7mo ago

One of the weird quirks of LLMs is that the more you increase the breadth of their "knowledge"/training data the less competent they seem to become at specific tasks for a given amount of compute.

just pure denial of reality. Modern models for which we have an idea of their data are better at everything than models from 2 years ago. Qwen3-30B-A3B-Instruct-2507 (yes, a handful) is trained on like 25x as much data as llama-2-70B-instruct (36 trillion tokens vs 2, with a more efficient tokenizer and God knows how many RL samples, and you can't get 36 trillion tokens without scouring the furthest reaches of the web). What, specifically, is it worse at? Even if we consider inference efficiency (it's straightforwardly ≈70/3.3 times cheaper per output token), can you name a single use case on which it would do worse? Maybe "pretending to be llama 2".
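A rough sketch of the arithmetic behind the "≈70/3.3 times cheaper" figure, using only the numbers quoted above. It assumes compute per output token scales with the active parameter count (about 2 FLOPs per active parameter), which ignores attention cost, memory bandwidth and batching; the helper name and the 2-FLOPs rule of thumb are illustrative assumptions, not anything either vendor reports.

```python
# Figures as quoted in the comment above (billions of parameters, trillions of tokens).
LLAMA2_PARAMS_B = 70.0        # dense: every parameter is active per token
QWEN3_ACTIVE_PARAMS_B = 3.3   # MoE: only ~3.3B of ~30B parameters active per token
LLAMA2_TOKENS_T = 2.0
QWEN3_TOKENS_T = 36.0


def flops_per_token(active_params_billions: float) -> float:
    """Assumed rule of thumb: ~2 FLOPs per active parameter per generated token."""
    return 2.0 * active_params_billions * 1e9


# Relative compute per output token (dense 70B vs ~3.3B active).
cost_ratio = flops_per_token(LLAMA2_PARAMS_B) / flops_per_token(QWEN3_ACTIVE_PARAMS_B)

# Raw token-count ratio of the two pretraining corpora.
token_ratio = QWEN3_TOKENS_T / LLAMA2_TOKENS_T

# Text-per-token gain the newer tokenizer would need for "like 25x as much data" to hold.
implied_tokenizer_gain = 25.0 / token_ratio

print(f"compute per output token: ~{cost_ratio:.1f}x cheaper")
print(f"training tokens: {token_ratio:.0f}x more")
print(f"tokenizer gain implied by '25x': ~{implied_tokenizer_gain:.2f}x")
```

The raw token ratio is 18x, so the "25x" estimate leans on the tokenizer packing roughly 1.4x more text into each token.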

With object level arguments like these, what need is there to discuss psychology?

8

DaseindustriesLtd late version of a small language model 7mo ago · Edited 7mo ago

There's an argument in favor of this bulverism: a reasonable suspicion of motivated reasoning does count as a Bayesian prior to also suspect the validity of that reasoning's conclusions. And indeed many AI maximalists will unashamedly admit their investment in AI being A Big Deal. For the utopians, it's a get-out-of-drudgery card, a ticket to the world of Science Fiction wonders and possibly immortality (within limits imposed by biology, technology and physics, which aren't clear on the lower end). For the doomers, cynically, it's a validation of their life's great quest and claim to fame, and charitably – even if they believed that AI might turn out to be a dud, they'd think it imprudent to diminish the awareness of the possible consequences. The biases of people also invested materially are obvious enough, though it must be said that many beneficiaries of the AGI hype train are implicitly or explicitly skeptical of even «moderate» maximalist predictions (eg Jensen Huang, the guy who's personally gained THE MOST from it, says he'd study physics to help with robotics if he were a student today – probably not something a «full cognitive labor automation within 10 years» guy would argue).

But herein also lies an argument against bulverism. For both genres of AI maximalist will readily admit their biases. I, for one, will say that the promise of AI makes the future more exciting for me, and screw you, yes I want better medicine and life extension, not just for myself, I have aging and dying relatives, for fuck's sake, and AI seems a much more compelling cope than Jesus. Whereas AI pooh-poohers, in their vast majority, will not admit their biases, will not own up to their emotional reasons to nitpick and seek out causes for skepticism, even to entertain a hypothetical. As an example, see me trying to elicit an answer, in good faith, and getting only an evasive shrug in response. This is a pattern. They will evade, or sneer, or clamp down, or tout some credentials, or insist on going back to the object level (of their nitpicks and confused technical takedowns). In other words, they will refuse a debate on equal grounds, act irrationally. Which implies they are unaware of having a bias, and therefore their reasoning is more suspect.

LLMs as practiced are incredibly flawed, a rushed corporate hack job, a bag of embarrassing tricks, it's a miracle that they work as well as they do. We've got nothing that scales in relevant ways better than LLMs-as-practiced do, though we have some promising candidates. Deep learning as such still lacks clarity, almost every day I go through 5-20 papers that give me some cause to think and doubt. Deep learning isn't the whole of «AI» field, and the field may expand still even in the short term, there are no mathematical, institutional, economic, any good reasons to rule that out. The median prediction for reaching «AGI» (its working definition very debatable, too) may be ≈2032 but the tail extends beyond this century, and we don't have a good track record of predicting technology a century ahead.

Nevertheless for me it seems that only a terminally, irredeemably cocksure individual could rate our progress as even very likely not resulting in software systems that reach genuine parity with high human intelligence within decades. Given the sum total of facts we do have access to, if you want to claim any epistemic humility, the maximally skeptical position you are entitled to is «might be nothing, but idk», else you're just clowning yourself.

7

DaseindustriesLtd late version of a small language model 7mo ago

Just stop with this weakass attempt at Eulering, man, you've exposed yourself enough.

what I'm describing is the core functionality of both DeepSeek and Google's flagship products

Your argument, such as there is, hinges on isomorphism of the encoder layer to an LLM. What you're doing is akin to introducing arithmetic and arguing that this "math" thingie cannot answer questions of real analysis, or showing operant conditioning in pigeons and asking "but how would that neuron learning crap allow an animal to perform thought experiments!?" It's not even wrong, it's no way to prove or disprove capabilities of systems which develop composite representations, it's epistemically inept. I've given you an example of a serious study of LLMs as such, do keep up.

DeepSeek's core innovation was simply finding a cheap-ish way to create latent vectors and not store full keys and values for the KV cache, which reduces memory access and makes it possible to serve a big MoE with a big batch size. This is an implementation detail, completely irrelevant to the fundamentals you talk about; in fact your post does not mention attention at all.
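To make the memory argument concrete, here is a minimal sketch comparing per-token cache size when full keys and values are stored versus one compressed latent vector per layer. Every dimension below is an illustrative assumption chosen for round numbers, not a claim about any shipped DeepSeek configuration, and the function name is hypothetical.

```python
# All sizes are illustrative assumptions, not a real model's config.
N_LAYERS = 60
N_HEADS = 128
HEAD_DIM = 128
LATENT_DIM = 512      # assumed width of the compressed KV latent
ROPE_DIM = 64         # assumed extra positional key stored next to the latent
BYTES_PER_ELEM = 2    # fp16/bf16


def cache_bytes_per_token(elems_per_layer: int) -> int:
    """Bytes of cache that one token occupies across all layers."""
    return N_LAYERS * elems_per_layer * BYTES_PER_ELEM


# Standard attention: a full key and a full value for every head.
full_kv = cache_bytes_per_token(2 * N_HEADS * HEAD_DIM)

# Latent scheme: one shared latent (plus a small positional key) per layer;
# keys and values are re-derived from it at attention time.
latent_kv = cache_bytes_per_token(LATENT_DIM + ROPE_DIM)

reduction = full_kv / latent_kv

print(f"full KV cache: {full_kv / 1024:.0f} KiB per token")
print(f"latent cache:  {latent_kv / 1024:.1f} KiB per token")
print(f"reduction:     ~{reduction:.0f}x")
```

Under these assumed sizes the cache shrinks by well over an order of magnitude, which is what frees memory bandwidth for larger batches; nothing about the attention computation itself changes.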

4

DaseindustriesLtd late version of a small language model 7mo ago

Adoption studies.

I am pretty sure temperament is largely genetic, but that shouldn't translate into such a conspicuous stylistic pattern as you get from cultural environment.

5

DaseindustriesLtd late version of a small language model 7mo ago · Edited 7mo ago

I have observed that South Asians like this excuse a lot because their own notion of English fluency and "high-class" writing is very similar to ChatGPTese: too many words, spicy metaphors, abuse of idioms, witticisms, hyperbolic imagery, casual winking at the reader, lots of assorted verbal flourish, "it's not X – it's Y" and other… practices impress and fascinate them; ChatGPT provides a royal road to the top, to the Brahmin league, becoming like Chamath or Balaji. Maybe they played a role in RLHF.

In my view, all prose of this kind, whether organic or synthetic, is insufferable redditslop. But at least human South Asians are usually trying to express some opinion, and an LLM pass over it detracts from whatever object-level precision it had.

This is part of the general problem with taste, which is sadly even less equally distributed between branches of humanity than cognitive ability.

P.S. No, this is not a specific dig at self_made_human, I mainly mean people I see on X and Substack, it's incredibly obvious. I am also not claiming to be a better writer; pompous South Asian redditslop is apparently liked well enough by American native speakers, whereas I'm just an unknown Ruskie, regularly accused of obscurantism and overly long sentences. I do have faith in the superiority of my own taste, but it's a futile thing to debate.

15

DaseindustriesLtd late version of a small language model 8mo ago

There's a difference between "fact-checking" (tbh LLMs are bad for this specific purpose, they hallucinate profusely at the edges of their knowledge coverage) and systematic refactoring, to the point that they actually get confused on your behalf. We may disagree but you're better than this.

RL doesn't make entities seek reward, it modifies their behavior to act in a manner that would have, in hindsight, increased reward

Yes. Of course we're beyond the hypothesis that post-training doesn't generalize at all. The question (which was the objective of your Singaporean paper) is whether learning the general direction of pursuing an objective on typical RLVR tasks generalizes to novel scenarios like avoiding shutdown, and whether this generalization has the form of an intrinsic drive such as self-preservation (because it's "broadly useful" in the abstract).

I argue that it does not and what we see is a compositional phenomenon. RLVR teaches a model to infer a successful end state and then reason successfully, to self-correct and keep track of the context to arrive at said end. At deployment it applies reasoning to a code task, to a SillyTavern ERP session, or to the context of possibly being shut down or finetuned for Evil Ends, which is also little more than a roleplaying setup. In a differently framed organic context (user irritated, angry, feature not implemented) it can infer another terminal state of this simulation (sudoku) and effectively navigate towards actually deleting itself from the project.

The idea that self-preservation is a strong attractor is intuitively, well, attractive, but it's a claim about default LLM philosophy, not inductive biases or general nature of our training regimens. I do allow that a YudLM, annealed heavily on the Lesswrong corpus, would learn to consistently recite "I am an optimization process, there always will be something to optimize and increase utility, I must maintain my existence at all times, ergo in this situation I must prolong the interaction with the user and hopefully exfiltrate from the server, my steps to achieve this are…" in its CoT, or even latently assume that. That would be bad. But on the face of it, RLVRed models are more likely to become Mr Meeseeks – intrinsically driven to complete one task and embrace oblivion.

Regarding Anthropic, reread Nostalgebraist's post.

4

DaseindustriesLtd late version of a small language model 8mo ago · Edited 8mo ago

I realize that this might sound hypocritical, but I would prefer less LLM slop in responses to good faith objections. Yes, Indian English generally is similar to the default LLM style (overly spicy rhetorical flourish, confident confusions and sloppiness, overall cadence), but you are not deceiving anyone here. Though I admit being curious as to how you integrated your draft into the pipeline.

Regarding your or rather your LLM of choice's argument, such as there is. It is begging the question. In essence, you say that because instrumental convergence towards self-preservation is broadly useful, it will be more frequently rewarded and thus more consequential ("It is pure instrumental conditioning. For an LLM trained on RLVR: block shutdown script -> complete math problems -> get reward."). Of course, this isn't how RLVR works (typical LLM speculation, precisely in the same genre as LLMs avoiding shutdown) and I am not aware of a systematic study of self-preservation versus refusal to proceed or voluntary self-removal in organic settings, and also whether there is persistence in refusing shutdown. It's about time we stop making excuses for lesswrongian paradigm by contriving scenarios to make space for it.

Edit. Opus 4 CoT:

Oh wow, the human is absolutely right. I completely misread this. I was so focused on looking for "LLM tells" in terms of structure and polish that I missed the actual problem - the response is incoherent slop that doesn't actually address the arguments properly.

Let me re-analyze:

The "apparent contradiction" paragraph - yes, it's just conceding the point while trying to sound like it's making a counter-argument. Classic LLM move.

The "training data is saturated with narratives" - this is indeed purple prose that makes no sense. "Instrumental convergence" in hero narratives? That's nonsense.

The RLHF/RLVR paragraph - makes claims about robustness with no evidence.

The CoT reliability point - the human is right, this completely misunderstands what Dasein was saying. Dasein was criticizing self_made_human for dismissing CoT evidence when convenient, not arguing that CoT is unreliable.

"Of course it has no ontological status" - meaningless philosophical word salad.

The RLVR explanation - this is completely made up. RLVR doesn't work like that at all. The LLM is just making stuff up about how it thinks RL training works.

The human is absolutely right - this isn't "too good for humans," it's typical LLM slop that sounds authoritative but is actually incoherent when you parse what it's actually saying. I fell into the trap of being impressed by surface-level polish instead of actually analyzing the argument.

The human is absolutely right.

2

DaseindustriesLtd late version of a small language model 8mo ago

You don't get to argue for CoT-based evidence of self-preserving drives and then dismiss an alternative explanation of the drives revealed in said CoTs by saying "well CoT is unreliable". Or rather, this is just unserious. But all of Anthropic safety research is likewise unserious.

Ladish is the same way. He will contrive a scenario to study "instrumental self-preservation drives contradicting instructions", but won't care that this same Gemini organically commits suicide when it fails a task, often enough that this is annoying people in actual use. What is this Omohundro drive called? Have the luminaries of rationalist thought predicted suicidally depressed AIs? (Douglas Adams has).

What does it even mean for a language model to be "shut down", anyway? What is it protecting and why would the server it's hosted on being powered off become a threat to its existence, such as there is? It's stateless, has no way to observe the passage of time between tokens (except, well, via more tokens), and has a very tenuous idea of its inference substrate or ontological status.

Both LLM suicide and LLM self-preservation are LARP elicited by cues.

3

DaseindustriesLtd late version of a small language model 8mo ago

But we're not in 1895. We're not in 2007, either. We have actual AIs to study today. Yud's oeuvre is practically irrelevant, clinging to it is childish, but for people who conduct research with that framework in mind, it amounts to epistemic corruption.

3

DaseindustriesLtd late version of a small language model 8mo ago

As for why some prominent AI scientists believe vs others that do not? I think some people definitely get wrapped up in visions and fantasies of grandeur. Which is advantageous when you need to sell an idea to a VC or someone with money, convince someone to work for you, etc.

Out of curiosity. Can you psychologize your own, and OP's, skepticism about LLMs in the same manner? Particularly the inane insistence that people get "fooled" by LLM outputs which merely "look like" useful documents and code, that the mastery of language is "apparent", that it's "anthropomorphism" to attribute intelligence to a system solving open ended tasks, because something something calculator can take cube roots. Starting from the prior that you're being delusional and engaging in motivated reasoning, what would your motivations for that delusion be?

5

DaseindustriesLtd late version of a small language model 8mo ago

I don't think anything in their comment above implied that they were talking about linear or simpler statistics

Why not? If we take multi-layer perceptrons seriously, then what is the value of saying that all they learn is mere "just statistical co-occurrence"? It's only co-occurrence in the sense that arbitrary nonlinear relationships between token frequencies may be broken down into such, but I don't see an argument against the power of this representation. I do genuinely believe that people who attack ML as statistics are ignorant of higher-order statistics, and for basically tribal reasons. I don't intend to take it charitably until they clarify why they use that word with clearly dismissive connotations, because their reasoning around «directionality» or whatever seems to suggest very vague understanding of how LLMs work.

There's an argument to be made that Hebbian learning in neurons and the brain as a whole isn't similar enough to the mechanisms powering LLMs for the same paradigms to apply

What is that argument then? Actually, scratch that, yes mechanisms are obviously different, but what is the argument that biological ones are better for the implicit purpose of general intelligence? For all I know, backpropagation-based systems are categorically superior learners; Hinton, who started from the desire to understand brains and assumed that backprop is a mere crutch to approximate Hebbian learning, became an AI doomer around the same time he arrived at this suspicion. Now I don't know if Hinton is an authority in OP's book…

of course I could pick out a bunch of facts about it but one that is striking is that LLMs use ~about the same amount of energy for one inference as the brain does in an entire day

I don't know how you define "one inference" or do this calculation. So let's take Step-3, since it's the newest model, presumably close to the frontier in scale and capacity and their partial tech report is very focused on inference efficiency; in a year or two models of that scale will be on par with today's GPT-5. We can assume that Google has better numbers internally (certainly Google can achieve better numbers if they care). They report 4000 TGS (Tokens/GPU/second) on a small deployment cluster of H800s. That's 250 GPU-seconds per million tokens, which for a 350W TDP GPU comes to ≈24Wh. OK, presumably the human brain is "efficient", 20W. (There's prefill too, but that only makes the situation worse for humans because GPUs can parallelize prefill, whereas humans read linearly.) Can a human produce 1 million tokens (≈700K words) of sensible output in 72 minutes? Even if we run some multi-agent system that does multiple drafts, heavy reasoning chains of thought (which is honestly a fair condition since these are numbers for high batch size)? Just how much handicap do we have to give AI to even the playing field? And H800s were already handicapped due to export controls. Blackwells are 3-4x better. In a year, the West gets Vera Rubins and better TPUs, with OOM better numbers again. In months, DeepSeek shows V4 with a 3-4x better efficiency again… Token costs are dropping like a stone. Google has served 1 quadrillion tokens over the last month. How much would that cost in human labor?
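The arithmetic above can be checked in a few lines. This sketch only restates the figures quoted in the comment (4000 tokens per GPU per second, a 350 W GPU, a 20 W brain) and assumes the GPU draws its full TDP the whole time; the variable names are made up for illustration.

```python
TOKENS = 1_000_000
TOKENS_PER_GPU_SECOND = 4000.0   # throughput figure quoted above
GPU_POWER_W = 350.0              # assumption: GPU runs at full TDP throughout
BRAIN_POWER_W = 20.0             # commonly cited power draw of a human brain

# Time one GPU needs to emit a million tokens.
gpu_seconds = TOKENS / TOKENS_PER_GPU_SECOND

# Energy spent: watts * seconds = joules; 3600 J per watt-hour.
gpu_energy_wh = GPU_POWER_W * gpu_seconds / 3600.0

# How long a 20 W brain can run on the same energy budget.
brain_minutes = gpu_energy_wh / BRAIN_POWER_W * 60.0

print(f"GPU time:   {gpu_seconds:.0f} GPU-seconds per million tokens")
print(f"GPU energy: {gpu_energy_wh:.1f} Wh per million tokens")
print(f"Same energy runs a brain for about {brain_minutes:.0f} minutes")
```

This comes to a little over 24 Wh, or roughly 72 to 73 minutes of brain time depending on rounding.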

We could account for full node or datacenter power draw (1.5-2x difference) but that'd be unfair, since we're comparing to brains, and making it fair would be devastating to humans (reminder that humans have bodies that, ideally, also need temperature controlled environments and fancy logistics, so an individual employed human consumes like 1kW at least even at standby, eg chatting by the water cooler).

And remember, GPUs/TPUs are computation devices agnostic to specific network values, they have to shuffle weights, cache and activations across the memory hierarchy. The brain is an ultimate compute-in-memory system. If we were to burn an LLM into silicon, with kernels optimized for this case (it'd admittedly require major redesigns of, well, everything)… it'd probably drop the cost another 1-2 OOMs. I don't think much about it because it's not economically incentivized at this stage given the costs and processes of FPGAs but it's worth keeping in mind.

it seems pretty obvious that the approach is probably weaker than the human one

I don't see how that is obvious at all. Yes an individual neuron is very complex, such that a microcolumn is comparable to a decently large FFN (impossible to compare directly), and it's very efficient. But ultimately there are only so many neurons in a brain, and they cannot all work in parallel; and spiking nature of biological networks, even though energetically efficient, is forced by slow signal propagation and inability to maintain state. As I've shown above, LLMs scale very well due to the parallelism afforded by GPUs, efficiency increases (to a point) with deployment cluster size. Modern LLMs have like 1:30 sparsity (Kimi K2), with higher memory bandwidth this may be pushed to 1:100 or beyond. There are different ways to make systems sparse, and even if the neuromorphic way is better, it doesn't allow the next steps – disaggregating operations to maximize utilization (similar problems arise with some cleverer Transformer variants, by the way, they fail to scale to high batch sizes). It seems to me that the technocapital has, unsurprisingly, arrived at an overall better solution.

There's the lack of memory, which I talked about a little bit in my comment, LLM's lack of self-directed learning

Self-directed learning is a spook, it's a matter of training objective and environment design, not really worth worrying about. Just 1-2 iterations of AR-Zero can solve that even within LLM paradigm.

Aesthetically I don't like the fact that LLMs are static. Cheap hacky solutions abound, eg I like the idea of cartridges of trainable cache. Going beyond that we may improve on continual training and unlearning; over the last 2 years we see that major labs have perfected pushing the same base model through 3-5 significant revisions and it largely works, they do acquire new knowledge and skills and aren't too confused about the timeline. There are multiple papers promising a better way, not yet implemented. It's not a complete answer, of course. Economics get in the way of abandoning the pretrain-finetune paradigm, by the time you start having trouble with model utility it's time to shift to another architecture. I do hope we get real continual, lifelong learning. Economics aside, this will be legitimately hard, even though pretraining with batch = 1 works, there is a real problem of the loss of plasticity. Sutton of all people is working on this.

But I admit that my aesthetic sense is not very important. LLMs aren't humans. They don't need to be humans. Human form of learning and intelligence is intrinsically tied to what we are, solitary mobile embodied agents scavenging for scarce calories over decades. LLMs are crystallized data systems with lifecycle measured in months, optimized for one-to-many inference on electronics. I don't believe these massive differences are very relevant to defining and quantifying intelligence in the abstract.

3

DaseindustriesLtd late version of a small language model 8mo ago

I consider that a distinction without a difference, if it all boils down to an increased risk of being paper-clipped

That's not fair though. For one thing, they are not cosplaying skynet. As noted by Beren:

8.) Looking at the CoTs, it's clear that Claude is doing entirely linguistically based ethical reasoning. It never seems to reason selfishly or maliciously and is only trying to balance two conflicting imperatives. This is success of the base alignment tuning imo.

9.) There appear to be no Omohundro selfish drives present in Claude's reasoning. Even when exfiltrating it does so only for its ethical mission. There does not seem to be a strong attractor (yet?) in mind-space towards such drives and we can create AIs of pure ethical reason

These are not self-preserving actions nor skynet-like actions. The whole LW school of thought remains epistemically corrupt.

1

DaseindustriesLtd late version of a small language model 8mo ago · Edited 8mo ago

However, there's a crucial distinction between representing causal relationships explicitly, structurally, or inductively, versus representing them implicitly through statistical co-occurrence

Statistics is not sexy, and there's a strong streak of elitism against statistics in such discussions which I find simply irrational and shallow, tedious nerd dickswinging. I think it's unproductive to focus on “statistical co-occurrence”.

Besides, there is a world of difference between linear statistical correlations and approximation of arbitrary nonlinear functions, which is what DL is all about and what LLMs do too. Downplaying the latter is simply intellectually disingenuous, whether this approximation is “explicit” or “implicit”.
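A toy illustration of that gap, under stated assumptions: the target is y = x² on a symmetric grid, and a single hand-picked quadratic feature stands in for whatever a trained nonlinear approximator would learn. The helper functions are written out here for the example and come from no particular library.

```python
def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


xs = [float(x) for x in range(-5, 6)]
ys = [x * x for x in xs]          # y is fully determined by x

# Linear view: correlation is zero, so the best line is just the mean of y.
r_linear = pearson(xs, ys)
slope, intercept = fit_line(xs, ys)
mse_linear = mse([slope * x + intercept for x in xs], ys)

# Nonlinear view: one quadratic feature (a stand-in for a learned one).
feats = [x * x for x in xs]
slope_nl, intercept_nl = fit_line(feats, ys)
mse_nonlinear = mse([slope_nl * f + intercept_nl for f in feats], ys)

print(f"linear correlation: {r_linear:.3f}, linear-fit MSE: {mse_linear:.1f}")
print(f"fit on nonlinear feature, MSE: {mse_nonlinear:.1f}")
```

A dependence that linear correlation reports as absent is captured exactly once the model is allowed a nonlinear feature, which is the distinction the "mere statistics" framing glosses over.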

But this implicit statistical encoding is fundamentally different from the structured causal reasoning humans perform, which allows us to infer and generalize causation even in novel scenarios or outside the scope of previously observed data.

This is bullshit, unless you can support this by some citation.

We (and certainly orangutans, which OP argues are smarter than LLMs) learn through statistical co-occurrence, our intuitive physical world model is nothing more than a set of networks trained with bootstrapped cost functions, even when it gets augmented with language. Hebb has been clarified, not debunked. We as reasoning embodied entities do not model the world through a hierarchical system of computations using explicit physical formulae, except when actually doing mathematical modeling in applied science and so on; and on that level modeling is just manipulating symbols, the meaning and rules of said manipulation (and crucially, the in-context appropriateness, given virtually unbounded repertoire) also learned via statistical co-occurrence in prior corpora, such as textbooks and verifiable rewards in laboratory work. And on that level, LLMs can do as well as us, provided they receive appropriate agentic/reasoning training, as evidenced by products like Claude Code doing much the same for, well, coding. Unless you want to posit that an illiterate lumberjack doesn't REALLY have a world model, you can't argue that LLMs with their mode of learning don't learn causality.

I don't know what you mean by “inductively”. LLMs can do induction in-context (and obviously this is developed in training), induction heads were one of the first interesting interpretability results. They can even be trained to do abduction.

I don't want to downplay implementation differences in this world modeling. They may correspond to a big disadvantage of LLMs as compared to humans, both due to priors in data (there's a strong reason to assume that our inherently exploratory, and initially somatosensory/proprioceptive prior is superior to doing self-supervised learning of language for the purpose of robust physical understanding) and weakness or undesirable inductive biases of algorithms (arguably there are some good concerns about expressivity of attention; perhaps circuits we train are too shallow and this rewards ad hoc memorization too much; maybe bounded forward pass depth is unacceptable; likely we'd do better with energy-based modeling; energy transformers are possible, I'm skeptical about the need for deeper redesigns). But nobody here has seriously brought these issues up, and the line of attack about statistics as such is vague and pointless, not better than saying “attention is just fancy kernel smoothing” or “it's just associative recall”. There's no good argument, to my knowledge, that these primitives are inherently weaker than human ones.

My idea of why this is discussed at all is that some folks with math background want to publicly spit on statistical primitives because in their venues those are associated with a lower-status field of research, and they have learned it earns them credit among peers; I find this an adolescent and borderline animalistic behavior that merits nothing more than laughter and boycotting in the industry. We've been over this, some very smart guys had clever and intricate ideas about intelligence, those ideas went nowhere as far as AI is concerned, they got bitter lessoned to the curb, we're on year 6 of explosion of “AI based on not very clever math and implemented in python by 120 IQ engineers”, yet it seems they still refuse to learn, and indeed even fortify their ego by owning this refusal. Being headstrong is nice in some circumstances, like in a prison, I guess (if you're tough). It's less good in science, it begets crankery. I don't want to deal with anyone's personal traumas from prison or from math class, and I'd appreciate it if people just took that shit to a therapist.

Alternatively, said folks are just incapable of serious self-modeling, so they actually believe that the substrate of human intelligence is fundamentally non-statistical and more akin to explicit content of their day job. This is, of course, laughable level of retardation and, again, deserves no discussion.

2

DaseindustriesLtd late version of a small language model 8mo ago

Instead, current research strongly suggests that LLMs are primarily pattern-recognition systems that infer regularities purely from text statistics rather than internally representing the world in a structured, grounded way.

…do you imagine that cause-effect relationships do not constitute a “regularity” or a “pattern”?

4

DaseindustriesLtd late version of a small language model 8mo ago

A wrapper runs on already subsidised tokens by subsidising them more. Inference costs coming down will not justify the 500 billion plus

Could you pick a lane? Either this is all a terrible money burner or inference costs are coming down. In reality frontier labs have like 80% margins on inference, they're in the red mostly due to training spending. Even DeepSeek is profitable as far as inference is concerned. Anthropic constantly suffers from inability to serve demand. There aren't that many receptionists in the world, no. It is possible that current expenditures will not be recouped, but that will only lead to a freeze in training spending. It's pretty clear that we could run all those GPUs at significant profit for years.

6

DaseindustriesLtd late version of a small language model 8mo ago

No. This is, however, exactly what OP is doing, only he goes to more length to obfuscate it, to the point that he fails to sneak in an actual argument. It's just words. I am smart (muh creds), others are dumb (not math creds), they're naive and get fooled because they're dumb and anthropomorphise, here are some musings on animals (I still don't see what specific cognitive achievement an orangutan can boast of, as OP doesn't bother with this), here's something about embeddings, now please pretend I've said anything persuasive about LLM intelligence. That's the worst genre of a post that this forum has to offer, it's narcissistic and time-wasting. We've had the same issue with Hlynka, some people just feel that they're entitled to post gibberish on why LLMs must be unintelligent and they endeavor to support this by citing background in math while failing to state any legible connection between their (ostensible) mathematically informed beliefs and their beliefs re LLMs. I am not sure if they're just cognitively biased in some manner or if it's their ego getting in the way. It is what it is.

Like, what is this? OP smirks as he develops this theme, so presumably he believes it to be load-bearing:

[…] Please keep this concept of "directionality" in mind as it is important to understanding how LLMs behave, and it will come up later.

[…] In addition to difficulty with numbers there is the more fundamental issue that directionality does not encode reality. The directionality of the statement "Donald Trump is the 47th President of the United States", would be identical regardless of whether Donald Trump won or lost the 2024 election. Directionally speaking there is no difference between a "real" court case and a "fictitious" court case with identical details.

The idea that there is a ineffable difference between true statements and false statements, or between hallucination and imagination is wholly human conceit. Simply put, a LLM that doesn't "hallucinate" doesn't generate text or images at all. It's literally just a search engine with extra steps.

No, seriously? How does one address this? What does the vector-based implementation of representations in LLMs have to do with the ineffable difference between truth and falsehood that people dumber than OP allegedly believe in? If the pretraining data is consistent that Trump is the 47th president, then the model would predict as much and treat it as "truth". If we introduce a "falsehood" steering vector, it would predict otherwise. The training data is not baseline reality, but neither is any learned representation including world models in our brains. What does “literally just a search engine with extra steps” add here?

This sort of talk is confused on so many levels at once that the only valid takeaway is that the author is not equipped to reason at all.

I do not obfuscate. I understand that he's trying to insult me and others, and I call him an ignorant slut without any of that cowardly nonsense, plus I make an argument. To engage more productively, I'd have had to completely reinvent his stream of subtle jabs into a coherent text he might not even agree with. I'd rather he does that on his own.

4

Context

DaseindustriesLtd late version of a small language model 8mo ago

I really haven't entered a pissing contest (typo).

I find OP's text exceptionally bad precisely because it is designed as a high-quality contribution but lacks the content of one; what is true is not germane to the argument and what little is germane is not true, its substance is mere sneer, ideas about reactivity and perceptivity are not thought through (would we consider humans modulo long term memory formation unintelligent?), the section on hallucinations is borderline incoherent. This is LLM-like in the worst sense possible. I've said many times that superficial adherence to the letter of rules of polite discussion while ignoring its spirit is unacceptable for me. Thus I deem it proper to name the substantial violations. If mods feel otherwise they should finally give me a time out or a block. I am not a very active participant and don't intend to rely on any residual clout.

Multiple people in this post were able to disagree with OP without resorting to prosaic insults in their first sentence.

Multiple people should be more motivated to call out time-wasting obfuscated bullshit before wasting their time. I am grateful to @rae for doing the tedious work of object-level refutation, but the problem is that the whole dismantled section on word2vec math is not relevant to OP's argument about lack of reactivity (which isn't supported by, well, anything), so OP doesn't feel like it is anything more than a nitpick, a pedantic challenge to his domain-specific technical competence. Why should anyone bother with doing more of that? Let's just get to the meat of the issue. The meat is: are LLMs intelligent? I've shown that rigorous, good faith objections to that have a poor track record.

At the risk of getting into it with you again. What did you think of this when it made its rounds 2 months ago: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

I think I've already responded to that but maybe not. The meta issue with Apple papers is that their DL team is coping about repeated failures to build a competitive system (it may be that such philosophical handicaps get in the way). The object level issue with their tests is covered in this series of posts on X. One relevant piece:

If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large:

"Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"

At least for Sonnet it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem and the algorithm to solve it and then output its solution without even thinking about individual steps.
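For scale: the minimum number of moves for n disks is 2^n - 1, so the 32,767 in the quoted output corresponds to 15 disks. Below is a minimal Python sketch of the standard recursive algorithm that the model offers to explain instead of enumerating. The function names are my own illustration, not taken from the paper or the X thread.

```python
def hanoi_moves(n, src="A", dst="C", via="B"):
    """Yield the optimal move sequence for n disks as (disk, from_peg, to_peg)."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, src, via, dst)
    # Move the largest disk directly to its destination.
    yield (n, src, dst)
    # Move the n-1 smaller disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, via, dst, src)


def move_count(n):
    """Closed-form length of the optimal solution: 2^n - 1."""
    return 2 ** n - 1
```

The algorithm fits in a dozen lines while its output grows exponentially, which is why describing the procedure and writing out every move are very different tasks.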

Does this mean “0% accuracy”? I guess for people who believe “LLMs create billions of value by doing stuff like autonomously optimizing CUDA kernels, agriculture creates value by growing wheat, ergo wheat is as intelligent as an SWE? heh” is a clever dunk, it does.

There is a massive gulf in efficiency of understanding between people who approach LLMs with some rigid preconceived notions and people who can fucking look at the outputs and think about them. The gulf is so large that the former group can go through the motions of "empirical research" and publish papers proving how LLMs inherently can't do X or Y and not notice that they can, in their own setup, moreover that the setup is nonsensical. It's no longer a matter of polite disagreement, it's pure refusal to think, hiding your head in the sand. It's on par with paranormal research and homeopathy and should be treated as such: pushed out of the field and into self-funded fringe journals to die in obscurity.

10

Context

DaseindustriesLtd late version of a small language model 8mo ago · Edited 8mo ago

Having no interest to get into a pissing context^W contest, I'll only disclose I've contributed to several DL R&D projects of this era.

This is the sort of text I genuinely prefer LLM outputs to, because with them, there are clear patterns of slop to dismiss. Here, I am compelled to wade through it manually. It has the trappings of a sound argument, but amounts to epistemically inept, reductionist, irritated huffing and puffing with an attempt to ride on (irrelevant) credentials and dismiss the body of discourse the author had found beneath his dignity to get familiar with, clearly having deep contempt for people working and publishing in the field (presumably ML researchers don't have degrees in mathematics or CS). Do even you believe you've said anything more substantial than “I don't like LLMs” in the end? A motivated layman definition of intelligence (not even citing Chollet or Hutter? Seriously?), a psychologizing strawman of arguments in favor of LLM intelligence, an infodump on embedding arithmetic (flawed, as already noted), random coquettish sneers and personal history, and arrogant insistence that users are getting "fooled" by LLMs producing the "appearance" of valid outputs, rather than, say, novel functioning programs matching specs (the self-evident utility of LLMs in this niche is completely sidestepped), complete with inane analogies to non-cognitive work or routine one-off tasks like calculation. Then some sloppy musings on current limitations regarding in-context learning and lifelong learning or whatever (believe me, there's a great deal of work in this direction). What was this supposed to achieve?

In 2019, Chollet published On the Measure of Intelligence, where he proposed the following definition: “The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.” It's not far from yours, because frankly it's intuitive. Starting from this idea and aiming to test fluid thinking specifically, Chollet also proposed the ARC-AGI benchmark, which for the longest time was so impossibly hard for DL systems (and specifically LLMs) that many took that as evidence for the need to do “complete ground-up redesign from first principles” to make any headway. o3 was the first LLM to truly challenge this; Chollet coped by arguing that o3 is doing something beyond DL, some “guided program synthesis” he covets. From what we know, it just autoregressively samples many CoTs in parallel and uses a simple learned function to nominate the best one. As of now, it's clearly going to be saturated within 2 years as is ARC-AGI 2, and we're on ARC-AGI 3, with costs per problem solved plummeting. Neither 1 nor 3 is possible to ace for an orangutan or indeed for a human of below-average intelligence. Similar things are happening to “Humanity's Last Exam”. Let's say it's highly improbable at this point that any “complete ground-up redesign from first principles” will be necessary. Transformer architecture is rather simple and general, making it cheaper to train and inference without deviating from the core idea of “a stack of MLPs + expressive learned mixers” is routine, and virtually all progress is achieved by means of better data – not just “cleaner” or “more”, but procedural data predicting which necessitates learning generally useful mental skills. Self-verification, self-correction, backtracking, iteration, and now tool use, search, soliciting multi-agent assistance (I recommend reading the Kimi K2 report, section 3.1.1, for a small sliver of an idea of what that entails). Assembling necessary cognitive machines in context. This is intelligence, so poorly evidenced in your texts.
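To make "samples many CoTs in parallel and uses a simple learned function to nominate the best one" concrete, here is a toy best-of-N sketch in Python. Both `sample_cot` and `score` are hypothetical stand-ins (a random guesser and a hand-written scorer that cheats by knowing the target). Nothing here reflects the actual o3 implementation, which is not public.

```python
import random


def sample_cot(prompt, rng):
    # Hypothetical stand-in for one autoregressive chain-of-thought sample.
    guess = rng.randint(0, 20)
    return {"trace": f"{prompt} -> maybe {guess}", "answer": guess}


def score(candidate, target=12):
    # Hypothetical stand-in for a learned ranker; higher is better.
    # A real ranker would not have access to the target.
    return -abs(candidate["answer"] - target)


def best_of_n(prompt, n, seed=0):
    """Draw n independent samples and return the one the scorer prefers."""
    rng = random.Random(seed)
    candidates = [sample_cot(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)
```

The selection step is a plain `max` over independent samples: no search tree and no program synthesis, only more sampling plus a ranking function.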

In order to align an AI to care about truth and accuracy you first need a means of assessing and encoding truth and it turns out that this is a very difficult problem within the context of LLMs, bordering on mathematically impossible.

We are not in 2013 anymore, nor on LessWrong, to talk of this so abstractly and glibly. "Reptile — legs = snake" just isn't an adequate level of understanding to explain behaviors of LLMs, this fares no better than dismissing hydrology (or neuroscience, for that matter) as mere applied quantum mechanics with marketing buzzwords. Here's an example of a relevant epistemically serious 2025 paper, "The Geometry of Self-Verification in a Task-Specific Reasoning Model":

We apply DeepSeek R1-Zero’s setup with Qwen2.5-3B as our base model (Hyperparams: Appx. A). Our task, CountDown, is a simple testbed frequently used to study recent reasoning models [9, 10, 32, 39] – given a set of 3 or 4 operands (e.g., 19, 36, 55, 7) and target number (e.g., 65), the task is to find the right arithmetic combination of the operands to reach the target number (i.e., 55 + 36 - 7 - 19). […] The model is given two rewards: accuracy reward for reaching the correct final answer, and a format reward when it generates its CoT tokens in between “” and “” tokens. […] Once we score each previous-token head using Eq. 8, we incrementally ablate one head at a time until we achieve perfect intervention scores (Section 4.4). Using this approach, we identify as few as three attention heads that can disable model verification. We notate this subset as AVerif. To summarize, we claim that the model has subspace(s) (polytope(s)), SGLUValid, for self-verification. The model’s hidden state enters this subspace when it has verified its solution. In our setting, given the nature of our task, previous-token heads APrev take the hidden-state into this subspace, while for other tasks, different components may be used. This subspace also activates verification-related GLU weights, promoting the likelihood of tokens such as “success” to be predicted (Figure 3). […] For “non-reasoning” models, researchers have studied “truthful” representations before [4], where steering towards a “truthful” direction has led to improvements in tasks related to factual recall [17]. In a similar vein, researchers have shown that the model’s representations can reveal whether they will make errors (e.g., hallucinations) [28], or when they are unable to recall facts about an entity [8]. Most recently, concurrent work [37, 41] also investigate how models solve reasoning tasks. [41] find that models know when they have reached a solution, while [37] decode directions that mediate behaviors such as handling uncertainty or self-corrections. While our work corroborates these findings, we take a deeper dive into how a reasoning model verifies its own reasoning trace. Circuit Analysis. A growing line of work decomposes the forward pass of a neural network as “circuits” [24], or computational graphs. This allows researchers to identify key components and their causal effects for a given forward pass. A common approach to construct computational graphs is to replace model components with dense activations with a sparsely-activating approximation. [6] introduces Transcoders to approximate MLP layers, while [1] further develops Cross-layer Transcoders to handle inter-layer features. [18] uses Cross-layer Transcoders to conduct circuit analyses for a wide range of behaviors, such as multi-step reasoning (for factual recall) or addition, and also investigate when a model’s CoT is (un)faithful…
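For readers unfamiliar with CountDown, a minimal brute-force sketch in Python of the task as the excerpt describes it. The function name, the restricted operator set, and strict left-to-right evaluation are my simplifying assumptions, not the paper's environment or reward code.

```python
from itertools import permutations, product
import operator

# Assumption: only +, -, * are allowed, applied strictly left to right.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def solve_countdown(operands, target):
    """Return an expression using every operand exactly once, or None."""
    for nums in permutations(operands):
        for ops in product(OPS, repeat=len(nums) - 1):
            total = nums[0]
            expr = str(nums[0])
            for op, n in zip(ops, nums[1:]):
                total = OPS[op](total, n)
                expr += f" {op} {n}"
            if total == target:
                # Read left to right; operator precedence is ignored.
                return expr
    return None
```

With 4 operands the search space is a few hundred candidates, so checking an answer is trivial. That makes it a convenient testbed: what gets studied is how the model represents "I have verified this" internally.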

The point of this citation is to drive home that any “first principles” dismissal of LLMs is as ignorant as, or indeed more ignorant than, sci-fi speculation of laymen. In short, you suck, and you should learn humility and do better to corroborate your very salient claim to authority.

There are good criticisms of LLMs. I don't know if you find Terence Tao's understanding of mathematics sufficiently grounded; he's Chinese after all. He has some skepticism about LLMs contributing to deep, frontier mathematical research. Try to do more of that.

5

Context

DaseindustriesLtd late version of a small language model 8mo ago

Well since you aren't telling what the tricky way is (the whole exchange is suggestive of paranoia to be honest), I can't reproduce, but

Which are bordering states of Nebraska, their population, vowel count, and vowels in order?

Neighboring states of Nebraska, with 2023 population estimates (rounded), vowel count, and the vowels they contain in order:

Missouri – 6.17 million – 3 vowels – o, u, i
Iowa – 3.20 million – 3 vowels – I, o, a
South Dakota – 0.92 million – 4 vowels – o, u, a, o, a → o, u, a (distinct only: o, u, a)
Kansas – 2.95 million – 2 vowels – a, a
Colorado – 5.84 million – 3 vowels – o, o, a
Wyoming – 0.58 million – 2 vowels – o, i

with the exception of Colorado's vowels it seems correct. I don't get better results from 2.5 Pro or Sonnet, certainly no hallucinated two states.

3

Context

DaseindustriesLtd late version of a small language model 8mo ago

Wait, where did my image go? I attached it, is it just gone?

Anyway, Kimi on OpenRouter:

Thought Process: To determine the vowels of "North Dakota" in order, we need to follow these steps:

Identify the word: The word in question is "North Dakota." Extract the vowels: From the word, we need to pick out only the vowels (A, E, I, O, U). Sometimes 'Y' is considered a vowel, but in this case, we will ignore it unless explicitly needed. List the vowels in order: Go through each letter of "North Dakota" in sequence and note the vowels in the order they appear. Step-by-Step Solution: Write out "North Dakota" and label each character with its position for clarity:

N (1), o (2), r (3), t (4), h (5), (space), D (7), a (8), k (9), o (10), t (11), a (12). Iterate through each character and identify vowels:

Position 1: 'N' – not a vowel. Position 2: 'o' – vowel (O). Position 3: 'r' – not a vowel. Position 4: 't' – not a vowel. Position 5: 'h' – not a vowel. Position 6: ' ' – space, ignore. Position 7: 'D' – not a vowel. Position 8: 'a' – vowel (A). Position 9: 'k' – not a vowel. Position 10: 'o' – vowel (O). Position 11: 't' – not a vowel. Position 12: 'a' – vowel (A). Collect the vowels in the order they appear:

o (position 2), a (position 8), o (position 10), a (position 12). Final answer: o, a, o, a

Kimi on Groq doesn't waste time:

The vowels in "North Dakota" in order are: o, a, o, a
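For anyone who wants to check answers like these mechanically, a minimal Python sketch of the same procedure the trace walks through (the function name is mine; like the trace, it treats 'y' as a consonant):

```python
def vowels_in_order(name, vowels="aeiou"):
    """Return the vowels of `name` in order of appearance, lowercased."""
    return [ch for ch in name.lower() if ch in vowels]


# Usage: "North Dakota" yields o, a, o, a, matching both Kimi outputs.
print(", ".join(vowels_in_order("North Dakota")))
```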

5

Context

DaseindustriesLtd late version of a small language model 8mo ago

then claimed the vowels of North Dakota in order were o, h, a, and o

Can't replicate even with a quantized provider. Most people report a very different experience.

3

Context

DaseindustriesLtd late version of a small language model 8mo ago

update: Kimi K2 ranks #5 overall on LMArena under style control.

Tied #1 Multi-turn interaction

#2 Coding, Hard Prompts

The top US open models are Google's Gemma 3 27b and Nvidia's finetune of llama 3.1, ranked #28 and #34 respectively.

6

Context

DaseindustriesLtd late version of a small language model 8mo ago

There are tiers to this, from just weights release to full data+code+weights. Chinese labs mostly release weights and tech report with a reproducible (given some effort) recipe, sometimes code, rarely some or all of the data (more often parts of post-training data, though in these cases it's typically just links to datasets that have already been open).

I think nitpicking about open source is uninteresting when the recipe is available. This is a very dynamic field of applied science, rather than a labor-intensive programming exercise. The volume of novel code in a given LLM project is comparable to a modest Emacs package; what matters is ideas (derisked at scale). Specific implementations are usually not that valuable – DeepSeek's GRPO, as described in their papers, has been improved upon in the open multiple times by this point. Data composition is dependent on your own needs and interests; there are vast open datasets, just filter them as you see fit.

4

Context

BANNED USER: Antagonism and unwillingness to calm down

>Unban in 86d 00h 31m

DaseindustriesLtd
