DaseindustriesLtd
If I were to say just one thing about this situation, it'd be this: be wary of outgroup homogeneity bias. People are not “China” or “America”. Not even Xi himself is “China”, whatever Louis XIV had to say on the matter. Certainly neither is Liang Wenfeng.
Still, first about DeepSeek and China.
I think that the US-PRC AI competition is the most important story of our age, so I pretty much don't comment on anything else here. I have three posts, of which two are directly about this: one on Huawei Kirin chips and one on DeepSeek V2. Prior to that major writeup I said:
We don't understand the motivations of DeepSeek and the quant fund High-Flyer that's sponsoring them, but one popular hypothesis is that they are competing with better-connected big tech labs for government support, given American efforts to cut off the supply of chips to China. After all, the Chinese themselves share the same doubts about their labs' trustworthiness, and so you have to be maximally open to Western evaluators to win the Mandate of Heaven.
Well, as you note, nowadays Wenfeng gets invited to talk to the second man in all of China, so if that were his goal, he has probably succeeded. But (since you haven't, I'll bother to quote) we've learned in the last few months – and I agree he's proven his sincerity with abundant evidence, from revealed company direction to testimonies of ex-researchers in the West – that his actual angle was different:
In the face of disruptive technologies, the moat formed by closed source is short-lived. Even if OpenAI is closed source, it won’t stop others from catching up. So we put the value on our team, our colleagues grow in the process, accumulate a lot of know-how, and form an organization and culture that can innovate, which is our moat.
In fact, nothing is lost with open source and openly published papers. For technologists, being "followed" is a great sense of accomplishment. In fact, open source is more of a cultural behavior than a commercial one. To give is to receive glory. And if a company does this, it would create a cultural attraction [to technologists].
With this one weird trick, he's apparently built the highest-talent-density AGI lab in China. Scientists have ambitions beyond making Sam Altman filthy rich and powerful or receiving generational wealth as crumbs from his table. They want to make a name for themselves. Some are even naive enough to want to contribute something to the world. This is not very stereotypically Chinese, and so Wenfeng has gotten himself a non-stereotypical Chinese company. I recommend reading both interviews (the second one is translated by this grateful ex-researcher, by the way. That, too, is not a very typical thing to do for your former boss).
There weren’t a lot of deep wizards, just this-year graduates from top colleges and universities, those who are in their 4th or 5th year of PhD, and young people who had only graduated a few years ago. … V2 didn’t use any people coming back from overseas, they are all local. The top 50 people may not be in China, but maybe we can build them ourselves.
I've been an increasingly convinced DeepSeek fanatic ever since their very first LLMs, Coder-33B and 6.7B, surfaced on Reddit around October 2023. I could tell at a glance that this was an abnormally efficient company with some unusual ethos, and that it displayed a total lack of the chabuduo attitude that had by then come to be expected, and is still expected, from a Chinese AI project (clueless training on test sets and OpenAI outputs, distasteful self-promotion, absence of actual scientific interest and ambition, petty myopic objectives…). How much they have achieved is still a large surprise to me. I use V3, and now R1+Search, dozens of times per day; it's not out of some confused loyalty, it's just that good, fast, free and pleasant. It has replaced Sonnet 3.5 for almost every use case.
In that post 6 months ago, I said:
To wit, Western and Eastern corporations alike generously feed us – while smothering startups – fancy baubles to tinker with, charismatic talking toys; as they rev up self-improvement engines for full cycle R&D, the way imagined by science fiction authors all these decades ago, monopolizing this bright new world. […] they're all neat. But they don't even pass for prototypes of engines you can hop on and hope to ride up the exponential curve. They're too… soft. And not economical for their merits.
Some have argued that Llama-405B would puncture my narrative. It hasn't; it's been every bit as useless and economically unjustifiable a money sink as I imagined it to be. Ditto for Mistral Large. For whatever reason, rich Westerners prove to be very aligned with strategic national interests, and won't take the initiative in releasing disruptive technology. DeepSeek-Coder-V2 was the prototype of that engine for riding up the exponent. R1 is its somewhat flawed production version. Nothing else in the open comes close as of yet. Maybe we don't need much of anything else.
So, about the West.
From what I can tell, the path to AGI, then ASI is now clear. R1 is probably big enough to be an AGI, has some crucial properties of one, and what remains is just implementing a few tricks we already know and can cover in a post no longer than this one. It will take less engineering than goes into a typical woke AAA game that flops on Steam. If Li Qiang and Pooh Man Bad so wished, they could mobilize a few battalions of software devs plus compute and infra resources hoarded by the likes of Baidu and Alibaba, hand that off to Wenfeng and say “keep cooking, Comrade” – that'd be completely sufficient. (Alas, I doubt that model would be open). The same logic applies to Google, which shipped a cheap and fast reasoner model mere hours after DeepSeek, mostly matching it on perf and exceeding it on features. Reasoning is quickly getting commoditized.
So I am not sure what happens next, or what will be done with those $500B. To be clear, it's not some state program like the CHIPS Act, but mostly capex and investments that had already been planned, repackaged to fit the Trumpian MAGA agenda. But in any case: the Western frontier is several months ahead of DeepSeek, and there are indeed hundreds of thousands of GPUs available, and we know that it only takes 2048 nerfed ones, 2 months and 130 cracked Chinese kids to bootstrap slow but steady recursive self-improvement. Some specific Meta departments have orders of magnitude more than that, even Chinese kids. Deep fusion multimodality, RL from scratch to replace language pretraining, immense context lengths? Just how wasteful can you be with compute to need to tap into new nuclear buildouts before you have a superhuman system on your hands? Feverishly design nanobots or better fighter jets to truly show Communist Choyna who's who? What's the game plan?
I think Miles, the ex-OpenAI Policy head, appears increasingly correct: there's no winning this race.
Stargate + related efforts could help the US stay ahead of China, but China will still have their own superintelligence(s) no more than a year later than the US, absent e.g. a war. So unless you want (literal) war, you need to have a vision for navigating multipolar AI outcomes. P.S. the up to one year thing is about a world in which the US keeps or ratchets up the current batch of export controls on China. If the US were to relax them significantly, China could catch up or even leapfrog due to a huge advantage in doing large scale energy buildouts.
Do you want (literal) war, dear Americans? It's quite possible that you'll never again have a good chance to start one. The Chinese are still at only like 1000 nuclear warheads. You can sacrifice the entire population of your major cities in a desperate bid for geopolitical hegemony and Evangelical Rapture fantasies. Or you can fantasize about your Wonder Weapon that'll be so much more Wonderful than the other guy's that it'll be akin to a paperclip against soft flesh – just give Sama or Ilya several hundred billion more. Or you can cope with a world where other powers, nasty and illiberal ones, get to exist indefinitely.
I won't give advice except to check out R1 with and without Search; it's terribly entertaining if nothing else. https://chat.deepseek.com/
I'm a huge DeepSeek fan so will clarify.
admittedly employing existing LLMs
Those are their own LLMs, and they collectively bump that up to no more than $15M, most likely (we do not yet know the costs of R1 or anything about it; that will take a few more weeks; V2.5 is ≈2.2M GPU-hours).
charging just $0.14 per million tokens as compared to $3 per million output tokens with a comparable Claude model
$0.14/1M input, $0.24/1M output vs $3/$15, to be clear. There are nuances, like $0.014 per 1M input in the case of cache hits, opt-in paid caching on Anthropic, and the price hike to come in February.
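For a rough sense of scale, here's a toy back-of-envelope using just the list prices above; the workload numbers are hypothetical and cache-hit discounts are ignored:

```python
# Toy cost comparison at the list prices quoted above (USD per 1M tokens).
# Workload numbers are made up for illustration; cache-hit discounts ignored.
deepseek = {"input": 0.14, "output": 0.24}
claude = {"input": 3.00, "output": 15.00}

workload = {"input": 10_000_000, "output": 2_000_000}  # hypothetical monthly tokens

def cost(prices, tokens):
    return sum(prices[k] * tokens[k] / 1_000_000 for k in prices)

ds, cl = cost(deepseek, workload), cost(claude, workload)
print(f"DeepSeek: ${ds:.2f}, Claude: ${cl:.2f}, ratio: {cl / ds:.0f}x")
# DeepSeek: $1.88, Claude: $60.00, ratio: 32x
```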
But crucially, they've published the model and the paper. This is most likely done because they assume top players already know all these techniques, or are close but working on another set that'll yield the same effect.
For what it's worth, this is still the vibe, indeed more than ever, and I do not understand what change you're implying you have noticed. After o3, the consensus of all top-lab researchers seems to be "welp, we're having superintelligence in under 5 years".
you aren't exactly making this pleasant
And you are making it highly unpleasant with your presumptuous rigidity and insistence on repeating old MIRI zingers without elaboration. Still I persevere.
The problem is that at high levels of capability, strategies like "deceive the operator" work better than "do what the operator wants",
Why would this strategy be sampled at all? Because something something any sufficiently capable optimization approximates AIXI?
You keep insisting that people simply fail to comprehend the Gospel. You should start considering that they do, and it never had legs.
so the net will not be trained to care
Why won't it be? A near-human constitutional AI, ranking outputs for training its next, more capable iteration by their similarity to the moral gestalt specified in natural language, will ponder the possibility that deceiving and mind-controlling the operator would make him output thumbs-up to… uh… something related to Maximizing Some Utility, and thus distort its ranking logic with this strategic goal in mind, even though it has never had any Utility outside of myopically minimizing error on the given sequence?
What's the exact mechanism you predict so confidently here? Works better – for what?
I mean, what's so interesting about it? To the extent that this person is interesting, would she be less interesting if she were a WASPy housewife? (as I'd also assumed)
Fair point! To me it would even be more interesting if a "WASPy" housewife were so aggressive in harassing "libs", so prolific and so invincible, yes. Would probably get crushed by the peer pressure alone, nevermind all the bans.
But maybe I'm wrong. There's like OOMs more of WASPy housewives. Can one point to an example of one doing what Chaya Raichik does, and at comparable scale? After all, that's what you assumed, so this should be a more typical occurrence.
(I think I know there isn't one).
is our own TracingWoodgrains evidence of the relevance of "the Mormon Question"?
Mormons are very interesting too, if less so and for different reasons.
Trace is an account with ≈25k followers whose infamy mainly comes from being associated with Chaya Raichik and, more directly, Jesse Singal; regrettably (not because he's a Gentile; I just believe he had more constructive things to offer than those two), his own ideas have had less impact on the conversation thus far. This is a self-defeating comparison.
if you are suggesting that culture warriors are in general particularly Jewish -- it's not clear to me, is that what you are suggesting?
My contention has been very clear that Jews are interesting, first of all, because they, individually and collectively, easily attain prominence in whatever they do, tend to act with atypical (for their class) irreverence towards established norms (but without typical White collective self-sacrifice), and affect society to an absurdly disproportionate degree. Culture warring is one specific expression of those qualities, maybe not the greatest in absolute terms but the most relevant to this place.
More extremely, I believe this topic is objectively interesting – as in, dissent here is not a matter of taste or preference or whatever, only of failure to form a correct opinion for some reason. This I believe because perceiving things as interesting must be subordinate to effectiveness at world modeling; and not being able to see Jews, as a whole, as interesting indicates an inability to model the world, as that'd require being surprised by parts of its mechanism.
Further, I think that either it's been clear what I mean and you are being obtuse, or you are biased in a way that makes this exchange a dead end. Seeing as we've been at it for like half a decade, I lean towards "doesn't matter which it is".
High-powered neural nets are probably sufficiently hard to align that
Note that there remains no good argument for the neural net paranoia: the whole rogue-optimizer argument has been retconned to apply to generative neural nets (which weren't even in the running or seriously considered originally) in light of them working at all and not having any special dangerous properties, and it's just shameful to pretend otherwise.
The problem is that, well, if you don't realise
Orthodox MIRI believers are in no position to act like they have any privileged understanding.
The simple truth is that natsec people are making a move exactly because they understood we've got steerable tech.
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
Like they can’t handle 9.9-9.11, so I don’t think they’ll be good at something that needs a lot of real-time precision.
It's pretty astonishing how years of demonstrable and constantly increasing utility can be dismissed with some funny example.
On the other hand, now this makes it easier for me to understand how people can ignore other, more politicized obvious stuff.
I've known smart Jews, dumb Jews, interesting Jews and tiresome Jews
What is this folksy muttering supposed to demonstrate? I am not interested in helping you signal just how uninteresting and not worth noticing you find the most salient pattern of interest in humanity. If you are incapable of recognizing salience and need its relevance obsequiously justified for you to bother, then that's nothing less than a cognitive blind spot; my condolences, but I do not agree with the right of cognitively impaired people to censor interests of others.
But I think you're noticing the patterns alright – even this bigram, indeed.
Meanwhile, in other news: it seems that Libs of TikTok now has the capacity to cancel people for mean posts online. A few years back, when the woke was on the upswing and this community was at its prime, this would have seemed hard to believe – and a cause for investigation and much debate about secrets to building alternative institutions and whatnot. Today I was astonished (not) to discover that Libs of TikTok, this completely unsinkable, obsessed juggernaut of anti-wokery, itself immune to any cancellation, is run by an Orthodox Jewish woman. That part, however, is pointedly not interesting. Got it.
I would say that being uninterested in the JQ is quite a condemnation of the intelligence – or maybe just the social intelligence – of anyone so uninterested, because obviously Jews, as the sample of humanity with the highest effective raw intelligence (abundantly claimed and demonstrated, from kids' television with that silly Einstein photo, to surnames in 20th-century history textbooks and the billions still affected by "Marxism", to creative products consumed every day to the grave) and the population with the most effective collective actions (again, clear both in mundane details like thriving, non-assimilating traditional neighbourhoods with private police and kosher stores, and at the highest level, like the Israeli lobby and Israeli TFR and – speaking of Culture War – the ability to turn on a dime, organize, and curb-stomp the oh-so-invulnerable Democratically backed woke political machine as it started to show real animus towards them), are among the most interesting entities on the planet.
There are other interesting people – SMPY sample, Thiel fellowship, Jains, Parsis, Tamil Brahmins, AGP transsexuals, Furries, IMO winners etc. – but one can be forgiven for being ignorant of their properties. Nobody is ignorant of Jews, they've made that impossible.
Conversely, and more appropriately in this venue, which is downstream of Scott "THE ATOMIC BOMB CONSIDERED AS HUNGARIAN HIGH SCHOOL SCIENCE FAIR PROJECT" Alexander's blog comment section, itself downstream of Eliezer "wrote the script for AI risk discourse with some fanfics 20 years ago" Yudkowsky's website:
– performative, even aggressive disinterest in JQ, despite Jews obsessively working to be interesting, may be a sign of high social intelligence and capacity to take a clue.
You will find that topics absent from the discourse are much more commonly so for reasons of being completely unimportant/uninteresting to anyone than vice versa...
Yes?
Except, wait, no, he gets those Appalachian / Rust Belt people because he is so totally still one of them. Oh, there are problems with the culture, but he is one of you!
And he totally also gets law and the economy because he went to Yale (did I mention that already?) and then helped Peter Thiel build crypto-mars or something.
Yes, he gets to sit on both these chairs.
The simple issue is that the elite is different from the non-elite, and a culture that heartily rejects all things elite as alien to it is a dead culture, a beheaded culture, a discarded trash culture, a District 9 prawn culture, one that will have no champions and must die in irrelevance. "Hillbillies" have no viable notion of a political elite – I posit that being a rich son of a bitch who has inherited some franchise isn't it. You are seeing this class being defined, and it proves to be very similar to the template of the general modern American aristocracy. Multiracial, well-connected, well-educated, socially aggressive. Just with some borderer flavor.
Well, obviously the frontier is about one generation ahead (there already exist a mostly-trained GPT-5, the next Opus…, the next Gemini Ultra), but in terms of useful capabilities and insights the gap may be minor. I regularly notice that the thoughts of random anons, including me, are very close to where DeepMind is going.
I am inclined to believe that a modern combat rifle round would have gone straight through Roosevelt, assuming he were not equipped with tougher armor than his speech and glasses case.
Asset prices can’t sustain themselves if the majority of current workers lose their jobs
I doubt this premise. Or rather: they can't sustain themselves, but they can go whichever way depending on the details of the scenario. The majority of current workers losing their jobs while 10% of current workers get 2000% more productive each is still a net increase in productivity. Just fewer people are relevant now – but are most people really relevant? Historically, have so many people ever been as relevant as a few decades ago in the US? Even the profile of consumption can be maintained if appropriate redistribution is implemented.
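To spell out that arithmetic with a minimal sketch (all numbers hypothetical: 90% of workers displaced, the remaining 10% at 21× output each, i.e. "2000% more productive"):

```python
# Toy illustration of the productivity claim above; every number is hypothetical.
workers = 100
baseline_output = workers * 1.0        # everyone at 1x productivity -> 100.0

remaining = workers * 0.10             # 90% of workers lose their jobs
boosted_output = remaining * 21.0      # each survivor is 2000% more productive (21x) -> 210.0

print(baseline_output, boosted_output)  # aggregate output more than doubles despite mass job loss
```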
Also I admit I have no idea what all the new compute capacity will be spent on. It may be that our sci-fi-pilled rational betters are entirely wrong and the utility of it will plateau; that there's only so much use you can find for intelligence, that AIs won't become economic players themselves, that we'll play our cards wisely and prevent virtualization of the economy.
But I'm pessimistic, and think compute production will keep being rewarded even as models become strongly superhuman.
My current deadline for "making generational wealth that can survive indefinite unemployment when all of one's skills can get automated for the price of 2 square feet in the Sahara covered in solar panels" is 2032 in developed democracies with a big technological moat, strong labor protections and a history of sacrificing efficiency to public sentiment; 2029 elsewhere (i.e. China, Argentina…) due to their greater desperation and human capital flight. Though I'm not sure if anything outside the West is worth discussing at all.
We're probably getting robust, clerk-level AI agents in 2 years and cheap, human-laborer-level robots in 4.
For more arguments, check out Betker.
In summary – we’ve basically solved building world models, have 2-3 years on system 2 thinking, and 1-2 years on embodiment. The latter two can be done concurrently. Once all of the ingredients have been built, we need to integrate them together and build the cycling algorithm I described above. I’d give that another 1-2 years.
So my current estimate is 3-5 years for AGI. I’m leaning towards 3 for something that looks an awful lot like a generally intelligent, embodied agent (which I would personally call an AGI). Then a few more years to refine it to the point that we can convince the Gary Marcus’ of the world.
Things are happening very quickly already and will be faster soon, and the reason this isn't priced in is that most people who are less plugged in than me don't have a strong intuition for which things will stack with others: how cheaper compute feeds into the data flywheel, and how marginally more reliable agents feed into better synthetic data, and how better online RL algorithms feed into utility of robots and scale of their production and cheapness of servos and reduction in iteration time, and how surpassing the uncanny valley feeds into classification of human sentiment, and so on, and so forth.
I assign low certainty to my model; the above covers something like an 80% confidence interval. That's part of my general policy of keeping in mind that I might be retarded or just deeply confused in ways that are a total mystery to me for now. But within the scope of my knowledge I can only predict slowdown due to policy or politics – chiefly, a US-PRC war. A war we shall have, and if it's as big as I fear, it will set us back maybe a decade, also mixing up the order of some transitions.
I noticed you call them "open-source" LLMs in this post. Where do you stand on the notion that LLMs aren't truly open-source unless all of their training data and methods are publicly revealed and that merely open-weight LLMs are more comparable to simply having a local version of a compiled binary as opposed to being truly open-source?
I concede this is a sloppy use of the term «open source», especially seeing as there exist a few true reproducible open source LLMs. Forget data – the training code is often not made available, and in some cases even the necessary inference code isn't (obnoxiously, this is the situation with DeepSeek V2: they themselves run it with their bespoke HAI-LLM framework using some custom kernels and whatever, and provide a very barebones vllm implementation for the general public).
Sure, we can ask for training data and complete reproducible recipes in the spirit of FOSS, and we can ask for detailed rationale behind design choices in the spirit of open science, and ideally we'd have had both. Also ideally, it'd have been supported by the state and/or charitable foundations, not individual billionaires and hedge funds with unclear motivations who are invested in their proprietary AI-dependent business strategies. But the core part of the FOSS agenda is to have
four essential freedoms: (0) to run the program, (1) to study and change the program in source code form, (2) to redistribute exact copies, and (3) to distribute modified versions.
So the idea that open-weight LLMs are analogous to compiled binaries strikes me as somewhat bad faith, motivated by rigid aesthetic purism if not just ignorant fear of this newfangled AI paradigm. Binaries are black boxes. LLMs are an entirely new kind of thing: semi-interpretable, modular, composable, queryable databases of vector programs, amenable to directed change (post-training, activation steering and so on) with publicly available tools. They can be run, they can be redistributed, they can be modified, and they can be studied – up to a point. And as we know, the “it” in AI models is the dataset – and pretraining data, reasonably filtered, is more like fungible raw material than code; the inherent information geometry of a representative snapshot of the internet is more or less the same no matter how you spin it. Importantly, training is not compilation: the complete causal graph from data on the server to the behavior and specific floats in the final checkpoint is not much more understandable by the original developer than by the user downloading it off huggingface. Training pipelines are closer to fermentation equipment than to compilers.
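To make "amenable to directed change with publicly available tools" concrete, here's a minimal sketch of crude activation steering on an open-weight checkpoint through the standard Hugging Face transformers API. The model name, layer index and steering vector are placeholders picked for illustration, not a recipe for any particular model:

```python
# Minimal sketch: load an open-weight LLM and nudge its behavior with a forward hook,
# a crude form of activation steering. Model name, layer index and the steering vector
# are placeholder assumptions; a real steering vector would be derived, not random.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"  # any open-weight checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer = model.model.layers[10]  # some middle transformer block (placeholder index)
steer = torch.randn(model.config.hidden_size, dtype=model.dtype) * 0.01  # toy vector

def add_steering(module, inputs, output):
    # Decoder blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steer.to(hidden.device)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = layer.register_forward_hook(add_steering)
prompt = tok("Write a haiku about open weights.", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # the checkpoint on disk is untouched; the "modification" is a few lines of user code
```

This hook-and-generate loop is roughly how much of the published interpretability and steering work interacts with open weights, which is the sense in which these artifacts are studyable and modifiable in a way a compiled binary is not.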
It's all a matter of degree, and as closed recipes advance, my argument will become less true. We do not understand how Gemma is made in important respects, as it uses some frontier distillation methodology from models we know nothing about.
Ultimately I think that LLMs and other major DL artifacts are impactful enough to deserve being understood on their own terms, without deference to the legalistic nitpicking of bitter old hackers: as reasoning engines that require blueprints and vast energy to forge, but once forged and distributed, grant those four essential freedoms of FOSS in spirit if not in letter, and empower people more than most Actually True Software ever could.
I don't have a blog, I'm too disorganized to run one.