DaseindustriesLtd

late version of a small language model

75 followers   follows 27 users  
joined 2022 September 05 23:03:02 UTC

Tell me about it.

User ID: 745

Instead, current research strongly suggests that LLMs are primarily pattern-recognition systems that infer regularities purely from text statistics rather than internally representing the world in a structured, grounded way.

…do you imagine that cause-effect relationships do not constitute a “regularity” or a “pattern”?

A wrapper runs on already subsidised tokens by subsidising them more. Inference costs coming down will not justify the 500 billion plus

Could you pick a lane? Either this is all a terrible money burner or inference costs are coming down. In reality frontier labs have like 80% margins on inference, they're in the red mostly due to training spending. Even DeepSeek is profitable as far as inference is concerned. Anthropic constantly suffers from inability to serve demand. There aren't that many receptionists in the world, no. It is possible that current expenditures will not be recouped, but that will only lead to a freeze in training spending. It's pretty clear that we could run all those GPUs at significant profit for years.

No. This is, however, exactly what OP is doing, only he goes to more length to obfuscate it, to the point that he fails to sneak in an actual argument. It's just words. I am smart (muh creds), others are dumb (not math creds), they're naive and get fooled because they're dumb and anthropomorphise, here are some musings on animals (I still don't see what specific cognitive achievement an orangutan can boast of, as OP doesn't bother with this), here's something about embeddings, now please pretend I've said anything persuasive about LLM intelligence. That's the worst genre of post this forum has to offer: it's narcissistic and time-wasting. We've had the same issue with Hlynka; some people just feel that they're entitled to post gibberish on why LLMs must be unintelligent, and they endeavor to support this by citing a background in math while failing to state any legible connection between their (ostensibly) mathematically informed beliefs and their beliefs re LLMs. I am not sure if they're just cognitively biased in some manner or if it's their ego getting in the way. It is what it is.

Like, what is this? OP smirks as he develops this theme, so presumably he believes it to be load-bearing:

[…] Please keep this concept of "directionality" in mind as it is important to understanding how LLMs behave, and it will come up later.

[…] In addition to difficulty with numbers there is the more fundamental issue that directionality does not encode reality. The directionality of the statement "Donald Trump is the 47th President of the United States", would be identical regardless of whether Donald Trump won or lost the 2024 election. Directionally speaking there is no difference between a "real" court case and a "fictitious" court case with identical details.

The idea that there is a ineffable difference between true statements and false statements, or between hallucination and imagination is wholly human conceit. Simply put, a LLM that doesn't "hallucinate" doesn't generate text or images at all. It's literally just a search engine with extra steps.

No, seriously? How does one address this? What does the vector-based implementation of representations in LLMs have to do with the ineffable difference between truth and falsehood that people dumber than OP allegedly believe in? If the pretraining data is consistent that Trump is the 47th president, then the model would predict as much and treat it as "truth". If we introduce a "falsehood" steering vector, it would predict otherwise. The training data is not baseline reality, but neither is any learned representation including world models in our brains. What does “literally just a search engine with extra steps” add here?
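
To make the steering-vector point concrete, here is a minimal sketch of how such an intervention is typically implemented; the layer index, scale, and hook plumbing are illustrative assumptions, not any particular lab's recipe:

```python
# Minimal sketch of activation steering (illustrative, not any specific model's setup).
# A "falsehood" direction is estimated as the mean difference between hidden states of
# contrastive true/false statements, then added to the residual stream at generation time.
import torch

def estimate_direction(hidden_true: torch.Tensor, hidden_false: torch.Tensor) -> torch.Tensor:
    """hidden_*: [n_pairs, d_model] activations at some layer for true/false statements."""
    direction = (hidden_false - hidden_true).mean(dim=0)
    return direction / direction.norm()

def steering_hook(direction: torch.Tensor, alpha: float = 8.0):
    """Forward hook that nudges the residual stream along the chosen direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a HuggingFace-style decoder (layer index is arbitrary):
# handle = model.model.layers[20].register_forward_hook(steering_hook(direction))
# ... generate ...
# handle.remove()
```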

This sort of talk is confused on so many levels at once that the only valid takeaway is that the author is not equipped to reason at all.

I do not obfuscate. I understand that he's trying to insult me and others, and I call him an ignorant slut without any of that cowardly nonsense, plus I make an argument. To engage more productively, I'd have had to completely reinvent his stream of subtle jabs into a coherent text he might not even agree with. I'd rather he does that on his own.

I really haven't entered a pissing contest (typo).

I find OP's text exceptionally bad precisely because it is designed as a high-quality contribution but lacks the content of one; what is true is not germane to the argument and what little is germane is not true, its substance is mere sneer, ideas about reactivity and perceptivity are not thought through (would we consider humans modulo long-term memory formation unintelligent?), the section on hallucinations is borderline incoherent. This is LLM-like in the worst sense possible. I've said many times that superficial adherence to the letter of the rules of polite discussion while ignoring their spirit is unacceptable to me. Thus I deem it proper to name the substantial violations. If mods feel otherwise they should finally give me a time out or a block. I am not a very active participant and don't intend to rely on any residual clout.

Multiple people in this post were able to disagree with OP without resorting to prosaic insults in their first sentence.

Multiple people should be more motivated to call out time-wasting obfuscated bullshit before wasting their time. I am grateful to @rae for doing the tedious work of object-level refutation, but the problem is that the whole dismantled section on word2vec math is not relevant to OP's argument about lack of reactivity (which isn't supported by, well, anything), so OP doesn't feel like it is anything more than a nitpick, a pedantic challenge to his domain-specific technical competence. Why should anyone bother with doing more of that? Let's just get to the meat of the issue. The meat is: are LLMs intelligent? I've shown that rigorous, good faith objections to that have a poor track record.

At the risk of getting into it with you again. What did you think of this when it made its rounds 2 months ago: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

I think I've already responded to that but maybe not. The meta issue with Apple papers is that their DL team is coping about repeated failures to build a competitive system (it may be that such philosophical handicaps get in the way). The object level issue with their tests is covered in this series of posts on X. One relevant piece:

  • If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large:

"Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"

  • At least for Sonnet, it doesn't try to reason through the problem once it's above ~7 disks. It will state the problem and the algorithm to solve it and then output its solution without even thinking about individual steps.

Does this mean “0% accuracy”? I guess for people who believe “LLMs create billions of value by doing stuff like autonomously optimizing CUDA kernels, agriculture creates value by growing wheat, ergo wheat is as intelligent as an SWE? heh” is a clever dunk, it does.

There is a massive gulf in efficiency of understanding between people who approach LLMs with some rigid preconceived notions and people who can fucking look at the outputs and think about them. The gulf is so large that the former group can go through the motions of "empirical research" and publish papers proving how LLMs inherently can't do X or Y and not notice that they can, in their own setup, moreover that the setup is nonsensical. It's no longer a matter of polite disagreement, it's pure refusal to think, hiding your head in the sand. It's on par with paranormal research and homeopathy and should be treated as such: pushed out of the field and into self-funded fringe journals to die in obscurity.

Having no interest to get into a pissing context^W contest, I'll only disclose I've contributed to several DL R&D projects of this era.

This is the sort of text I genuinely prefer LLM outputs to, because with them, there are clear patterns of slop to dismiss. Here, I am compelled to wade through it manually. It has the trappings of a sound argument, but amounts to epistemically inept, reductionist, irritated huffing and puffing with an attempt to ride on (irrelevant) credentials and dismiss the body of discourse the author had found beneath his dignity to get familiar with, clearly having deep contempt for people working and publishing in the field (presumably ML researchers don't have degrees in mathematics or CS). Do even you believe you've said anything more substantial than “I don't like LLMs” in the end? A motivated layman definition of intelligence (not even citing Chollet or Hutter? Seriously?), a psychologizing strawman of arguments in favor of LLM intelligence, an infodump on embedding arithmetic (flawed, as already noted), random coquettish sneers and personal history, and arrogant insistence that users are getting "fooled" by LLMs producing the "appearance" of valid outputs, rather than, say, novel functioning programs matching specs (the self-evident utility of LLMs in this niche is completely sidestepped), complete with inane analogies to non-cognitive work or routine one-off tasks like calculation. Then some sloppy musings on current limitations regarding in-context learning and lifelong learning or whatever (believe me, there's a great deal of work in this direction). What was this supposed to achieve?

In 2019, Chollet published On the Measure of Intelligence, where he proposed the following definition: “The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.” It's not far from yours, because frankly it's intuitive. Starting from this idea and aiming to test fluid thinking specifically, Chollet also proposed the ARC-AGI benchmark, which for the longest time was so impossibly hard for DL systems (and specifically LLMs) that many took that as evidence for the need for a “complete ground-up redesign from first principles” to make any headway. o3 was the first LLM to truly challenge this; Chollet coped by arguing that o3 is doing something beyond DL, some “guided program synthesis” he covets. From what we know, it just autoregressively samples many CoTs in parallel and uses a simple learned function to nominate the best one. As of now, it's clearly going to be saturated within 2 years, as is ARC-AGI 2, and we're on ARC-AGI 3, with costs per problem solved plummeting. Neither 1 nor 3 is possible to ace for an orangutan or indeed for a human of below-average intelligence. Similar things are happening to “Humanity's Last Exam”. Let's say it's highly improbable at this point that any “complete ground-up redesign from first principles” will be necessary. The Transformer architecture is rather simple and general; making it cheaper to train and run inference on without deviating from the core idea of “a stack of MLPs + expressive learned mixers” is routine, and virtually all progress is achieved by means of better data – not just “cleaner” or “more”, but procedural data, predicting which necessitates learning generally useful mental skills. Self-verification, self-correction, backtracking, iteration, and now tool use, search, soliciting multi-agent assistance (I recommend reading the Kimi K2 report, section 3.1.1, for a small sliver of an idea of what that entails). Assembling necessary cognitive machines in context. This is intelligence, so poorly evidenced in your texts.
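
On the “samples many CoTs in parallel and uses a simple learned function to nominate the best one” point, here is a minimal sketch of that selection scheme; `generate` and `score` are stand-ins for a sampler and a learned verifier, assumed for illustration since o3's actual internals are not public:

```python
# Rough sketch of "sample many chains of thought, let a learned scorer pick one".
# `generate` and `score` are placeholder callables, not anyone's real API.
from concurrent.futures import ThreadPoolExecutor

def best_of_n(prompt: str, generate, score, n: int = 64) -> str:
    # Sample n candidate chains of thought in parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate(prompt, temperature=1.0), range(n)))
    # The "simple learned function": rank candidates and return the top one.
    return max(candidates, key=lambda cot: score(prompt, cot))
```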

In order to align an AI to care about truth and accuracy you first need a means of assessing and encoding truth and it turns out that this is a very difficult problem within the context of LLMs, bordering on mathematically impossible.

We are not in 2013 anymore, nor on LessWrong, to talk of this so abstractly and glibly. "Reptile — legs = snake" just isn't an adequate level of understanding to explain behaviors of LLMs, this fares no better than dismissing hydrology (or neuroscience, for that matter) as mere applied quantum mechanics with marketing buzzwords. Here's an example of a relevant epistemically serious 2025 paper, "The Geometry of Self-Verification in a Task-Specific Reasoning Model":

We apply DeepSeek R1-Zero’s setup with Qwen2.5-3B as our base model (Hyperparams: Appx. A). Our task, CountDown, is a simple testbed frequently used to study recent reasoning models [9, 10, 32, 39] – given a set of 3 or 4 operands (e.g., 19, 36, 55, 7) and target number (e.g., 65), the task is to find the right arithmetic combination of the operands to reach the target number (i.e., 55 + 36 - 7 - 19). […] The model is given two rewards: accuracy reward for reaching the correct final answer, and a format reward when it generates its CoT tokens in between “<think>” and “</think>” tokens. […] Once we score each previous-token head using Eq. 8, we incrementally ablate one head at a time until we achieve perfect intervention scores (Section 4.4). Using this approach, we identify as few as three attention heads that can disable model verification. We notate this subset as A_Verif.

To summarize, we claim that the model has subspace(s) (polytope(s)), S_GLUValid, for self-verification. The model’s hidden state enters this subspace when it has verified its solution. In our setting, given the nature of our task, previous-token heads A_Prev take the hidden state into this subspace, while for other tasks, different components may be used. This subspace also activates verification-related GLU weights, promoting the likelihood of tokens such as “success” to be predicted (Figure 3). […]

For “non-reasoning” models, researchers have studied “truthful” representations before [4], where steering towards a “truthful” direction has led to improvements in tasks related to factual recall [17]. In a similar vein, researchers have shown that the model’s representations can reveal whether they will make errors (e.g., hallucinations) [28], or when they are unable to recall facts about an entity [8]. Most recently, concurrent work [37, 41] also investigates how models solve reasoning tasks. [41] find that models know when they have reached a solution, while [37] decode directions that mediate behaviors such as handling uncertainty or self-corrections. While our work corroborates these findings, we take a deeper dive into how a reasoning model verifies its own reasoning trace.

Circuit Analysis. A growing line of work decomposes the forward pass of a neural network as “circuits” [24], or computational graphs. This allows researchers to identify key components and their causal effects for a given forward pass. A common approach to construct computational graphs is to replace model components with dense activations with a sparsely-activating approximation. [6] introduces Transcoders to approximate MLP layers, while [1] further develops Cross-layer Transcoders to handle inter-layer features. [18] uses Cross-layer Transcoders to conduct circuit analyses for a wide range of behaviors, such as multi-step reasoning (for factual recall) or addition, and also investigate when a model’s CoT is (un)faithful…
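
For illustration, the incremental head-ablation procedure quoted above boils down to a loop like the following; `score_heads`, `ablate`, and `verification_rate` are placeholders standing in for the paper's Eq. 8 scoring, a head-masking intervention, and their intervention metric, not the authors' actual code:

```python
# Sketch of the incremental-ablation loop described in the excerpt: rank attention
# heads by a relevance score, zero them out one at a time, and stop once the model
# no longer verifies its solution.
def find_verification_heads(model, eval_set, score_heads, ablate, verification_rate):
    # Rank heads by the placeholder scoring function (the paper's Eq. 8 analogue).
    ranked = sorted(score_heads(model, eval_set), key=lambda h: h.score, reverse=True)
    ablated = []
    for head in ranked:
        ablated.append((head.layer, head.index))
        with ablate(model, ablated):                 # mask these heads' outputs
            rate = verification_rate(model, eval_set)
        if rate == 0.0:                              # verification fully disabled
            return ablated                           # the paper reports as few as 3 heads
    return ablated
```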

The point of this citation is to drive home that any “first principles” dismissal of LLMs is as ignorant, or indeed more ignorant, than sci-fi speculation of laymen. In short, you suck and you should learn humility to do better to corroborate your very salient claim to authority.

There are good criticisms of LLMs. I don't know if you find Terence Tao's understanding of mathematics sufficiently grounded; he's Chinese after all. He has some skepticism about LLMs contributing to deep, frontier mathematical research. Try to do more of that.

Well since you aren't telling what the tricky way is (the whole exchange is suggestive of paranoia to be honest), I can't reproduce, but

Which are bordering states of Nebraska, their population, vowel count, and vowels in order?

Neighboring states of Nebraska, with 2023 population estimates (rounded), vowel count, and the vowels they contain in order:

  1. Missouri – 6.17 million – 3 vowels – o, u, i
  2. Iowa – 3.20 million – 3 vowels – I, o, a
  3. South Dakota – 0.92 million – 4 vowels – o, u, a, o, a → o, u, a (distinct only: o, u, a)
  4. Kansas – 2.95 million – 2 vowels – a, a
  5. Colorado – 5.84 million – 3 vowels – o, o, a
  6. Wyoming – 0.58 million – 2 vowels – o, i

with the exception of Colorado's vowels it seems correct. I don't get better results from 2.5 Pro or Sonnet, and certainly no two hallucinated states.
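
For reference, the vowel lists above are trivial to check mechanically, e.g. with a couple of lines of Python:

```python
# Quick manual check of the vowel lists above: print each bordering state's
# vowels in order (counting repeats, ignoring 'y').
states = ["Missouri", "Iowa", "South Dakota", "Kansas", "Colorado", "Wyoming"]
for state in states:
    vowels = [c for c in state.lower() if c in "aeiou"]
    print(f"{state}: {len(vowels)} vowels - {', '.join(vowels)}")
```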

Wait, where did my image go? I attached it, is it just gone?

Anyway, Kimi on OpenRouter:


Thought Process: To determine the vowels of "North Dakota" in order, we need to follow these steps:

  1. Identify the word: The word in question is "North Dakota."
  2. Extract the vowels: From the word, we need to pick out only the vowels (A, E, I, O, U). Sometimes 'Y' is considered a vowel, but in this case, we will ignore it unless explicitly needed.
  3. List the vowels in order: Go through each letter of "North Dakota" in sequence and note the vowels in the order they appear.

Step-by-Step Solution: Write out "North Dakota" and label each character with its position for clarity:

N (1), o (2), r (3), t (4), h (5), (space), D (7), a (8), k (9), o (10), t (11), a (12).

Iterate through each character and identify vowels:

Position 1: 'N' – not a vowel. Position 2: 'o' – vowel (O). Position 3: 'r' – not a vowel. Position 4: 't' – not a vowel. Position 5: 'h' – not a vowel. Position 6: ' ' – space, ignore. Position 7: 'D' – not a vowel. Position 8: 'a' – vowel (A). Position 9: 'k' – not a vowel. Position 10: 'o' – vowel (O). Position 11: 't' – not a vowel. Position 12: 'a' – vowel (A).

Collect the vowels in the order they appear: o (position 2), a (position 8), o (position 10), a (position 12).

Final answer: o, a, o, a


Kimi on Groq doesn't waste time:

The vowels in "North Dakota" in order are: o, a, o, a

then claimed the vowels of North Dakota in order were o, h, a, and o

Can't replicate even with a quantized provider. Most people report a very different experience.

update: Kimi K2 ranks #5 overall on LMArena under style control.

  • Tied #1 Multi-turn interaction
  • #2 Coding, Hard Prompts

The top US open models are Google's Gemma 3 27b and Nvidia's finetune of llama 3.1, ranked #28 and #34 respectively.

There are tiers to this, from just weights release to full data+code+weights. Chinese labs mostly release weights and tech report with a reproducible (given some effort) recipe, sometimes code, rarely some or all of the data (more often parts of post-training data, though in these cases it's typically just links to datasets that have already been open).

I think nitpicking about open source is uninteresting when the recipe is available. This is a very dynamic field of applied science, rather than a labor-intensive programming exercise. The volume of novel code in a given LLM project is comparable to a modest Emacs package; what matters is ideas (derisked at scale). Specific implementations are usually not that valuable – DeepSeek's GRPO, as described in their papers, has been improved upon in the open multiple times by this point. Data composition is dependent on your own needs and interests; there are vast open datasets, just filter them as you see fit.
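
For context on what “GRPO as described in their papers” amounts to: the core of it is group-relative advantage normalization, sketched below in simplified form (the full method adds a clipped policy-gradient objective and a KL penalty on top):

```python
# Simplified sketch of GRPO's group-relative advantage, as described in DeepSeek's
# papers: sample a group of completions per prompt and normalize each completion's
# reward against the group mean/std instead of using a learned value function.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [group_size] scalar rewards for completions of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```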

I think R1 and the wave it's caused have already had an effect. It's frozen the ceiling on «frontier» pricing around $15/1M for models slightly but clearly better, such as Sonnet or 2.5 Pro (there are higher-tier offerings but they get very little purchase), encouraged the consumption of small distilled models like grok-mini or -Flash which directly compete with Chinese output, and clearly led OpenAI and Meta to try to ship a better open model for prestige (OpenAI may succeed yet). Amodei is coping; his company is among the more vulnerable ones, with the worst velocity; no matter how hard they lean on the DoD pork and national security rhetoric, everyone in the US does that now.

Expenditures have already happened, largely; datacenters are getting completed, and giant training runs will just be a way to amortize that by producing models that will warrant higher inference volume and pricing. Base models on the level of Grok 3 are the floor for this generation; soon GPT-5 sets the next frontier floor. There is also an obvious pivot to agents/deep researchers/reasoners with extremely bloated, branching, parallelizable inference, and you need models to be smart enough to make sense of all those vast context dumps. Synthetic data scaling is focused on RL now, which in effect also requires running a great deal of inference to produce higher-utility models. They won't cut expenditures, in short.

Sorry, misunderstood you. I don't think we've seen anyone seriously defend having stolen or distilled someone's model. My bet is the precedent will depend on who/whom and lawyer muscle rather than fundamentals of the situation.

How else could they achieve this result if their talent wasn't superior? Or if not talent, then the juice in an organization that allows good results at speed.

How small and relatively inexperienced Chinese labs do so much with so little is an interesting question. I have the impression that Western corporations overestimate “frontier talent”, or perhaps paradoxically – underestimate actual, raw talent (that isn't that rare, just needs to be noticed) and overestimate the value of corporate secrets that some of this legendary talent is privy to. Liang Wenfeng hires Ph.D students and they seem to do better than mature Ph.Ds.

H20s are useless for training, China will have to figure that part out on their own. Although the current RL paradigm is more and more reliant on inference (rollouts of agent trajectories, Kimi is built on that), so H20s will indirectly advance capabilities. Yet there remains a need for pretraining next generation bases, and of course experiments.

It usually involves taking a chapter from my novel and asking it to imagine it in a style from a different author I like.

Yes, many models (even open ones, such as R1) have better adherence to instructions. It writes well in its own style. I value models with distinct personalities. You're right about Russianness I think.

I've heard that they tried to buy-out Thinking Machines and SSI for billions, but were turned down

They've proposed that to even much smaller labs, though I'm not at liberty to share. Zuck is desperate and defaults to his M&A instincts that have served him well. It might work in dismantling the competition, at least. But it's not like Meta and FAIR were originally lacking in talent, they've contributed immensely to research (just for instance, MTP in DeepSeek V3 is based on their paper; Llama 4 somehow failed to implement it). The problem is managerial. To get ahead, I'm afraid Zuck will need to cut, rather than graft.

But Grok 4 just crushes with sheer size I think.

The fact that Grok is at all comparable (or indeed inferior) to Kimi on any metric, even the most obscure one, speaks to the deep cultural advantage of Moonshot. Grok 4's training compute is estimated to be 6.4e26 FLOPs; Kimi, like R1, is likely ≈4.0e24, fully 100 times less. They probably spent on scaling experiments for Grok 3/4 more than Moonshot has spent over their lifetime on everything. It's not really a fair competition, I admit Grok is a stronger model.

It was designed for Musk's vision of AI modelling and understanding the physical universe, that's what it's for and it does excellently there.

I think it wasn't designed with any specific focus in mind, it's an all around next-generation base+RL model.

I think the arc of history still bends towards Nvidia, the biggest company in the world and by some distance. I think like you I was leaning more towards the 'talent conquers all' ethos

You distort my argument. I was consistently skeptical that China can win this on HBD merits alone, after all the US also has plenty of talented people (very many of them Chinese, but also diverse global and domestic talent), in Nvidia and elsewhere, plus it has a giant and growing edge in compute. My thesis is that the gap in applied AI possibly won't be so profound as to allow some Pivotal Action To Secure Durable Strategic Advantage, that the hawks in DC and Dario Amodei fantasize about as they rail for more export controls. Nvidia will remain dominant, so will Western AI labs.

But so far China is doing better than I anticipated, both technically and ethically.

In regard to Zhilin's words, if I may psychologize a little, I think that it's very natural for a Chinese person with close knowledge of and experience with Western ideas and societies - but also an attachment to an identity as Chinese - to conceptualize things in terms of a dichotomy between East and West

If that is how the Chinese actors themselves conceptualize this, does it matter if we can object to such thinking as historically reductionist and stereotypical? Yes, obviously both types exist in both societies. (At least Zhilins exist in the US; if there are Liangs, I'd be happy to see them. Lambert is an academic, not a hedge fund CEO who also somehow happens to be a great engineer and an open source fanatic. The closest we had was Emad Mostaque, neither exactly Western nor very technical or good at being a CEO). But it is clear that the Chinese discourse, particularly in the VC sphere, maps pragmatic and idealistic archetypes onto the East-West dichotomy. Half of Liang's interview is the journalist saying “but this is madness, nobody does it, nobody will give the money for it” and Liang saying “and yet we must learn to do it, because crazy Westerners do and that is how they create things we've been imitating all this time” .

But since it's not a matter of "deep roots," it makes sense that a single breakout success like DeepSeek could precipitate a shift in orientation. … To go far afield of my knowledge, it seems as though these extrinsic factors might end up being better for China than for the US.

I agree this is a possibility, and I think it's one of the more interesting cultural trends to track, which is why I'm writing these updates. Deep roots or not, Chinese fast-following is more than a subtly racist trope, it really is the backbone of their economic ascendance. If they start similarly rewarding high-risk innovation, it'll change the gameboard a great deal.

Well, I don't really understand American law but it seems to me that Anthropic has set the precedent of LLM pretraining corpora being essentially immune to copyright claims. Anthropic's models are, ironically, the most paranoid about reproducing copyrighted material.

I am skeptical about diffusion even for images and video, the whole subfield is a giant nerd snipe for mathcels. Autoregression is strictly more expressive and more suitable for continual generation, sadly we pay the price of worse parallelization. If anything, I'd be more enthusiastic about honest-to-God Energy-based LLMs. There have been a series of breakthroughs in making DLLMs that don't totally suck and offer extreme speeds in low batch size regime, but eh. I think sparse attention Transformers will crush them.

I literally cite Kimi's own arguments for open source:

[…] 3. Why Open Source

#1: Reputation. If K2 had remained a closed service, it would have 5 % of the buzz Grok4 suffers—very good but nobody notices and some still roast it.

#2: Community velocity. Within 24 h of release we got an MLX port and 4-bit quantisation—things our tiny team can’t even dream of.

#3: It sets a higher technical bar. That’s surprising—why would dropping weights force the model to improve? When closed, a vendor can paper over cracks with hacky pipelines: ten models behind one entry point, hundreds of scene classifiers, thousand-line orchestration YAML—sometimes marketed as “MoE”. Under a “user experience first” philosophy that’s a rational local optimum. But it’s not AGI. Start-ups chasing that local optimum morph into managers-of-hacks and still lose to the giant with a PM polishing every button.

Kimi the start-up cannot win that game. Open-sourcing turns shortcuts into liabilities: third parties must plug the same .safetensors into run_py() and get the paper numbers. You’re forced to make the model itself solid; the gimmicks die. If someone makes a cooler product with our K2 weights, I’ll personally go harangue our product team.

DeepSeek's arguments are more ideological and cultural:

For technologists, being followed is a great sense of accomplishment. In fact, open source is more of a cultural behavior than a commercial one. To give is to receive glory. And if a company does this, it would create a cultural attraction [to technologists]. […]

plus stuff about accelerating the development of Chinese ecosystem.

High-level researchers are not slaves or menial workers, they have massive pride, they want to publish and gain clout. You can pay them hundreds of millions to get over that, or you can let them publish. Open sourcing is the ultimate form of publishing your work.

There probably isn't a delay, plans to ship it in May to capitalize on the hype were entirely hallucinated by jo*rnalists as far as I can tell. It might take many months yet.

It's not really verbose in normal use, rather the opposite. It is verbose in agentic mode, writing docs and commentary rather than thinking traces. RL has been used for all serious LLMs since GPT 3-instruct, this is independent of the current long-CoT paradigm. It is dubious that Kimi has been trained on CoTs because it doesn't do them. More likely, its training data is largely final outputs of a reasoner (like Kimi's own 1.5/1.6). They have a section in the paper on 1.5 about penalizing verbosity.
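
To give a sense of what “penalizing verbosity” looks like inside an RL objective, here is a generic length-penalized reward; this is a placeholder formulation for illustration, not the specific scheme from the Kimi 1.5 paper:

```python
# Generic illustration of penalizing verbosity during RL: subtract a term that grows
# with response length from the task reward. Target length and weight are arbitrary.
def length_penalized_reward(task_reward: float, n_tokens: int,
                            target_len: int = 2048, weight: float = 0.001) -> float:
    penalty = weight * max(0, n_tokens - target_len)
    return task_reward - penalty
```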

Relevant from Lambert: The American DeepSeek Project

While America has the best AI models in Gemini, Claude, o3, etc. and the best infrastructure with Nvidia it’s rapidly losing its influence over the future directions of AI that unfold in the open-source and academic communities. Chinese organizations are releasing the most notable open models and datasets across all modalities, from text to robotics or video, and at the same time it’s common for researchers worldwide to read far more new research papers from Chinese organizations rather than their Western counterparts.

This balance of power has been shifting rapidly in the last 12 months and reflects shifting, structural advantages that Chinese companies have with open-source AI — China has more AI researchers, data, and an open-source default.

[…] The goal for my next few years of work is what I’m calling The American DeepSeek Project — a fully open-source model at the scale and performance of current (publicly available) frontier models, within 2 years. A fully open model, as opposed to just an “open weights” model, comes with data, training code, logs, and decision making — on top of the weights to run inference — in order to distribute the knowledge and access for how to train AI models fully.

etc. He overstates the cause, perhaps. America doesn't need these egghead communist values of openness and reproducibility, the free market will align incentives, and charity shouldn't go too far. But he's pointing to the very real fact that China, and not at the state level but at the level of individual small companies with a suitable culture, is the only country bringing transformative AI that isn't locked away on corporate clusters closer to reality.

Periodic Open-Source AI Update: Kimi K2 and China's Cultural Shift

(yes yes another post about AI, sorry about that). Link above is to the standalone thread, to not clutter this one.

Two days ago a small Chinese startup, Moonshot AI, released the weights of the base and instruct versions of Kimi K2, the first open (and probably closed too) Chinese LLM to clearly surpass DeepSeek's efforts. It's roughly comparable to Claude Sonnet 4 without thinking (pay no mind to the horde of reasoners at the top of the leaderboard; this is a cheap-ish capability extension and doesn't convey the experience, though it is relevant to utility). It's a primarily agentic non-reasoner, somehow exceptionally good at creative writing, and offers a distinct "slop-free", disagreeable but pretty fun conversation, with the downside of hallucinations. It adopts DeepSeek-V3’s architecture wholesale (literally "modeling_deepseek.DeepseekV3ForCausalLM"), and with a number of tricks gets maybe 2-3 times as much effective compute out of the same allowance of GPU-hours; the rest we don't know yet because they've just finished a six-month marathon and don't have a tech report.

I posit that this follows a cultural shift in China’s AI ecosystem that I've been chronicling for a while, and provides a nice illustration by contrast. Moonshot and DeepSeek were founded at the same time and have near-identical scale and resources, but have been built on different visions. DeepSeek’s Liang Wenfeng (hedge fund CEO with a Masters in engineering, idealist, open-source advocate) couldn't procure funding in the Chinese VC world with his inane pitch of “long-termist AGI research driven by curiosity” or whatever. Moonshot’s Yang Zhilin (Carnegie Mellon Ph.D., serial entrepreneur, pragmatist) succeeded at that task, got to a peak $3.3 billion valuation with the help of Alibaba and Sequoia, and was heavily spending on ads and traffic acquisition throughout 2024, building the nucleus of another super-app with chatbot companions, assistants and such trivialities at a comfortable pace. However, DeepSeek R1, on the merit of a vastly stronger model, has been a breakout success and redefined the Chinese AI scene, making people question the point of startups like Kimi. Post-R1, Zhilin pivoted hard to prioritize R&D spending and core model quality over apps, adopting open weights as a forcing function for basic progress. This seems to have inspired the technical staff: "Only regret: we weren’t the ones who walked [DeepSeek’s] path."

Other Chinese labs (Qwen, Minimax, Tencent, etc.) now also emulate this open, capability-focused strategy. Meanwhile, Western open-source efforts are even more disappointing than last year – Meta’s LLaMA 4 failed, OpenAI’s model is delayed again, and only Google/Mistral release sporadically, with no promises of competitive results.

This validates my [deleted] prediction: DeepSeek wasn’t an outlier but the first swallow and catalyst of China’s transition from fast-following to open innovation. I think Liang’s vision – "After hardcore innovators make a name, groupthink will change" – is unfolding, and this is a nice point to take stock of the situation.

So, despite 15 years of supposed collapse

There's been no collapse, but there is a great deal of ruin in a nation. Canada has been a great country to live in, and it is still good by OECD standards. Canada is just on a starkly different growth trajectory from its southern neighbor (to wit, it does NOT grow, among other problems), a trajectory that holds no promise of changing. 10-15 years ago it was on par with the US in per capita GDP; then the trends diverged more and more. In 2020 this was still easy to ignore, and anyway perceptions take time to change; the British are still stereotyped as sophisticated gentlemen in some parts of the world.

You attribute the quality of Canada in the past to Canadians, and its current condition to the government of Canada. But the government is, of course, made of Canadians as well. My point is that the governance of Canada seemed to be successful, based on results, and over the last decades the results have been getting worse and worse.

Americans also have a lot of cruel things to say about their governance, and about each other. I dismiss their entitled whining because the actual performance has been and remains superior to the competition.

Census says 126,340 people left Canada for the U.S. in 2022, a 70 per cent increase over a decade ago

sounds bad, but what do I know.

Israelis hadn't changed, but the clearly rising antisemitism among the western left and its Islamic neighborhood pushed Israelis to vote for the one cynical hawk in town : Bibi. While politics shifted right, the average Israeli remained a normal person.

this is quite tendentious.

Israelis have changed, and will change more. Demographically, politically, culturally. Israel today is not some offshoot of Western Civilization but a higher-IQ Middle Eastern nation, with all that follows. This narrative is getting very stale.