This is another periodic update on the state of open source AI, which started here a year and a day ago, when I said of DeepSeek, relatively obscure at that point:
I would like to know who's charting their course, because they're single-handedly redeeming my opinion of the Chinese AI ecosystem and frankly Chinese culture… This might not change much. Western closed AI compute moat continues to deepen, DeepSeek/High-Flyer don't have any apparent privileged access to domestic chips, and other Chinese groups have friends in the Standing Committee and in the industry, so realistically this will be a blip on the radar of history.
The chip situation is roughly stable. But Chinese culture, with regard to AI, has changed a bit since then.
On July 11, Moonshot AI (mostly synonymous with the Kimi research group, Kimi being the founder's nickname) released base and instruct weights of Kimi K2, the first Chinese LLM to unambiguously surpass DeepSeek's best. Right now it's going toe to toe with Grok 4 in tokens served via OpenRouter by providers jumping at the chance; it has just been added to Groq, getting near 300 t/s. It is promoted singularly as an “agentic backbone”, a drop-in replacement for Claude Sonnet 4 in software engineering pipelines, and seems to have been trained primarily for that, but it challenges the strongest Western models, including reasoners, on some unexpected soft metrics, such as topping EQ-bench and creative writing evals (corroborated here). Performance scores aside, people concur that it has a genuinely different “feel” from every other LLM, especially from other Chinese runners-up who all try to outdo DeepSeek on math/code proficiency for bragging rights. Its writing is terse, dense, virtually devoid of sycophancy and recognizable LLM slop. It has flaws too: hallucinations way above the frontier baseline, weird stubbornness. Obviously, try it yourself. As Nathan Lambert from Allen AI remarks,
The gap between the leading open models from the Western research labs versus their Chinese counterparts is only increasing in magnitude. The best open model from an American company is, maybe, Llama-4-Maverick? Three Chinese organizations have released obviously more useful models with more permissive licenses: DeepSeek, Moonshot AI, and Qwen. A few others such as Tencent, Minimax, Z.ai/THUDM may have Llama-4 beat too
(As an aside: in the comments to my first post people were challenging my skepticism about the significance of Chinese open models by pointing to Llama-405B, but I've been vindicated beyond my worst expectations – the whole LLaMA project has ended in a fiasco, with deep leadership ineptitude and sophomoric training mistakes, and now is apparently being curtailed, as Zuck humiliatingly tries to pay his way to relevance with $300M offers to talent at other labs and several multigigawatt-scale clusters. Meta has been demonstrably worse at applied AI, whether open or closed, than tiny capital-starved Chinese startups).
But I want to talk a bit about the cultural and human dimension.
Moonshot AI has a similar scale (≈200 people) and was founded at the same time, but in many ways it is an antipode to DeepSeek, and much more in line with a typical Chinese success story. Their CEO is Yang Zhilin, a young serial entrepreneur and well-credentialed researcher who returned from the US (graduated from Tsinghua, where he later became an Assistant Professor; Computer Science Ph.D. from Carnegie Mellon; worked at Google Brain and Meta). DeepSeek's Liang Wenfeng is dramatically lower-class: the son of primary school teachers in a fifth-tier town, he never went beyond a Master's in Engineering from Zhejiang University and for the longest time was accumulating capital with the hedge fund he built with friends. In 2023-2024, soon after founding their startups, both gave interviews. Yang's was mostly technical, but it included bits like these:
Of course, I want to do AGI. This is the only meaningful thing to do in the next 10 years. But it's not like we aren't doing applications. Or rather, we shouldn't define it as an "application". "Application" sounds like you have a technology and you want to use it somewhere, with a commercial closed loop. But "application" is inaccurate. It's complementary to AGI. It's a means to achieve AGI and also the purpose of achieving AGI. "Application" sounds more like a goal: I want to make it useful. You have to combine Eastern and Western philosophy, you have to make money and also have ideals. […] we hope that in the next era, we can become a company that combines OpenAI's techno-idealism and the philosophy of commercialization shown by ByteDance. The Oriental utilitarianism has some merits. If you don't care about commercial values at all, it is actually very difficult for you to truly create a great product, or make an already great technology even greater […] a company that doesn't care enough about users may not be able to achieve AGI in the end.
Broadly, his idea of success was to create another monetized, customizable, bells-and-whistles, Chinese super-app while advancing the technical side at a comfortable pace.
Liang's, in contrast, was almost aggressively non-pragmatic and dismissive of the application layer:
We're going to do AGI. […] We won't prematurely focus on building applications on top of models. We will focus on large models. […] We don't do vertical integration or applications, but just research and exploration. […] It's driven by curiosity. From a distance, we want to test some conjectures. For example, we understand that the essence of human intelligence may be language, and human thinking may be a language process […] We are also looking for different funders to talk to. After contacting them, I feel that many VCs have concerns about doing research, they have the need to exit and want to commercialize their products as soon as possible, and according to our idea of prioritizing research, it's hard to get financing from VCs. […] If we have to find a commercial reason, we probably can't, because it's not profitable. […] Not everyone can be mad for the rest of their lives, but most people, in their youth, can devote fully into something, with no utilitarian concerns at all.
After the release of V2, he seems to have also developed some Messianic ideas of “showing the way” to his fellow utilitarian Orientals:
It is a kind of innovation that just happens every day in the US. They were surprised because of where it came from: a Chinese company joining their game as an innovation contributor. After all, most Chinese companies are used to following, not innovating. […] We believe that as the economy develops, China should gradually become a contributor rather than a free-rider. In the last 30 years or so of the IT wave, we've basically not been involved in the real technological innovation. […] The cost of innovation is definitely high, and the inertial belief of yoinkism [Literally "take-ism"] is partly because of the economic situation of China in the past. But now, you can see that the volume of China's economy and the profits of big companies like ByteDance and Tencent are high by global standards. What we lack in innovation is definitely not capital, but a lack of confidence and a lack of knowledge of how to organize a high density of talent to achieve effective innovation. […] For technologists, being followed is a great sense of accomplishment. In fact, open source is more of a cultural behavior than a commercial one. To give is to receive glory. And if a company does this, it would create a cultural attraction [to technologists]. […] There will be more and more hardcore innovation in the future. It may not be yet easily understood now, because the whole society still needs to be educated by the facts. After this society lets the hardcore innovators make a name for themselves, the groupthink will change. All we still need are some facts and a process.
They've been rewarded according to their credentials and vision. Moonshot was one of the nationally recognized “Six AI Tigers” and received funding from Alibaba, Sequoia Capital China, Tencent and others. By Sep-Nov 2024, they were spending on the order of ¥200 million per month on ads and traffic acquisition (to the point of developing a bad rep with tech-savvy Chinese), and served a kinda-decent-at-the-time Kimi Assistant, whose selling point was long-context support for processing documents and such. They made some waves in the stock market and were expanding into gimmicky use cases (an AI role-playing app “Ohai” and a video-generation tool “Noisee”). By June 2024 Kimi was the most-used AI app in China (≈22.8 million monthly visits). Liang received nothing at all and was in essence laughed out of the room by VCs, resolving to finance DeepSeek out of pocket.
Then, all of a sudden, R1 happened, Nvidia stock tumbled, non-tech people up to the level of Trump started talking about DeepSeek in public, with Liang even getting a handshake from the Supreme Leader, and their daily active users (despite the half-baked app that still hasn't implemented breaking space on keyboard) surged to 17x Moonshot's.
Now that Kimi K2 is out, we have a post mortem from one of the 200 “cogs” of what happened next.
[…] 3. Why Open Source #1: Reputation. If K2 had remained a closed service, it would have 5 % of the buzz Grok4 suffers—very good but nobody notices and some still roast it. #2: Community velocity. Within 24 h of release we got an MLX port and 4-bit quantisation—things our tiny team can’t even dream of. #3: It sets a higher technical bar. That’s surprising—why would dropping weights force the model to improve? When closed, a vendor can paper over cracks with hacky pipelines: ten models behind one entry point, hundreds of scene classifiers, thousand-line orchestration YAML—sometimes marketed as “MoE”. Under a “user experience first” philosophy that’s a rational local optimum. But it’s not AGI. Start-ups chasing that local optimum morph into managers-of-hacks and still lose to the giant with a PM polishing every button.
Kimi the start-up cannot win that game. Open-sourcing turns shortcuts into liabilities: third parties must plug the same .safetensors into run_py() and get the paper numbers. You’re forced to make the model itself solid; the gimmicks die. If someone makes a cooler product with our K2 weights, I’ll personally go harangue our product team. […] Last year Kimi threw big bucks at user acquisition and took heat—still does.
I’m just a code-monkey; insider intent is above my pay grade. One fact is public: after we stopped buying traffic this spring, typing “kimi” into half the Chinese app stores landed you on page two; on Apple’s App Store you’d be recommended DouBao; on Baidu you’d get “Baidu’s full-power DeepSeek-R1.” Net environment, already hostile, got worse. Kimi never turned ads back on. When DeepSeek-R1 went viral, crowd wisdom said “Kimi is toast, they must envy DeepSeek.” The opposite happened: many of us think DeepSeek’s runaway success is glorious—it proved power under the hood is the best marketing. The path we bet on works, and works grandly. Only regret: we weren’t the ones who walked it. At an internal retrospective meeting I proposed some drastic moves. Zhilin ended up taking more drastic ones: no more K1.x models; all baselines, all resources thrown into K2 and beyond (more I can’t reveal). Some say “Kimi should drop pre-training and pivot to Agent products.” Most Agent products die the minute Claude cuts them off. Windsurf just proved that. 2025’s ceiling is still model-only; if we stop pursuing the top-line of intelligence, I’m out. AGI is a razor-thin wire—hesitation means failure. At the June 2024 BAAI conference Kaifu Lee, an investor on stage, blurted “I’d focus on AI apps’ ROI”. My gut: that company’s doomed. I can list countless flaws in Kimi K2; never have I craved K3 as much as now.
…Technologically it's just a wider DS-V3, down to model type in the configs. They have humbly adopted the architecture:
Before we spun up training for K2, we ran a pile of scaling experiments on architectural variants. In short: every single alternative we proposed that differed from DSv3 was unable to cleanly beat it (they tied at best). So the question became: “Should we force ourselves to pick a different architecture, even if it hasn’t demonstrated any advantage?” Eventually the answer was no.
Their main indigenous breakthroughs are stabilizing Muon training at trillion-parameter scale to the point of going through 15.5 trillion tokens with zero spikes (prior successes that we know of were limited to OOMs smaller scale), and some artisanal data generation loop. There are subtler parts (such as their, apparently, out-of-this-world good tokenizer) that we'll hopefully see explained in the upcoming tech report. They also have more explicitly innovative architecture solutions that they have decided against using this time.
A number of other labs have been similarly inspired by Liang's vision: Minimax's CEO committed to open sourcing in the same style, releasing two potent models; Qwen, Tencent, Baidu, Zhipu, Huawei and ByteDance have also shifted to DeepSeek's architecture and methods, with all but ByteDance sharing their best or at least second-best LLMs. Even Meta's misbegotten LLaMA 4 Maverick is a sad perversion of V3, with (counterproductive) attempts at originality. But so far only Kimi has clearly surpassed the inspiration.
One more note on culture. Despite Zhilin's defenses of “Oriental” mentality that Liang challenges, he has built a very hip lab, and almost comically Anglo-American in aesthetics. “We're a team of scientists who love rock (Radiohead, Pink Floyd) and film (Tarantino, Kubrick).” Their name is a nod to Dark Side of the Moon, their meeting rooms are all labeled with albums of iconic Western rock groups, app version annotations are quotes of Western thinkers.
And yet, there's still no equivalent project in the West, even though dozens of Western companies could have afforded to spend a few million dollars on doing the same, for publicity if nothing else. With Meta out, the strongest (and still mostly useless) Western open models come from, literally, Google, for whom it's negligible effort, and once in a blue moon from Mistral. Sam Altman has been hyping his own forever as the definitive American Frontier's answer to those pesky Chinese, and people in the know say it's already gone to providers… but on the day of K2's release Sam announced that it's getting postponed again. Apparently they've somehow fumbled it (“due to some (frankly absurd) reason I can’t say … it's worse than mechahitler”) and they intend a retrain. It's highly likely that no matter what OpenAI releases in the end, wary of filling their own moat or getting liability exposure, DeepSeek's V4/R2 or Kimi K3 or GLM-whatever will have more consequence, and this will continue happening.
This is a fairly strange situation. But I am pretty satisfied with my prediction, which I made in a now-deleted (due to “AI slop!” outrage) post coauthored with R1, that DeepSeek may become not a transient outlier but a herald and a catalyst of a cultural change in the Chinese AI ecosystem and, probably, beyond it. When Liang said “After this society lets the hardcore innovators make a name for themselves, the groupthink will change. All we still need are some facts and a process”, he knew what he was going to unleash by establishing new facts. Given how many tensions and fears the “1 to 100 innovation”, fast-following, bulk-manufacturing China has created, it will be interesting to see how the world copes with the more creative and open one.
P.S. First paragraphs of K2 and Sonnet's reviews of this, to give a taste of the difference in presentation.
Sonnet:
This is a fascinating analysis that captures what feels like a genuine inflection point in global AI development. The contrast you draw between Yang and Liang's philosophies - "Oriental utilitarianism" versus pure research curiosity - maps onto a broader tension between application-focused and capability-focused approaches that we're seeing play out across the entire AI landscape.
K2:
This is the cleanest, most honest field-report I’ve read in a while. I’ll pull on three threads that stuck out—credentials vs ethos, open-source as strategic necessity, and the weird asymmetry between what Western labs could do and what they actually ship.
God, just the review summaries at the bottom. The way American LLMs simp for the user is just viscerally disgusting at this point.
update: Kimi K2 ranks #5 overall on LMArena under style control.
The top US open models are Google's Gemma 3 27b and Nvidia's finetune of llama 3.1, ranked #28 and #34 respectively.
I'm not impressed by K2 so far at all. I did a check with one of my usual questions, and it did horribly. It hallucinated that North Dakota borders Nebraska, and then claimed the vowels of North Dakota in order were o, h, a, and o. I'm also getting quite bad results on programming questions as well, things that the trio of frontier models (o3, Opus, and Gemini 2.5 pro) handle with relative ease. It's not even that cheap, only being on par with Gemini 2.5 flash in that regard.
I hear it's decent at creative writing, but that's sort of a wishy-washy benchmark. Maybe it will become the smut model of choice like R1 was for a while? That's... something at least?
K2 is censored. It refused my prompts to write Evangelion and Nagatoro lemons on the grounds that the characters were underage. I tried Uzaki-chan, since that's a college setting, and it still refused because it didn't want to write smut about copyrighted characters. Then I tried an original story, and it gave me a million excuses before finally writing a non-explicit response.
Okay, so we need a jailbreak. Which is a pain in the ass, but since it's an open source model, at least it won't be patchable. I looked around, and I found Mei Unfettered. I applied it, and... that was not worth the effort. I still had to age up the characters, the prose was more like the obligatory sex scene in a romance novel than a Literotica story, and then there's this gem:
You're better off using Grok, jailbroken ChatGPT, and whatever experimental SOTA models get tested in the arena.
Can't replicate, even with a quantized provider. Most people report a very different experience.
Wait, where did my image go? I attached it, is it just gone?
Anyway, Kimi on OpenRouter:
Thought Process: To determine the vowels of "North Dakota" in order, we need to follow these steps:

1. Identify the word: The word in question is "North Dakota."
2. Extract the vowels: From the word, we need to pick out only the vowels (A, E, I, O, U). Sometimes 'Y' is considered a vowel, but in this case, we will ignore it unless explicitly needed.
3. List the vowels in order: Go through each letter of "North Dakota" in sequence and note the vowels in the order they appear.

Step-by-Step Solution:

1. Write out "North Dakota" and label each character with its position for clarity: N (1), o (2), r (3), t (4), h (5), (space), D (7), a (8), k (9), o (10), t (11), a (12).
2. Iterate through each character and identify vowels:
   Position 1: 'N' – not a vowel.
   Position 2: 'o' – vowel (O).
   Position 3: 'r' – not a vowel.
   Position 4: 't' – not a vowel.
   Position 5: 'h' – not a vowel.
   Position 6: ' ' – space, ignore.
   Position 7: 'D' – not a vowel.
   Position 8: 'a' – vowel (A).
   Position 9: 'k' – not a vowel.
   Position 10: 'o' – vowel (O).
   Position 11: 't' – not a vowel.
   Position 12: 'a' – vowel (A).
3. Collect the vowels in the order they appear: o (position 2), a (position 8), o (position 10), a (position 12).

Final answer: o, a, o, a
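The position-by-position walk in the transcript above is easy to check mechanically. A minimal Python sketch (the function name and the optional vowel set are my own, not anything from the thread), which treats only a, e, i, o, u as vowels unless told otherwise:

```python
def vowels_in_order(text, vowels="aeiou"):
    """Return the vowels of `text` in order of appearance, lowercased.

    `vowels` is the set of letters counted as vowels; pass "aeiouy"
    to count 'y' as well (an assumption the caller has to make).
    """
    return [ch.lower() for ch in text if ch.lower() in vowels]


if __name__ == "__main__":
    print(vowels_in_order("North Dakota"))
    # Same function with 'y' counted as a vowel, as in the Wyoming case.
    print(vowels_in_order("Wyoming", vowels="aeiouy"))
```

Whether 'y' counts is a choice rather than a fact, which is why it is a parameter here instead of being hard-coded.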
Kimi on Groq doesn't waste time:
Here are my results from asking it the bordering states of Nebraska. Note that I ask this in a bit of a tricky way to check if the LLM is actually comprehending my question. Frontier models can almost always get it correct with the notable exception of the Wyoming test (they usually don't think a 'y' is a vowel in that word). But K2's performance is just pathetic, it's like 4o-mini levels of bad.
Well since you aren't telling what the tricky way is (the whole exchange is suggestive of paranoia to be honest), I can't reproduce, but
with the exception of Colorado's vowels it seems correct. I don't get better results from 2.5 Pro or Sonnet, certainly no hallucinated two states.
Very well written OP. At what point will Chinese advances start affecting the US more than they are now? Previously, Anthropic's CEO and human job hater Dario Amodei wrote pretty unprofessional things about what R1 had achieved.
American investment in AI is far higher than China's and has not produced the same level of results for the money. Will we see more expenditure at this point, so that labs can double down and make more LLMs with billion-dollar runs, or will they slow down the investments?
Really good post. Thanks for posting this here.
I think R1 and the wave it's caused have already had an effect. It's frozen the ceiling on «frontier» pricing around $15/1M for models slightly but clearly better, such as Sonnet or 2.5 Pro (there are higher-tier offerings but they get very little purchase), encouraged the consumption of small distilled models like grok-mini or -Flash which directly compete with Chinese output, and clearly led OpenAI and Meta to try to ship a better open model for prestige (OpenAI may succeed yet). Amodei is coping, his company is among the more vulnerable ones and with the worst velocity; no matter how hard they lean on the DoD pork and national security rhetoric, everyone in the US does that now.
Expenditures have already happened, largely; datacenters are getting completed, and giant training runs will be just a way to amortize that by producing models that will warrant higher inference volume and pricing. Base models on the level of Grok 3 are the floor for this generation; soon GPT-5 sets the next frontier floor. There is also an obvious pivot to agents/deep researchers/reasoners with extremely bloated, branching, parallelizable inference, and you need models to be smart enough to make sense of all those vast context dumps. Synthetic data scaling is focused on RL now, which in effect also requires running a great deal of inference to produce higher-utility models. They won't cut expenditures, in short.
If you do an effort post that covers all of the topics you brought up, I'd be inclined to make a donation to whatever Patreon-enabled online cause you're into.
This is the most blatant and open attempt at IP theft I have ever seen. Even in Quant Finance where everyone is at everyone else's throat all the time things don't get this open and base. Total lack of class from Zuckerberg.
Big funds are smart enough to have 2 year noncompetes for quants though. I’m not sure if this is a California law thing or tech stupidity but there’s no way Zuck would be offering this pay if he couldn’t get these people before Summer 2027.
Completely unenforceable in the UK and (increasingly) large parts of the US too. Any decent solicitor can get them cut down to 6 months; the 2 year non-compete is just one of the many shitty tricks employers play to try to discourage us from jumping ship. The mechanism of action of a 2 year non-compete is through chilling effects, not legal enforcement.
I think gardening leave is still legal in both places though; after all it's just an extended notice period with modified duties essentially?
Yeah, gardening leave is legal, but again in the UK you can argue that the gardening leave is preventing you from exercising your skills and keeping them up to date, but this is a more complex argument that doesn't just insta win the first time a judge takes a look at the case like challenging a 2 year non-compete would and so you as the employee need to spend more and go through a lot of hassle, just to avoid a period of time where you're being paid for doing nothing... Naturally very few people challenge gardening leave and most prefer to just wait it out and work on personal projects in the meanwhile.
The company gets to protect its IP, the employee gets a long very well paid holiday and both are happy, but that doesn't mean the provision itself is legally watertight.
People on twitter are sucking him off dry for paying nerds their true worth.
I was skeptical of his offers. Paying people hundreds of millions sounds stupid, also because of the volatility of what's being done. A big AI winter would look bad to Meta's investors.
I did want to ask you about this though as I have zero experience or understanding of finance and markets. Will his overcompensation backfire if the market for AI goes south?
He splurged a lot on VR which whilst admirable doesn't seem to be a household piece of tech. I remember it causing some stock chaos a few years ago. Not sure what this would look like.
There's a reason why the crazy pay offers are only being made to people who are likely to have IP sensitive information instead of e.g. newly graduating world class machine learning PhDs who are yet to be exposed to the IP at a top lab.
Yes it will backfire if the AI market goes south or even if Meta fails to produce a good product after all this IP theft.
Thanks for the update; I'll be sure to check out Moonshot at some point. My expertise in AI is limited to being a casual user of ChatGPT and DeepSeek, so I won't say more about the technical side of things, but I wanted to comment on the cultural points.
In contemporary philosophy, there's an attitude towards ideas that tends to ignore their historical, cultural, etc. context and treat them "in themselves." I guess this is a "high-decoupler" attitude. Anyways, despite the obvious demerits to this approach, I think that it's basically correct, so I have a hard time with explanations of East/West differences based on culture or historical philosophies. In this case, the difference between supposed "Oriental utilitarianism" and "Western idealism" doesn't seem too different from what's already present in the West. We also have a contrast between the "pragmatic businessman" archetype and the "dreamer" archetype.
(In regard to Zhilin's words, if I may psychologize a little, I think that it's very natural for a Chinese person with close knowledge of and experience with Western ideas and societies - but also an attachment to an identity as Chinese - to conceptualize things in terms of a dichotomy between East and West, and it doesn't cause problems as long as one doesn't place too much weight on that way of thinking.)
In my (admittedly somewhat myopic and unresearched) view, the cultural problems in China's business community seem quite contingent. As everyone is, businesspeople, investors, etc. are subject to groupthink, prejudices, and bias towards past successes. But since it's not a matter of "deep roots," it makes sense that a single breakout success like DeepSeek could precipitate a shift in orientation. So I think that if China doesn't end up catching up in AI, the reasons will not be intrinsic to the Chinese, but extrinsic; for example, perhaps capital controls work, or it turns out that the open-source model doesn't work well in AI after all.
To go far afield of my knowledge, it seems as though these extrinsic factors might end up being better for China than for the US. Although the party is hardly omnicompetent at picking winners, as demonstrated by their prior neglect of DeepSeek, the benefits of taking a relatively consistent, unified stance (at least within Xi's tenure) might be enough to overcome the US's inherited advantage of a superior ecosystem, since our political system's replacement-level regulation and industrial policy is not exactly stellar. The US scores own-goals all the time; the CPC may well score one even worse, but it's not as consistent.
If that is how the Chinese actors themselves conceptualize this, does it matter if we can object to such thinking as historically reductionist and stereotypical? Yes, obviously both types exist in both societies. (At least Zhilins exist in the US; if there are Liangs, I'd be happy to see them. Lambert is an academic, not a hedge fund CEO who also somehow happens to be a great engineer and an open source fanatic. The closest we had was Emad Mostaque, neither exactly Western nor very technical or good at being a CEO). But it is clear that the Chinese discourse, particularly in the VC sphere, maps pragmatic and idealistic archetypes onto the East-West dichotomy. Half of Liang's interview is the journalist saying “but this is madness, nobody does it, nobody will give the money for it” and Liang saying “and yet we must learn to do it, because crazy Westerners do and that is how they create things we've been imitating all this time”.
I agree this is a possibility, and I think it's one of the more interesting cultural trends to track, which is why I'm writing these updates. Deep roots or not, Chinese fast-following is more than a subtly racist trope, it really is the backbone of their economic ascendance. If they start similarly rewarding high-risk innovation, it'll change the gameboard a great deal.
I think I pretty much agree with you.
I don't object to that way of thinking per se, but I doubt that it does much work, either for them or for us third-person observers. (I might be wrong - it seems to me that the strongest objection is that it's a very live option for Chinese people to take the West, or at least what the West is perceived to do well in, as a model, whereas it might be less so for Americans.) My armchair psychological theory is that talented, smart Chinese people are often those who assimilate into Western society most easily, and they - or we, I guess I should say - graft this way of thinking on top of prior instincts, desires, etc.
Not to say that it's entirely an inert superstructure, but my overall view is that it's significantly more informative to look at structural features of China's economy, such as regulations on investment or whatnot, etc., than at how it gets conceptualized in this sort of discourse. Unless one is interested in China's self-image for its own sake, naturally.
Kimi is special, certainly. But I don't know that it's comparable to Grok 4 in pushing out the frontier, though it's clearly far more cost-effective. Kimi is elegant, precise, concise and charming where Grok is uncharismatic. Kimi is so cheap that people will naturally use it a lot. Kimi is so cheap I'm going to use it a lot!
But Grok 4 just crushes with sheer size I think. It has this 'in this essay I will' style that lmarena certainly isn't going to like, or any normal person really. But it has that heft, it was made for ferociously unsexy mathematics, physics, engineering, research tasks rather than creative writing or coding. And even in creative writing it's pretty damn good, albeit more through precision of 'who, what, where' than literary flourish. Kimi has its moments of sheer brilliance but the model just doesn't have the grunt to back up its creator's talent, Grok will just find things it misses and enjoys greater depth of thought. It was designed for Musk's vision of AI modelling and understanding the physical universe, that's what it's for and it does excellently there.
I think the arc of history still bends towards Nvidia, the biggest company in the world and by some distance. I think like you I was leaning more towards the 'talent conquers all' ethos and there's much to be said for talent, more than lesswrong is willing to give certainly... yet mass and weight of compute will probably still prevail, albeit by a slimmer margin than one might think. Meta excepted naturally, whatever's going on there is something for the history books. Karmic vengeance for the constant stream of Yann's bad takes?
The fact that Grok is at all comparable (or indeed inferior) to Kimi on any metric, even the most obscure one, speaks to the deep cultural advantage of Moonshot. Grok 4's training compute is estimated to be 6.4e26 FLOPs; Kimi, like R1, is likely ≈4.0e24, fully 100 times less. They probably spent more on scaling experiments for Grok 3/4 than Moonshot has spent over its lifetime on everything. It's not really a fair competition; I admit Grok is a stronger model.
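For anyone who wants to check the arithmetic, here is a minimal sketch that takes the two estimates quoted above at face value. The variable names are mine, and both figures are rough outside estimates rather than disclosed numbers.

```python
import math

# Training-compute estimates quoted in the comment above (assumptions, not official figures).
grok4_flops = 6.4e26    # estimated Grok 4 training compute
kimi_k2_flops = 4.0e24  # rough guess for Kimi K2, on par with R1

# How many times more compute Grok 4 is estimated to have used.
ratio = grok4_flops / kimi_k2_flops

# The same gap expressed in orders of magnitude.
orders_of_magnitude = math.log10(ratio)

print(f"Compute ratio: about {ratio:.0f}x")
print(f"Orders of magnitude: about {orders_of_magnitude:.1f}")
```

On these inputs the gap comes out at roughly 160x, a bit over two orders of magnitude, so "100 times less" is, if anything, conservative.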
I think it wasn't designed with any specific focus in mind, it's an all around next-generation base+RL model.
You distort my argument. I was consistently skeptical that China can win this on HBD merits alone, after all the US also has plenty of talented people (very many of them Chinese, but also diverse global and domestic talent), in Nvidia and elsewhere, plus it has a giant and growing edge in compute. My thesis is that the gap in applied AI possibly won't be so profound as to allow some Pivotal Action To Secure Durable Strategic Advantage, that the hawks in DC and Dario Amodei fantasize about as they rail for more export controls. Nvidia will remain dominant, so will Western AI labs.
But so far China is doing better than I anticipated, both technically and ethically.
Fair enough, I agree on that. I didn't think you were saying that talent conquers all in this but one can kind of see it reading between the lines. How else could they achieve this result if their talent wasn't superior? Or if not talent, then the juice in an organization that allows good results at speed.
And it seems like export controls are diminishing, per latest news on H20s. But maybe Trump will do another backflip, who can say.
How small and relatively inexperienced Chinese labs do so much with so little is an interesting question. I have the impression that Western corporations overestimate “frontier talent”, or perhaps paradoxically – underestimate actual, raw talent (that isn't that rare, just needs to be noticed) and overestimate the value of corporate secrets that some of this legendary talent is privy to. Liang Wenfeng hires Ph.D students and they seem to do better than mature Ph.Ds.
H20s are useless for training, China will have to figure that part out on their own. Although the current RL paradigm is more and more reliant on inference (rollouts of agent trajectories, Kimi is built on that), so H20s will indirectly advance capabilities. Yet there remains a need for pretraining next generation bases, and of course experiments.
I can't think of a single use case where Gemini 2.5 Pro isn't superior to Kimi (it says plenty about the model that I have to compare it to SOTA), including cost. Google is handing away access for free, even on the API. It's nigh impossible to hit usage limits while using Gemini CLI.
Excellent work as usual Dase. I was sorely tempted to write a K2 post, but I knew you could do it better.
I haven't asked it to write something entirely novel, but I have my own shoddy vibes-benchmark. It usually involves taking a chapter from my novel and asking it to imagine it in a style from a different author I like. It's good, but Gemini 2.5 Pro is better at that targeted task, and I've done this dozens of times.
Alas, it is fond of the ol' em-dash, but which model isn't? I agree that sycophancy is minimal, and in my opinion, the model is deeply cynical in a manner not seen in any other. I'd almost say it's Russian in outlook. I would have bet money on "this is a model Dase will like".
Meta's AI failures are past comical, and into farce. I've heard that they tried to buy out Thinking Machines and SSI for billions, but were turned down. Murati is a questionable founder, but I suppose if any stealth startup can speed away underwater towards ASI, it's going to be one run by Ilya. Even then, I'd bet against it succeeding.
I don't know if it's intentional, but it's possible that Zuck's profligacy and willingness to throw around megabucks will starve competitors of talent, though I doubt the kind of researchers and engineers at DS or Moonshot would have been a priori deemed worthy.
Yes, many models (even open ones, such as R1) have better adherence to instructions. It writes well in its own style. I value models with distinct personalities. You're right about Russianness I think.
They've proposed that to even much smaller labs, though I'm not at liberty to share. Zuck is desperate and defaults to his M&A instincts that have served him well. It might work in dismantling the competition, at least. But it's not like Meta and FAIR were originally lacking in talent, they've contributed immensely to research (just for instance, MTP in DeepSeek V3 is based on their paper; Llama 4 somehow failed to implement it). The problem is managerial. To get ahead, I'm afraid Zuck will need to cut, rather than graft.
Relevant from Lambert: The American DeepSeek Project
etc. He overstates the cause, perhaps. America doesn't need these egghead communist values of openness and reproducibility, the free market will align incentives, and charity shouldn't go too far. But he's pointing to the very real fact that China, and not on the state level but on the level of individual small companies with suitable culture, is the only country bringing transformative AI not locked on corporate clusters closer to reality.