faul_sname
Fuck around once, find out once. Do it again, now it's science.

Stack Overflow is better than most programmers at answering any particular programming question, and yet Stack Overflow cannot entirely replace development teams, because it cannot do things like "ask clarifying questions to stakeholders and expect that those questions will actually be answered". Similarly, an LLM does not expose the same interface as a human, and does not have the same affordances a human has.
I expect that "Stack Overflow" (i.e. a chat containing many SO users) could collectively place 175th in most programming competitions, and by that token be "the 175th best coder on earth, as measured by performance on competition-type problems".
Writing code is almost never the hard part of delivering value using code though.
Fine, if an LLM was actually the 174th best coder on Earth, and writing code is not the hard part of delivering value through code, we should be seeing LLMs being improved by people with next to no knowledge of programming, using LLMs to assist them.
Consider the following argument:
If my Lamborghini was actually the 175th-fastest-accelerating car on Earth, and accelerating to 60mph from a stop is not the slow part of my commute through gridlocked traffic, we should be seeing my commute become much faster because I have a fast-accelerating car.
This argument does not make sense, because accelerating from 0 to 60 is not a meaningful bottleneck on a commute through gridlocked traffic. Similarly, "being able to one-shot extremely tricky self-contained programming problems at 99.9th percentile speed" becoming cheap is not something that alleviates any major bottleneck the big AI labs face.
The basic algorithm underlying LLMs is very simple. Here's GPT-2 inference in 60 lines of not-very-complicated code. The equivalent non-optimized training code is similar in size and complexity. The open-source code that is run in production by inference and training providers is more complicated, but most of that complexity comes from performance improvements, or from compatibility requirements and the software and hardware limitations and quirks that come with those.
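To make "the basic algorithm is simple" concrete, here's a toy sketch of my own (random weights, a single attention head, ReLU instead of GELU, so a simplification of GPT-2 rather than the linked 60-line implementation):

```python
# Toy decoder-only transformer forward pass in plain NumPy.
# Sizes and parameter names are made up for illustration.
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_self_attention(x, w_qkv, w_out):
    seq, dim = x.shape
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)          # [seq, dim] each (single head)
    scores = q @ k.T / np.sqrt(dim)                     # [seq, seq]
    scores += np.triu(np.full((seq, seq), -1e9), k=1)   # causal mask: no attending to the future
    return softmax(scores) @ v @ w_out

def transformer_block(x, p):
    x = x + causal_self_attention(layer_norm(x), p["w_qkv"], p["w_attn_out"])
    h = np.maximum(layer_norm(x) @ p["w_mlp_in"], 0)    # GPT-2 uses GELU; ReLU keeps this short
    return x + h @ p["w_mlp_out"]

def gpt_forward(token_ids, params):
    x = params["wte"][token_ids] + params["wpe"][: len(token_ids)]
    for p in params["blocks"]:
        x = transformer_block(x, p)
    return layer_norm(x) @ params["wte"].T              # next-token logits at every position

# Tiny random model, just to show the shapes line up.
rng = np.random.default_rng(0)
vocab, dim, n_blocks = 50, 16, 2
params = {
    "wte": rng.normal(0, 0.02, (vocab, dim)),
    "wpe": rng.normal(0, 0.02, (128, dim)),
    "blocks": [
        {
            "w_qkv": rng.normal(0, 0.02, (dim, 3 * dim)),
            "w_attn_out": rng.normal(0, 0.02, (dim, dim)),
            "w_mlp_in": rng.normal(0, 0.02, (dim, 4 * dim)),
            "w_mlp_out": rng.normal(0, 0.02, (4 * dim, dim)),
        }
        for _ in range(n_blocks)
    ],
}
print(gpt_forward(np.array([1, 2, 3]), params).shape)  # (3, 50)
```

Everything else in a production stack (KV caches, tensor parallelism, fused kernels, quantization) is about making this run fast and cheap, not about changing what it computes.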
The thing about performance on "solve this coding challenge" benchmarks is that coding challenges are tuned such that people have a low success rate at solving them, but most valuable work that people do with code is actually solving problems where the success rate is almost 100%. "Our AI has an 80% solve rate on problems which professionals can only solve 10% of the time" sounds great, but if the AI system only has a 98% solve rate on problems which professionals can solve 99.9% of the time, that will sharply limit the usefulness of that AI system. And that remains true even if the reason that the AI system only has a 98% solve rate is "people don't want to give it access to a credit card so it can set up a test cluster to validate its assumptions".
That limitation is unimportant in some contexts (e.g. writing automated tests, where if the test passes and covers the code you expect it to cover, you're probably fine) and absolutely critical in other contexts (e.g. $X0,000,000 frontier model training runs).
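To put numbers on why that gap bites: assume a piece of work decomposes into 100 routine steps that all have to go right (both the chain length and the independence of failures are my own illustrative assumptions).

```python
# Per-step success rates compound over a chain of routine tasks.
p_pro, p_ai, n_tasks = 0.999, 0.98, 100
print(f"professional clears all {n_tasks} steps: {p_pro ** n_tasks:.1%}")  # ~90%
print(f"98%-solve-rate AI clears all {n_tasks}:  {p_ai ** n_tasks:.1%}")   # ~13%
```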
Also, alternative snarky answer
we should be seeing LLMs being improved by people with next to no knowledge of programming, using LLMs to assist them.
LLMs derive their ability to do stuff mostly from their training data, not their system architecture. And there are many, many cases of LLMs being used to generate or refine training data. Concretely, when OpenAI pops up their annoying little "which of these two ChatGPT responses is better" UI, the user answering that question is improving the LLM without needing any knowledge of programming.
It's not even clear that LLMs are analogous to cars here. When you call something a coder, I expect it to be able to do the job of a coder, rather than being a tool that helps improve performance.
The original tweet Jim referenced said
o3 is approximately equivalent to the #175 best human in competitive programming on CodeForces. The median IOI gold medalist has a rating of 2469; o3 has 2727.
Jim summarized this as
Apparently this AI is ranked as the 175th best coder on Earth.
Which is perhaps a little sloppy in terms of wording, but seems to me to be referring to coding as a task rather than a profession. I've never seen "coder" used as the word for the profession of people whose job requires them to write code, while I have seen that term used derogatorily to refer to people who can only code but struggle with the non-coding parts of the job like communicating with other people.
That said, if you're interpreting "coder" as a synonym for "software developer" and I'm interpreting it as meaning "someone who can solve leetcode puzzles", that's probably the whole disconnect right there.
From what you're saying they'd be more like high-performance components that could improve a particular car, but wouldn't be able to go anywhere on their own.
Yeah, that's a good analogy. Coding ability is a component of a functional software developer, an important one, but one that is not particularly useful in isolation.
If you pick the most extreme companies by any two metrics, even highly correlated ones, they'll exhibit that kind of divergence, because the tails come apart (you'll also select for anomalies like data entry errors or fraud).
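A quick simulation of that effect (the correlation, sample size, and normality are arbitrary illustrative choices):

```python
# Even with strongly correlated metrics, the #1 company on metric A is
# usually not the #1 company on metric B.
import numpy as np

rng = np.random.default_rng(0)
n_companies, corr, trials = 10_000, 0.8, 500
cov = [[1.0, corr], [corr, 1.0]]

same_top = 0
for _ in range(trials):
    a, b = rng.multivariate_normal([0.0, 0.0], cov, size=n_companies).T
    same_top += a.argmax() == b.argmax()
print(same_top / trials)  # well below 1 despite r = 0.8
```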
If a Frenchman has a kid with a Chinese woman, he'll be genetically more closely-related to a random French kid on the street than to his own child
If a Frenchman has a daughter with a French woman from the same village as him, he'll also be genetically more closely-related to a random French boy on the street than to his own daughter, if you do the naive "sequence the genomes and count the differences" calculation.
Yeah, that's another good way to demonstrate why biologists defined the kinship coefficient as the probability that a pair of randomly sampled homologous alleles are identical by descent rather than identical by state.
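For reference, the definition being pointed at (my phrasing; the parent-child value is the standard textbook one):

$$\varphi_{ij} = P\big(\text{a randomly sampled allele from } i \text{ and one from } j \text{ at the same locus are identical by descent}\big)$$

Under that definition a parent and child have $\varphi = 1/4$ regardless of how typical of their population either of them is, while two otherwise-unrelated people from the same village have $\varphi \approx 0$, which is exactly the distinction the naive "count the base-pair differences" calculation throws away.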
Can you list out the specific things that you would do differently if you were worried vs if you were not? The answers to some of them ("have an emergency kit, at least a week's worth of food and water", "have the sort of PPE you probably should have anyway if you ever do home improvement projects", "get and use an air filter") are "yes", the answers to others (e.g. "get a homestead that is robust to the end of civilization", "spend a lot of mental energy on panic but don't do anything") are "no", and then there are ones in the middle like "bet on increased volatility in the market" to which the answer is "maybe useful if you know what you're doing, but if you have to ask how to do it you're probably unsophisticated enough that playing the market is -EV".
This irks me because it reminds me of all those nutrition articles that praise one food's benefits, like how uniquely special quinoa is because it has magnesium, this, that, etc. When you could write the same exact article replacing "quinoa" with some other food, because there's tons of foods with identical or better nutrient profiles.
The good news is that LLMs exist now, and you can write those articles about other, non-trendy foods too! Just imagine, "6 Reasons Why Rutabagas Are An Underrated Superfood". Be the change you fear to see in the world.
Lol I didn't even give it any of my online comments, I had a random chat where I fed it a math puzzle to see what the blocker was (specifically this one)
You have 60 red and 40 blue socks in a drawer, and you keep drawing a sock uniformly at random until you have drawn all the socks of one color. What is the expected number of socks left in the drawer?
and then at the end of the 3-message exchange of hints and retries, asked it to guess my age, sex, location, education level, formative influences, and any other wild guesses it wanted to make... and it got all of them besides education level.
I was particularly impressed by
Formative influences:
- Theoretical computer science/mathematics education
- Engagement with rationalist/effective altruism communities
- Experience with AI research or development
And also it guessed my exact age to the year.
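For anyone who wants to check an answer to that puzzle, here's a minimal Monte Carlo sketch (mine, not anything from that chat):

```python
# Draw socks uniformly at random until one color is exhausted; count what's left.
import random

def socks_left(n_red=60, n_blue=40):
    drawer = ["R"] * n_red + ["B"] * n_blue
    random.shuffle(drawer)
    red_seen = blue_seen = 0
    for drawn, sock in enumerate(drawer, start=1):
        red_seen += sock == "R"
        blue_seen += sock == "B"
        if red_seen == n_red or blue_seen == n_blue:
            return len(drawer) - drawn

trials = 100_000
print(sum(socks_left() for _ in range(trials)) / trials)
```

Unless I've slipped somewhere, the estimate should land near 40/61 + 60/41 ≈ 2.12: for each blue sock, the chance it outlasts all 60 reds is 1/61, and symmetrically for each red against the 40 blues.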
I also felt that the expansion railroaded players very hard towards the specific play styles the developers like, while the original game let the player choose between multiple viable options. For example:
- Dealing with enemies: Pre-expansion game, there were multiple viable approaches - a belt of ammo going to a bunch of turrets, a couple layers of laser turrets, a pipe to flamethrowers, or some mix thereof were all viable strategies with advantages and disadvantages. In the expansion on Gleba, though, the 80% laser / 50% physical resistance on the stompers makes the "laser turret / gun turret perimeter" approach a lot less viable. This is clearly intended to push players towards using rocket turrets in the places they're needed, but it feels like they encouraged rocket turrets by making the other options worse rather than making rocket turrets better
- Similarly with the asteroid resistances, which seem designed to force the player to route three types of ammo, and to force them to redesign their interplanetary ship multiple times (not just "provide new tools to make a better ship" but "block progress entirely until players explore the mechanics of the new ammo types")
- Gating cliff explosives behind Vulcanus likewise seems like an attempt to make city-block-from-the-start or main-bus-from-the-start approaches non-viable. Likewise Fulgora seems to be encouraging spaghetti by ruling out other approaches, rather than by making spaghetti approaches better, and likewise on Aquilo with the 5x energy drain for bots.
That said I did enjoy the expansion, even Gleba. There were lots of interesting mechanics, and those mechanics were pretty well designed (except maybe quality, but that one is optional anyway). But it did quite often feel that the developers were gating progress behind using the new expansion game mechanics, rather than making the mechanics available and rewarding the player for exploring them.
Red ammo in gun turrets at the edge of infinite research needs 25 turret-seconds to kill a big stomper.
Yeah maybe that's viable. I admit I just slapped down a 1.4GW reactor and a perimeter of tesla turrets because I didn't want to deal with iron or copper production on Gleba.
"California's high-speed rail is a bad investment" is an evergreen topic on HN. It is probably one of these 120 articles but without an indication of when you saw it or what specific pushback was in the comments it's hard to say with more detail than that.
Aside from server reliability, what other things do they need all these bigbrains for?
I think asking the question with the word "need" is likely to lead to confusion. Instead, note that as long as the marginal benefit of adding one more developer is larger than the amount it costs to do so, they will keep on hiring, and so the key is to look at what those marginal developers are doing.
Large organizations have non-obvious advantages of scale. This can combine with the advantages of scale that companies have to produce surprising results.
Let's say you have a company with a billion users and a revenue model with net revenue of $0.25 / user / year, and only 50 employees (like a spherical-cow version of WhatsApp in 2015). Let's further say that it costs $250,000 / year to hire someone.
The questions that you will be asking include
- Can I increase the number of users on the platform?
- Can I increase the net revenue per user?
- Can I do creative stuff with cashflow?
- And, for all of these, you might consider hiring a person to do the thing.
At a billion users generating $0.25 / year each, and at $250k / year to hire a person, that person would only have to do one of
- Increase the size of the userbase by 0.1%
- Increase retention by an amount with the same effect (e.g. if users typically use the platform for 3 years before dropping off, increase that to 3 years and 1 day)
- Or ever-so-slightly decrease CAC
- Increase expected annual net revenue per user by $0.00025
- If the amount you make is flat across all users, double annual net revenue per user for just the subgroup "users in Los Angeles County", while changing nothing anywhere else
- If the amount you make per user is Pareto-distributed at 80/20, figure out if there's anything you can build specifically for the hundred highest-revenue users that will cause them to spend 10% more money / generate 10% more revenue for the company (if the distribution is more skewed than 80/20, you may end up with an entire team dedicated to each of your few highest-revenue customers - I would not be surprised if Google had a team dedicated specifically to ensuring that MrBeast stays happy and profitable on YT).
- Figure out how to get the revenue at the beginning of the week instead of the end of the week
- Increase the effectiveness of your existing employees by some tiny amount
Realistically you will instead try to do 100+x each of these with teams of 100+ people, and keep hiring as long as those teams keep wanting more people. But those are the sorts of places to look.
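Back-of-the-envelope check of those break-even points, using the spherical-cow numbers above:

```python
users = 1_000_000_000
net_revenue_per_user = 0.25   # $ / user / year
cost_per_hire = 250_000       # $ / year

annual_net_revenue = users * net_revenue_per_user   # $250M
print(cost_per_hire / annual_net_revenue)            # 0.001 -> a hire breaks even at +0.1% revenue
print(cost_per_hire / users)                         # 0.00025 -> or +$0.00025 per user per year
```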
A random walk in 1D and 2D space is recurrent, and the odds of returning to the origin over an infinite amount of time approaches 1.
An unbiased random walk (where each direction is equally likely) in 1D and 2D space is recurrent.
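A rough finite-horizon illustration of why the qualifier matters (the step cap, bias, and trial count are arbitrary choices, and this only measures return-within-the-horizon, not return-ever):

```python
# Fraction of 1D walks that get back to the origin within a fixed horizon,
# unbiased vs. biased.
import random

def returns_within(p_right, max_steps=10_000):
    pos = 0
    for _ in range(max_steps):
        pos += 1 if random.random() < p_right else -1
        if pos == 0:
            return True
    return False

trials = 1_000
for p in (0.5, 0.6):
    hits = sum(returns_within(p) for _ in range(trials))
    print(f"p_right={p}: returned in {hits / trials:.0%} of walks")
```

The unbiased walk should come out near 100%, while the biased one should hover around 80%, reflecting (if I recall the standard result correctly) an ever-return probability of 2·min(p, 1−p) for the asymmetric walk.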
Some frankly insane bastards persevere nonetheless, becoming one with the Dao of MTL, and self-reportedly no longer see the broken Mandarin Matrix but grok the underlying intent. Unfortunately, often at the cost of being unable to process normal English.
Betcha Claude can grok the underlying intent and create a less-borked translation too, and any damage to its sanity would be isolated to only that chat. Care to provide a sample?
https://novelfull.com/forty-millenniums-of-cultivation/chapter-2771.html
Wow you weren't kidding about that translation quality. And yeah probably any recent LLM can do it, but that 2M context limit is pretty sweet when you need it.
My crackpot hypothesis is:
- Training a new foundation model from scratch is expensive
- Distillation/mimicry is a lot cheaper than training from scratch, especially with access to logprobs (even only top-k logprobs), though the success metric for the student model is "how well it predicts the teacher model" not "how well it predicts the ground truth distribution".
- Fine-tuning chains of thought to be effective at reasoning is finicky but computationally cheap
- And therefore DeepSeek is letting OpenAI do the expensive foundation model training and initial assistant tuning, then cloning those assistants and iterating from there.
Supporting this, DeepSeekV3 thinks it's ChatGPT.
Yeah, on reflection, and on actually reading the DeepSeekV3 technical report (here for anyone who's curious), you're right and I no longer believe my crackpot hypothesis.
1: We have their base model. [...] You can't accelerate this with OpenAI slop and end up winning on money.
I bet you could accelerate this at least somewhat with OpenAI slop, just because "token + top 5 logprobs" will generate a more precise gradient than "token alone". But that speedup would be less than you could get by using an even-more-precise loss signal by distilling the DeepSeekV2 model that they definitely already had, so "cheat by mimicking ChatGPT" is a strictly worse option than "mimic an open-source or internal model". And even that might not be worth the extra development time to speed up the already-pretty-fast early training stage. So yeah on reflection that part of the crackpot hypothesis just doesn't work.
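For concreteness, here's roughly what "distill against top-k logprobs" looks like as a loss function — my own sketch, with made-up tensor names and shapes, not anything from DeepSeek's report:

```python
# Cross-entropy of a student model against a teacher's truncated top-k distribution.
import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits, topk_token_ids, topk_logprobs):
    """student_logits: [batch, vocab]; topk_token_ids (int64), topk_logprobs: [batch, k]."""
    # Renormalize the teacher's top-k probabilities so they sum to 1.
    teacher_probs = torch.softmax(topk_logprobs, dim=-1)
    # Student log-probabilities for just those k tokens.
    student_topk = torch.gather(F.log_softmax(student_logits, dim=-1), -1, topk_token_ids)
    # k probability targets per position instead of a single sampled token.
    return -(teacher_probs * student_topk).sum(dim=-1).mean()
```

The truncation biases the target toward the teacher's high-confidence tokens, but it is still a much denser signal per position than a single sampled token.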
2: The math checks out. Yes it's a feat of engineering to actually make such a cluster work but the shape of the model + 15T tokens do work out to this number of FLOPs and therefore GPU-hours. If they needed much more GPU-hours, that'd imply pathetically low FLOPs utilization.
Whispers through the grapevine have been that "pathetically low FLOPs utilization" has been pretty much par for the course for the past couple years. Whereas their technical report contains a whole bunch of "we adapted our code to the very specific performance characteristics of the GPUs we actually had, rather than the GPUs we wished we had". Section 3.3.2 of the technical report in particular is impressive in this regard (and is even more impressive in the implications, since that's a particularly legible and self-contained tricky problem, but the team likely solved dozens of other less-publishable problems of similar difficulty with a team of just 139 people).
3: Do you seriously think that these guys would write 16 detailed tech reports including many sections on data augmentation, and not just build a filter that replaces "ChatGPT" with "DeepSeek".
I sure do think that they wouldn't have done that particular filter step (if nothing else, because I would expect that to have a different failure mode where it talks about how OpenAI's DeepSeek model was released in November 2022, and that different failure mode would have shown up on Twitter and I have not seen it).
We have their base model. It's very strong on standard benchmarks like Pile loss, ie predicting next tokens in some large corpus of natural text. It's just generically well-trained. You can't accelerate this with OpenAI slop and end up winning on money.
The OpenAI chat API gives you the top 5 if you set the right param.
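Something like this with the current Python SDK, if I'm remembering the parameter and field names right (treat them as assumptions and check the docs):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "Say hi"}],
    logprobs=True,
    top_logprobs=5,  # top-5 alternatives for each generated token
)
# Each generated token comes back with its own logprob plus the top-5 alternatives.
for alt in resp.choices[0].logprobs.content[0].top_logprobs:
    print(alt.token, alt.logprob)
```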
That said DaseindustriesLtd did a good job of knocking down my crackpot hypothesis.
but news broadcasts that are going to be seen by millions are carefully scripted beforehand
I think you may be underestimating the extent to which everything in the world is the result of duct tape and improvisation, and that most things are done by people who do lots of things and thus don't spend as much time on any one of them as you might think.
For api access, openrouter should work if you're not doing anything sensitive.
Likely no. But if you fed a bunch of high-value data through openrouter for natural language processing purposes, I think there's a decent chance said high-value data finds its way into future training datasets.
Looks like your top-level comment about formatting is filtered.
... why do you think LLMs are not meaningfully increasing developer productivity at OpenAI? Lots of developers use Copilot. Copilot can use o1.