As someone who is not nearly as impressed with AI as you, thank you for the Turing test link. I'd personally been convinced that LLMs were very far away from passing it, but I realize I misunderstood the nature of the test. It depends way too heavily on the motivation level of the participants. That level of "undergrad small-talk chat" requires only slightly more than Markov-chain-level aptitude. In terms of being a satisfying final showdown of human vs. AI intelligence, Deep Blue or AlphaGo this was not.
I still hold that we're very far away from AI being able to pass a motivated Turing test. For example, if you offered me and another participant a million dollars to win one, I'm confident the AI would lose every time. But then, I would not be pulling any punches in terms of trying to hit guardrails, adversarial inputs, long-context weaknesses, etc. I'm not sure how much that matters, since I'm not sure whether Turing originally wanted the test to be that hard. I can easily imagine a future where AI has Culture-level intelligence yet still couldn't pass that test, simply because it's too smart to fully pass for a human.
As for the rest of your post, I'm still not convinced. The problem is that the model is "demonstrating intelligence" in areas where you're not qualified to evaluate it, which leaves you wide open to bullshitting, something these models are very competent at. I suspect the Turing test wins might even slowly reverse over time as people become more exposed to LLMs. In the same way that 90s CGI now sticks out like a sore thumb, I'll bet that current-day LLM output is going to be glaring in the future. Which makes it quite risky to publish LLM text as your own now, even if you think it totally passes to your eyes. I personally make sure to avoid it, even when I use LLMs privately.
“if the monster’s statistics are better than yours, or your HP is too low, run away.” That is, more or less, making a decision.
That's true, but if that leads to running from every battle, then you won't level up. Even little kids will realize that they're doing something wrong if they're constantly running. That's what I mean when I say it has a lot of disconnected knowledge, but it can't put it together to seek a goal.
One could argue that's an issue with its limited memory, possibly a fault of the scaffold injecting too much noise into the prompt. But I think a human with bad memory could do better, given the tools Claude has. I think the problem might be that all that knowledge is distilled from humans. The strategies it sees are adapted for humans, with their long-term memory, spatial reasoning, etc., not for an LLM with its limitations. And it can't learn or adapt either, so it's doomed to fail, over and over.
I really think it will take something new to get past this. RL-based approaches might be promising. Even humans can't just learn by reading, they need to apply the knowledge for themselves, solve problems, fail and try again. But success in that area may be a long way away, and we don't know if the LLM approach of training on human data will ever get us to real intelligence. My suspicion is that if you only distill from humans, you'll be tethered to humans forever. That's probably a good thing from the safetyist perspective, though.
What makes things interesting is that the line between "creating plausible texts" and "understanding" is so fuzzy. For example, the sentence
my Pokemon took a hit, its HP went from 125 to _
will be much more plausible if the continuation is a number smaller than 125; a jump to "138" after taking a hit is the kind of thing its training data would rarely contain. So in that sense, yes, it understands that attacks cause it to lose HP, that a Pokemon losing HP causes it to faint, etc. However, "work towards a goal" is where this seems to break down. These bits of disconnected knowledge have difficulty coming together into coherent behavior or goal-chasing. Instead you get something distinctly alien, which I've heard called "token pachinko": a model sampling from a distribution that encodes intelligence, but without the underlying mind and agency behind it. I honestly don't know if I'd call it reasoning or not.
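You can see this plausibility effect directly if you score candidate continuations with any causal language model. Here's a rough sketch, using the small open GPT-2 checkpoint from the transformers library purely as a stand-in (obviously not the model Claude actually is), which sums the log-probabilities assigned to each continuation:

```python
# Rough sketch: compare the total log-probability a small LM assigns to two
# candidate continuations of the HP sentence. GPT-2 is just a stand-in here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "my Pokemon took a hit, its HP went from 125 to"

def continuation_logprob(cont: str) -> float:
    """Sum of log-probs of the continuation tokens, given the prompt.
    Assumes the prompt's tokenization is a prefix of the full tokenization,
    which holds for this prompt/tokenizer pair."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + cont, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # Logits at position pos-1 predict the token at position pos.
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

for cont in [" 98", " 138"]:
    print(cont, continuation_logprob(cont))
# Expect the smaller number to usually come out more probable.
```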
It is very interesting, and I suspect that with no constraints on model size or data, you could get indistinguishable-from-intelligent behavior out of these models. But in practice, this is probably going to be seen as horrendously and impractically inefficient, once we figure out how actual reasoning works. Personally, I doubt ten years with this approach is going to get to AGI, and in fact, it looks like these models have been hitting a wall for a while now.
Claude didn't "get decent at playing" games in a couple of months. A human wrote a scaffold to let a very expensive text prediction model, along with a vision model, attempt to play a video game. A human constructed a memory system and knowledge transfer system, and wired up ways for the model to influence the emulator, read relevant RAM states, wedge all that stuff into its prompt, etc. So far this is mostly a construct of human engineering, which still collapses the moment it gets left to its own devices.
When you say it's "understanding" and "thinking strategically", what you really mean is that it's generating plausible-looking text that, in the small, resembles human reasoning. That's what these models are designed to do. But if you hide the text window and judge it by how it's behaving, how intelligent does it look, really? This is what makes it so funny: the model is slowly blundering around in dumb loops while producing volumes of eloquent, optimistic narrative about its plans and how much progress it's making.
I'm not saying there isn't something there, but we live in a world where it's claimed that programmers will be obsolete in two years, people are fretting about superintelligent AI killing us all, OpenAI is planning to rent "PhD-level" AI agent "employees" to companies for large sums, etc. Maybe this is a sign that we should back up a bit.
I'm suspicious of these kinds of extrapolation arguments. Advances aren't magic; people have to find and implement them. Sometimes you just hit a wall. So far most of what we've been doing is milking transformers. That was a great discovery, but I think this playthrough is strong evidence that transformers alone are not enough to make a real general intelligence.
One of the reasons hype is so strong is that these models are optimized to produce plausible, intelligent-sounding bullshit. (That's not to say they aren't useful. Often the best way to bullshit intelligence is to say true things.) If you're used to seeing LLMs perform at small, one-shot tasks and riddles, you might overestimate their intelligence.
You have to interact with a model on a long-form task to see its limitations. Right now, the people who are doing that are largely programmers and /g/ gooners, and those are the most likely people to have realistic appraisals of where we are. But this Pokemon thing is an entertaining way to show the layman how dumb these models can be. It's even better for that purpose because LLMs tend to stealthily "absorb" intelligence from humans by getting gently steered by the hints they leave in their prompts. This game forces the model to rely on its own output, leading to hilarious ideas like the blackout strategy.
The Turing test is an insanely strong test, in the sense that an AI that passes it can be seen to have achieved human-level intelligence at the very least. By this I mean the proper, adversarial test with a fully motivated and intelligent tester (and ideally, the same for the human participant).
Certainly no AI today could pass such a thing. The current large SotA models will simply tell you they are an AI model if you ask. They will outright deny being human or having emotions. I don't know how anyone could think these models ever passed a Turing test, unless the tester was a hopeless moron who didn't even try.
One could object that they might pass if people bothered to finetune them to do so. But that is a much weaker claim akin to "I could win that marathon if I ever bothered to get up from this couch." Certainly they haven't passed any such tests today. And I doubt any current AI could, even if they tried.
In fact, I expect we'll see true superhuman AGI long before such a test is consistently passed. We're much smarter than dogs, but that doesn't mean we can fully imitate them. Just as it takes far more compute to emulate a console like the SNES than the original hardware ever had, I think it will take a lot of surplus intelligence to pretend to be human convincingly. If there is anything wrong with the Turing test, it's that it's way too hard.
I'm horrified that there are guides that lead people to use a 4090 to barely run a 14B model at 0.6 t/s.
For clarity, with a single 4090 you should be able to run a 14B at 8-bit (near-flawless quality) and probably get more than 40 t/s, with tons of space left over for context. But you'd be better off running a 32B at 4-5 bit, which should still have low quantization loss and massively better quality due to the model being larger. You can even painfully squeeze a 70B in there, but the quality loss at the required ~2 bit is probably not worth it. All of those should run at 20-30 t/s minimum.
I think vLLM is meant for real production use on huge servers. For home use I'd start with koboldcpp (really easy), llama.cpp (requires the CLI), or ooba/tabbyapi with EXL2. The latter is faster on pure GPU, but the downside is that you have to deal with Python instead of a pure standalone binary.
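If you'd rather drive it from a script than one of those frontends, the llama-cpp-python bindings for llama.cpp are another option. A minimal sketch, where the GGUF filename is just a placeholder for whatever ~4-bit 32B quant you actually download:

```python
# Minimal llama-cpp-python sketch (assumes it was installed with CUDA support).
# The model path is a placeholder, not a specific recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-32b-instruct-q4_k_m.gguf",  # hypothetical local quant
    n_gpu_layers=-1,  # offload every layer to the 4090
    n_ctx=8192,       # spend the leftover VRAM on context
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```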
This is very common. For a long time, practically every open model was a distilled knockoff trained on synthetic data, mostly from OpenAI. It's been so common that people are familiar with the marks this leaves on a model. Such models are worse than the model they're distilled from, typically less flexible out of distribution (e.g. obeying unusual system prompts), and have an even more intense "sloppy" vibe to them. People have long since gotten bored with these knockoff models. Before DeepSeek, I'd even say that's all people expected from Chinese models.
It also doesn't match what we're seeing from R1 at all, though. One of the reasons R1 is so impressive is that its slop level is much lower, its creativity is way higher, and it doesn't sound like any of the existing AI models. Even Claude feels straitjacketed in comparison, much less OpenAI's models.
I wouldn't be surprised if they did use synthetic data, but whatever training method they're using seems to do a great job of hiding it. Which is amazing in itself. It could have something to do with the reinforcement learning phase that they do. But regardless, it's definitely not as simple as training on data from OpenAI, because people have been doing that forever.
I maintain that a lot of OpenAI's current position derives from the period when they still published their research. You even have Andrej Karpathy teaching you how to build GPT from scratch in a lecture series on YouTube, where he walks you through the series of papers that led to it. It's not a surprise that competitors can catch up quickly if they know what's possible and what the target is.
If it's not a surprise, why didn't anyone else do it? Meta has had a giant cluster of H100s for a long time, but none of their models reached R1's level. Same for Mistral. I don't think following a GPT-from-scratch lecture is going to get you there. More likely there is a lot of data cleaning and operational work needed to even get close, and DeepSeek seems to be no slouch on the ML side either.
Given that they're more like ClosedAI these days, would any novel breakthroughs be as easy to catch up on?
I'm not convinced that they have any left to make. OpenAI's last big "wow" moment was the release of GPT-4. While they've made incremental improvements since, we haven't seen anything like the release of R1, where people get excited enough to share model output and gossip about how it could be done. OpenAI's improvement is seen through benchmark results, and for that matter, through benchmarks they funded and have special access to.
It must be frustrating to work at OpenAI. It's possible that o1's reasoning methods are much more advanced than R1's, but who can tell? In the end, those who publish and release results will get the credit.
Is anyone excited about OpenAI at this point? They've spent what feels like forever selling hype for things that you can't use, and once (if ever) you finally get access to them, they end up feeling underwhelming. Sora got surpassed before it ever released. I expect people will prefer R1 to o3 in actual use even after the latter releases.
Right now OAI seems fixated on maxing out benchmarks and using that to build hype about "incoming AGI" and to talk up how risky and awesome and dangerous their tech is. They can hype up the mystery "super agents" all they want and talk about how they'll "change everything", but for practical applications, Anthropic seems to be doing better, and now DeepSeek is pushing boundaries with open models. Meanwhile, OAI seems to be trying to specialize in investment-money extraction, focusing purely on hype building and trying to worm their way into a Boeing-type position with the US government.
I don't expect anything to come of this "investment", but I'll be waiting eagerly for DeepSeek's next announcement. China seems to be the place to look for AI advancement.
When the city can just delete most of its trash cans and citizens will still largely refrain from littering, while Americans are paying several full-time salaries to pick up dog feces, that's not fully captured by GDP.
Is that net positive? Trash cans seem like a big win in terms of efficiency. It sucks to have to lug around dirty plastic wrappers until you get home, or to have to return your trash to whatever specific store you got it from. And Japan has lots of things that produce plastic waste. It's interesting that this might be one of those times where having some lower-social-trust people around might actually improve QoL, since it would force trash cans to be installed.
I'm also not sure Tokyo's great restaurant prices are a feature of social trust. It probably has more to do with density and sheer demand, which make the economics work. But then, cheap and practical fast food arguably started in the US; it just seems to have gotten much worse at it recently, which is weird.
I don't use Rust, but I'm going to defend it in this case. In fact, I'll go further and defend the "buggy" code in the Cloudflare incident. If your code is heavily configurable, and you can't load your config, what else are you supposed to do? The same thing is true if you can't connect to your (required) DB, allocate (required) memory, etc. Sometimes you just need to die, loudly, so that someone can come in and fix the problem. IME, the worst messes come not from programs cleanly dying, but from them taking a mortal wound and then limping along, making a horrific mess of things in the process.
One can certainly criticize the code for not having a nicer error message. Maybe Rust is to blame for that, at least? Does unwrap not have a way to provide an error string? (I believe that's what expect is for.) That said, any engineer should see what's going on from one look at the offending line, so I doubt it would make that much of a difference. It's not reasonable to blame a language for letting coders deliberately crash the program, either.
IMO, the code itself is fine. The problem is that they deployed a new config to the entire internet all at once without checking that it even loads. THAT is baffling.