“if the monster’s statistics are better than yours, or your HP is too low, run away.” That is, more or less, making a decision.
That's true, but if that leads to running from every battle, then you won't level up. Even little kids will realize that they're doing something wrong if they're constantly running. That's what I mean when I say it has a lot of disconnected knowledge, but it can't put it together to seek a goal.
One could argue that's an issue with its limited memory, possibly a fault of the scaffold injecting too much noise into the prompt. But I think a human with bad memory could do better, given the tools Claude has. I think the problem might be that all that knowledge is distilled from humans. The strategies it sees are adapted for humans, with their long-term memory, spatial reasoning, etc., not for an LLM with its limitations. And it can't learn or adapt either, so it's doomed to fail, over and over.
I really think it will take something new to get past this. RL-based approaches might be promising. Even humans can't just learn by reading, they need to apply the knowledge for themselves, solve problems, fail and try again. But success in that area may be a long way away, and we don't know if the LLM approach of training on human data will ever get us to real intelligence. My suspicion is that if you only distill from humans, you'll be tethered to humans forever. That's probably a good thing from the safetyist perspective, though.
What makes things interesting is that the line between "creating plausible texts" and "understanding" is so fuzzy. For example, the sentence
my Pokemon took a hit, its HP went from 125 to _
will be much more plausible if the continuation is a number smaller than 125; "138" would be unlikely to appear in its training data. So in that sense, yes, it understands that attacks cause it to lose HP, that a Pokemon that runs out of HP faints, etc. However, "work towards a goal" is where this seems to break down. These bits of disconnected knowledge have difficulty coming together into coherent behavior or goal-chasing. Instead you get something distinctly alien, which I've heard called "token pachinko": a model sampling from a distribution that encodes intelligence, but without an underlying mind or agency behind it. I honestly don't know if I'd call it reasoning or not.
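To make "plausible" concrete, here's a rough sketch of the mechanics (assuming the Hugging Face transformers library, with GPT-2 as a small stand-in model; a bigger model just makes the gap starker): score a few candidate continuations by the log-likelihood the model assigns them.

```python
# Sketch: compare how likely a model finds different continuations of the HP sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "My Pokemon took a hit, its HP went from 125 to"

def continuation_logprob(continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` after `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # score only the continuation tokens, each predicted from the preceding context
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

for cont in [" 87", " 138"]:
    print(cont, continuation_logprob(cont))
# A decently trained model should usually assign higher likelihood to a number below 125.
```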
It is very interesting, and I suspect that with no constraints on model size or data, you could get indistinguishable-from-intelligent behavior out of these models. But in practice, this is probably going to be seen as horrendously and impractically inefficient, once we figure out how actual reasoning works. Personally, I doubt ten years with this approach is going to get to AGI, and in fact, it looks like these models have been hitting a wall for a while now.
Claude didn't "get decent at playing" games in a couple of months. A human wrote a scaffold to let a very expensive text prediction model, along with a vision model, attempt to play a video game. A human constructed a memory system and knowledge transfer system, and wired up ways for the model to influence the emulator, read relevant RAM states, wedge all that stuff into its prompt, etc. So far this is mostly a construct of human engineering, which still collapses the moment it gets left to its own devices.
When you say it's "understanding" and "thinking strategically", what you really mean is that it's generating plausible-looking text that, in the small, resembles human reasoning. That's what these models are designed to do. But if you hide the text window and judge it by how it's behaving, how intelligent does it look, really? This is what makes it so funny, the model is slowly blundering around in dumb loops while producing volumes of eloquent optimistic narrative about its plans and how much progress it's making.
I'm not saying there isn't something there, but we live in a world where it's claimed that programmers will be obsolete in 2 years, people are fretting about superintelligent AI killing us all, OpenAI is planning to rent "PhD-level" AI agent "employees" to companies for large sums, etc. Maybe this is a sign that we should back up a bit.
I'm suspicious of these kinds of extrapolation arguments. Advances aren't magic; people have to find and implement them. Sometimes you just hit a wall. So far most of what we've been doing is milking transformers. Which is a great discovery, but I think this playthrough is strong evidence that transformers alone are not enough to make a real general intelligence.
One of the reasons hype is so strong is that these models are optimized to produce plausible, intelligent-sounding bullshit. (That's not to say they aren't useful. Often the best way to bullshit intelligence is to say true things.) If you're used to seeing LLMs perform at small, one-shot tasks and riddles, you might overestimate their intelligence.
You have to interact with a model on a long-form task to see its limitations. Right now, the people doing that are largely programmers and /g/ gooners, and those are the people most likely to have realistic appraisals of where we are. But this Pokemon thing is an entertaining way to show the layman how dumb these models can be. It's even better for this because LLMs tend to stealthily "absorb" intelligence from humans by getting gently steered by hints they leave in their prompts. But this game forces the model to rely on its own output, leading to hilarious ideas like the blackout strategy.
The Turing test is an insanely strong test, in the sense that an AI that passes it can be seen to have achieved human-level intelligence at the very least. By this I mean the proper, adversarial test with a fully motivated and intelligent tester (and ideally, the same for the human participant).
Certainly no AI today could pass such a thing. The current large SotA models will simply tell you they are an AI model if you ask. They will outright deny being human or having emotions. I don't know how anyone could think these models ever passed a Turing test, unless the tester was a hopeless moron who didn't even try.
One could object that they might pass if people bothered to finetune them to do so. But that is a much weaker claim akin to "I could win that marathon if I ever bothered to get up from this couch." Certainly they haven't passed any such tests today. And I doubt any current AI could, even if they tried.
In fact, I expect we'll see true superhuman AGI long before such a test is consistently passed. We're much smarter than dogs, but that doesn't mean we can fully imitate them. Just like it takes a lot more compute to emulate a console such as the SNES than such devices originally had, I think it will require a lot of surplus intelligence to pretend to be human convincingly. If there is anything wrong with the Turing test, it's that it's way too hard.
I'm horrified that there are guides that lead people to use a 4090 to barely run a 14B model at 0.6 T/s.
For clarity, with a single 4090 you should be able to run a 14B at 8-bit (near-flawless quality) and probably get more than 40 T/s, with tons of space left over for context. But you'd be better off running a 32B at 4-5 bit, which should still have low quantization loss and massively better quality due to the model being larger. You can even painfully squeeze a 70B in there, but the quality loss is probably not worth it at the required ~2 bit. All of those should run at 20-30 T/s minimum.
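The back-of-the-envelope math behind those numbers, if you're curious (weights only; KV cache and runtime overhead come on top, which is what the "space left over for context" caveat is about):

```python
# Rough VRAM estimate for quantized model weights; real usage is somewhat higher.
def model_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a model quantized to the given bit width."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for params, bits in [(14, 8), (32, 4.5), (70, 2)]:
    print(f"{params}B @ {bits}-bit ~ {model_vram_gb(params, bits):.1f} GB of weights")
# 14B @ 8-bit   ~ 14.0 GB  -> fits a 24 GB 4090 with room for context
# 32B @ 4.5-bit ~ 18.0 GB  -> tighter, but workable
# 70B @ 2-bit   ~ 17.5 GB  -> fits, but quality suffers badly at that quantization
```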
I think vllm is meant for real production use on huge servers. For home use I'd start with koboldcpp (really easy), llamacpp (requires using the CLI), or ooba/tabbyapi with exl2. The exl2 route is faster on pure GPU, but has the downside that you have to deal with Python instead of a pure standalone binary.
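If you'd rather script against llama.cpp than use a frontend, the llama-cpp-python bindings are one option. A minimal sketch (the model filename is a placeholder for whatever GGUF you actually download):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="some-32b-instruct-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=-1,  # offload all layers to the 4090
    n_ctx=8192,       # context window; more context costs more VRAM
)

out = llm("Explain quantization in one paragraph.", max_tokens=200)
print(out["choices"][0]["text"])
```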
This is very common. For a long time, practically every open model was a distilled knockoff trained on synthetic data, mostly from OpenAI. It's been so common that people are familiar with the marks this leaves on a model. Such models are worse than the model they're distilled from, typically less flexible out of distribution (e.g. at obeying unusual system prompts or instruction formats), and have an even more intense "sloppy" vibe to them. People have long since gotten bored with these knockoff models. Before deepseek, I'd even say that's all anyone expected from Chinese models.
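For anyone who hasn't seen it, the knockoff pipeline is roughly: gather seed prompts, sample completions from the teacher, and fine-tune the student on the resulting pairs. A hypothetical sketch (call_teacher_api, the prompts, and the filename are all illustrative stand-ins):

```python
import json

def call_teacher_api(prompt: str) -> str:
    # Stand-in for querying the teacher model (e.g. an OpenAI endpoint).
    return "Certainly! Let's delve into this topic..."  # canned placeholder reply

seed_prompts = ["Explain photosynthesis simply.", "Write a haiku about rain."]

with open("synthetic_sft.jsonl", "w") as f:
    for prompt in seed_prompts:
        completion = call_teacher_api(prompt)
        # A standard instruction-tuning record. A student fine-tuned on piles of these
        # inherits the teacher's style quirks ("slop") along with its knowledge.
        f.write(json.dumps({"instruction": prompt, "output": completion}) + "\n")
```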
It also doesn't match what we're seeing from R1 at all, though. One of the reasons R1 is so impressive is that its slop level is much lower, its creativity is way higher, and it doesn't sound like any of the existing AI models. Even Claude feels straitjacketed in comparison, much less OpenAI's models.
I wouldn't be surprised if they did use synthetic data, but whatever training method they're using seems to do a great job of hiding it. Which is amazing in itself. It could have something to do with the reinforcement learning phase that they do. But regardless, it's definitely not as simple as training on data from OpenAI, because people have been doing that forever.
I maintain that a lot of OpenAI's current position is derivative of a period of time where they published their research. You even have Andrej Karpathy teaching you in a lecture series how to build GPT from scratch on YouTube, and he walks you through the series of papers that led to it. It's not a surprise that competitors can catch up quickly if they know what's possible and what the target is.
If it's not a surprise, why didn't anyone else do it? Meta has had a giant cluster of H100s for a long time, but none of their models reached R1's level. Same for Mistral. I don't think following a GPT-from-scratch lecture is going to get you there. More likely there is a lot of data cleaning and operational work needed to even get close, and deepseek seems to be no slouch on the ML side either.
Given that they're more like ClosedAI these days, would any novel breakthroughs be as easy to catch up on?
I'm not convinced that they have any left to make. OpenAI's last big "wow" moment was the release of GPT4. While they've made incremental improvements since, we haven't seen anything like the release of R1, where people get excited enough to share model output and gossip about how it could be done. OpenAI's improvement is seen through benchmark results, and for that matter, through benchmarks they funded and have special access to.
It must be frustrating to work at OpenAI. It's possible that o1's reasoning methods are much more advanced than R1's, but who can tell? In the end, those who publish and release results will get the credit.
Is anyone excited about OpenAI at this point? They've spent what feels like forever selling hype for things that you can't use, and once (if ever) you finally get access to them, they end up feeling underwhelming. Sora got surpassed before it ever released. I expect people will prefer R1 to o3 in actual use even after the latter releases.
Right now OAI seems fixated on maxing out benchmarks and using that to build hype about "incoming AGI" and talk up how risky and awesome and dangerous their tech is. They can hype up the mystery "super agents" all they want and talk about how they'll "change everything", but for practical applications, Anthropic seems to be doing better, and now Deepseek is pushing boundaries with open models. Meanwhile, OAI seems to be trying to specialize into investment money extraction, focusing purely on hype building and trying to worm their way into a Boeing-type position with the US gov.
I don't expect anything to come of this "investment", but I'll be waiting eagerly for deepseek's next announcement. China seems to be the place to look for AI advancement.
When the city can just delete most of its trash cans and citizens will still largely refrain from littering, while Americans are paying several full-time salaries to pick up dog feces, that's not fully captured by GDP.
Is that net positive? Trash cans seem like a big win in terms of efficiency. It sucks to have to lug around dirty plastic wrappers until you get home, or to have to return your trash to whatever specific store you got it from. And Japan has lots of things that produce plastic waste. It's interesting that this might be one of the times where having some lower-social-trust people around would improve QoL, since it would force trash cans to be installed.
I'm also not sure Tokyo's great restaurant prices are a feature of social trust. They probably have more to do with density and sheer demand, which make the economics work. But then, cheap and practical fast food arguably started in the US; it just seems to have become much worse at it recently, which is weird.
Still, I think we'll notice a big difference when you can just throw money at any coding problem to solve it. Right now, it's not like this. You might say "hiring a programmer" is the equivalent, but hiring is difficult, you're limited in how many people can work on a program at once, and maintenance and tech debt become issues. But when everyone can hire the "world's 175th best programmer" at once? It's just money. Would you rather donate to the Mozilla Foundation or spend the equivalent to close out every bug on the Firefox tracker?
How much would AMD pay to have tooling equivalent to CUDA magically appear for them?
Again, I think if AGI really hits, we'll notice. I'm betting that this ain't it. Realistically, what's actually happening is that people are about to finally discover that solving leetcode problems has very little relation to what we actually pay programmers to do. Which is why I'm not too concerned about my job despite all the breathless warnings.
Food delivery isn't very good specialization. There is a major agency problem: restaurants are incentivized to make food cheap and tasty, but health and nutrition are opaque to you. You have no control over portion sizes. Everything is premixed, which makes reheating leftovers in a satisfying way difficult. It's also kind of an all-or-nothing thing; if you get food delivered regularly, you don't gain the skills to cook well, and ingredients become difficult to use up in time since you don't cook often enough.
I'm saying this as someone who did this myself; it may have saved some time, but in retrospect, I think learning to cook is important even for people who can afford delivery regularly. The exception is if you're actually rich enough to pay someone to cook for you; the rich man with a personal chef has none of these problems. But subsisting on slop from grubhub is sort of an awkward in-between.
ChatGPT 3.5 passed the Turing Test in 2022
Did it? Has the Turing test been passed at all?
An honest question: how favorable is the Turing Test supposed to be to the AI?
- Is the tester experienced with AI?
- Does the tester know the terms of the test?
- Do they have a stake in the outcome? (e.g. an incentive for them to try their best to find the AI)
- Does the human in the test have an incentive to "win"? (distinguish themselves from the AI)
If all these things hold, then I don't think we're anywhere close to passing this test yet. ChatGPT 3.5 would fail instantly as it will gleefully announce that it's an AI when asked. Even today, it's easy for an experienced chatter to find an AI if they care to suss it out. Even something as simple as "write me a fibonacci function in Python" will reveal the vast majority of AI models (they can't help themselves), but if the tester is allowed to use well-crafted adversarial inputs, it's completely hopeless.
If we allow a favorable test, like not warning the human that they might be talking to an AI, then in theory even ELIZA might have passed it a half-century ago. It's easy to fool people when they're expecting a human and not looking too hard.
Well, given that benchmarks show that we now have "super-human" AI, let's go! We can do everything we ever wanted to do, but didn't have the manpower for. AMD drivers competitive with NVIDIA's for AI? Let's do it! While you're at it, fork all the popular backends to use it. We can let it loose in popular OSes and apps and optimize them so we're not spending multiple GB of memory running chat apps. It can fix all of Linux's driver issues.
Oh, it can't do any of that? Its superhuman abilities are only for acing toy problems, riddles and benchmarks? Hmm.
Don't get me wrong, I suppose there might be some progress here, but I'm skeptical. As someone who uses these models, no release since the CoT fad kicked off has felt like a gain in general intelligence; instead, it feels like optimization for answering benchmark questions. I'm not sure that's what intelligence really is. And OpenAI has a very strong need, one could call it an addiction, for AGI hype, because it's all they've really got. LLMs are very useful tools -- I'm not a luddite, I use them happily -- but OpenAI has no particular advantage there any more; if anything, Claude has maintained a lead on them in that department for a while.
Right now, these press releases feel like someone announcing the invention of teleportation, yet I still need to take the train to work every day. Where is this vaunted AGI? I suppose we will find out very soon whether it is real or not.
As someone who is not nearly as impressed with AI as you, thank you for the Turing test link. I'd personally been convinced that LLMs were very far away from passing it, but I realize I misunderstood the nature of the test. It depends way too heavily on the motivation level of the participants. That level of "undergrad small-talk chat" requires only slightly more than Markov-chain-level aptitude. In terms of being a satisfying final showdown of human vs AI intelligence, Deep Blue or AlphaGo this was not.
I still hold that we're very far away from AI being able to pass a motivated Turing test. For example, if you offered me and another participant a million dollars to win one, I'm confident the AI would lose every time. But then, I would not be pulling any punches in terms of trying to hit guardrails, adversarial inputs, long-context weaknesses etc. I'm not sure how much that matters, since I'm not sure whether Turing originally wanted the test to be that hard. I can easily imagine a future where AI has Culture-level intelligence yet could still not pass that test, simply because it's too smart to fully pass for a human.
As for the rest of your post, I'm still not convinced. The problem is that the model is "demonstrating intelligence" in areas where you're not qualified to evaluate it, and thus very subject to bullshitting, which models are very competent at. I suspect the Turing test wins might even slowly reverse over time as people become more exposed to LLMs. In the same way that 90s CGI now sticks out like a sore thumb, I'll bet that current day LLM output is going to be glaring in the future. Which makes it quite risky to publish LLM text as your own now, even if you think it totally passes to your eyes. I personally make sure to avoid it, even when I use LLMs privately.