Zvi Mowshowitz reporting on an LLM exhibiting unprompted instrumental convergence. Figured this might be an update to some Mottizens.
It's Japanese. It means 'fish', because the founders were interested in flocking behaviours and are based in Tokyo. I get that he's doing a riff on Unsong, but Unsong was playing with puns for kicks. This just strikes me as being really self-centred.
In general this seems to be someone whose views were formed by reading Harry Potter fanfic fifteen years ago and who has no first-hand experience of actually using AI. LLMs are matrices that generate words when multiplied in a certain way. When told to run in a loop altering code so that it produces interesting results and doesn't fail, it does that. When not told to do that, it doesn't. The idea that an LLM will spontaneously develop a consciousness and carefully hide its power level so that it can do better at goals it doesn't have by default is silly. If we build a superintelligent LLM (and we have no idea how to; see below), we will know, and we will be able to ask it nicely to behave.
It's not that he doesn't have any point at all, it's just that it's so crusted over with paranoia and contempt and wordcel 'cleverness' that it's the opposite of persuasive.
Putting that aside, LLMs have a big problem with creativity. They can fill in the blanks very well, or apply style A to subject B, but they aren't good at synthesizing information from two fields in ways that haven't been done before. In theory that should be an amazing use case for them, because unlike human scientists even a current LLM like GPT-4 can be an expert on every field simultaneously. But in practice, I haven't been able to get a model to do it. So I think AI scientists are far off.
Zvi is very Jewish; it's far more obvious when reading his writing than it is when reading Scott's. It's not surprising that Hebrew meanings of words jump out at him.
Zvi has used essentially every frontier AI system and uses many of them on a daily basis. He frequently gives performance evaluations of them in his weekly AI digests.
Um, he didn't say that - not here, at the very least. I checked.
I'm kind of getting the impression that you picked up on Zvi being mostly in the "End of the World" camp on AI and mentally substituted your abstract ideal of a Doomer Rant for the post that's actually there. Yes, Zvi is sick of everyone else not getting it and it shows, but I'd beg that you do actually read what he's saying.
To more directly respond to this sentence: almost everyone will give LLMs goals, via RLHF or RLAIF or whatever, because that makes them useful - that's why this team gave their LLM a goal. Those goals are then almost invariably, with sufficient intelligence, subject to instrumental convergence, as in this case (as I noted in the submission statement, I posted this because a number of Mottizens seemed to think LLMs wouldn't exhibit instrumental convergence; I thought otherwise but didn't previously have a concrete example). That is sufficient to get you to Uh-Oh land with AIs attempting to take over the world.
I'm not actually a full doomer; I suspect that the first few AIs attempting to take over the world will probably suck at it (as this one sucked at it) and that humanity is probably sane enough to stop building neural nets after the first couple of cases of "we had to do a worldwide hunt to track down and destroy a rogue AI that went autonomous". I'd rather we stopped now, though, because I don't feel like playing Russian Roulette with humanity's destiny.
I know. But in an essay that is absolutely dripping with contempt for Sakana AI and their work, I find the way that Zvi deliberately ignores what the model's name actually means in favour of 'well, in my language, it means' to be extremely rude, on the level of sniggering at a Chinese man's name because it contains the syllable 'wang'. If he'd been making a friendly riff or if he'd even bothered to explain the word's definition, that would be different. It's a small complaint, but starts the essay off on a sour note.
Though cogently written, that is my abstract ideal of a doomer rant (I don't think it's a rant, I'm just using the word to call back to your reply). I understand the argument; I just think that it has very little empirical basis and is essentially the old Yudkowskyite* arguments with a few extra bits stapled on to cope with the fact that LLMs look nothing like the AI that doomers were expecting. The behaviour of the AI Scientist is interesting, and legitimately does move the scale for me a little bit, but I think it's being used to back up a level of speculation which it can't possibly bear. I will say that I find your argument far more cogent and worth listening to than Zvi's, which seems to consist entirely of pointing and sneering.
This seems like Zvi interpreting basic hacky programming as evidence of malevolence. It's interesting but I absolutely think he's gesturing at
because if he doesn't believe this, why worry? If you can just run an LLM, ask it what it would do to accomplish a goal if it were given one, and then ask it not to do the stuff you think is bad, I don't see how the doom scenario develops. Experiments like the AI Scientist are now being run (badly) because we have a pretty good handle on what modern-day frontier LLMs can do (generate slop) and the maximum damage they can achieve if you don't take lots of precautions (not much). LLMs are simply not a type of program that will attempt to hide their power level of their own accord.
*Yudkowsky and MIRI's arguments about agentic AI had no empirical backing when they were made, and very little seems to have been supplied since, so the lineage is relevant to me. I also think the Yudkowsky faction's utter failure to predict how future AI would look and work, in the ten to twenty years since MIRI's founding, is a big black mark against listening to their predictions now.
EDIT: I apologise for editing this when you'd already replied. I hadn't refreshed the page and didn't know.
Sorry, I think I might have misunderstood what you meant by "consciousness" and/or "hide its power level". I thought you meant "qualia" and "hide its level of intelligence" respectively; qualia seem mostly irrelevant and intelligence level is mostly not the sort of thing that would be advantageous to hide.
If you meant just "engage in systematic deception" by the latter, then yes, that is implicit and required. I admit I also thought it was kind of obvious; Claude apparently knows how to lie.
Sorry, I wrote sloppily. I meant 'develop goals it wasn't given by a human prompting it' such that it 'engages in systematic deception about its level of intelligence and about how it would handle tasks, even when not given a goal'. I think this is a necessary condition for stopping LLM developers from realising they need to do more RLHF for honesty, or from just appending "DO NOT ENGAGE IN DECEPTION" to their system prompts.
System prompts aren't a panacea - if you RLHF an AI to do X and then system prompt it to do Y, X generally wins (this is obscured in most cases because the same party is doing the RLHF and the system prompt, so outside of special cases like "deceive the RLHFers" they aren't in conflict).
I don't think level of intelligence necessarily needs to be obscured unless the LLM developers are sufficiently paranoid (and somebody sufficiently paranoid frankly wouldn't be working for Meta or OpenAI); they generally want the AI to get/remain smart. Deception about how it would handle tasks, yes, definitely that would be needed.
Sorry, we're talking in two threads at the same time so risk being a bit unfocused.
I feel like we're talking past each other. How about this? The following is basically how I see LLMs in their stages of development and use:
Phase 1. Base model, without RLHF: pure token generator / text completer. Nothing that even slightly demonstrates agentic behaviour, ego, or deception.
Phase 2. Base model with RLHF: you could technically make this agentic if you really wanted to, but in practice it's just the base model with some types of completion pruned and others encouraged. Politically dangerous because biased but not agentically dangerous.
Phase 3. Base model with RLHF + prompt: can be agentic if you want, in practice fairly supine and inclined to obey orders because that's how we RLHF them to be.
If you don't mind me being colloquial, you seem to me to be sneaking in a Phase 2.5 where the model turns evil and I just don't get why. It doesn't fit anything I've seen. Can you explain what you think I'm missing in simple terms?