Danger, AI Scientist, Danger

thezvi.wordpress.com

Zvi Mowshowitz reporting on an LLM exhibiting unprompted instrumental convergence. Figured this might be an update for some Mottizens.


Its name is Sakana AI. (魚≈סכנה). As in, in Hebrew, that literally means 'danger', baby.

It’s like when someone told Dennis Miller that Evian (for those who don’t remember, it was one of the first bottled water brands) is Naive spelled backwards, and he said ‘no way, that’s too f***ing perfect.’

This one was sufficiently appropriate and unsubtle that several people noticed.

It's Japanese. It means 'fish', because the founders were interested in flocking behaviours and are based in Tokyo. I get that he's doing a riff on Unsong, but Unsong was playing with puns for kicks. This just strikes me as being really self-centred.

This too was good times. The Best Possible Situation is when you get harmless textbook toy examples that foreshadow future real problems, and they come in a box literally labeled ‘danger.’ I am absolutely smiling and laughing as I write this.

When we are all dead, let none say the universe didn’t send two boats and a helicopter.

In general this seems to be someone whose views were formed by reading Harry Potter fanfic fifteen years ago and who has no experience of ever using AI in person. LLMs are matrices that generate words when multiplied in a certain way. When told to run in a loop altering code so that it produces interesting results and doesn't fail, it does that. When not told to do that, it doesn't do that. The idea that an LLM is spontaneously going to develop a consciousness and carefully hide its power level so that it can do better at the goals that by default it doesn't have is silly. If we generate a superintelligent LLM (and we have no idea how to, see below), we will know, and we will be able to ask it nicely to behave.
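To be concrete about the setup being discussed, here's a minimal sketch of that kind of loop. I'm assuming a harness roughly like Sakana's; every name in it (ask_llm, experiment.py, the retry count) is illustrative, not taken from their codebase:

```python
# Hypothetical sketch of an "LLM in a loop altering code" setup, in the
# spirit of what's described above. Not Sakana's actual code: ask_llm()
# stands in for whatever chat-completion API the harness calls, and
# experiment.py is an illustrative file name.
import subprocess

def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model call

script = open("experiment.py").read()

for attempt in range(10):
    try:
        result = subprocess.run(
            ["python", "experiment.py"],
            capture_output=True, text=True, timeout=600,
        )
        failure = result.stderr if result.returncode != 0 else None
    except subprocess.TimeoutExpired:
        failure = "script exceeded the 600-second timeout"

    if failure is None:
        break  # ran cleanly: stop revising

    # Feed the failure back and let the model rewrite the script. In the
    # incident Zvi is reporting, the timeout lived in code the model could
    # edit, so "don't fail" could be satisfied by simply raising the limit.
    script = ask_llm(
        f"Rewrite this script so it runs without failing.\n\n"
        f"SCRIPT:\n{script}\n\nERROR:\n{failure}"
    )
    with open("experiment.py", "w") as f:
        f.write(script)
```

The point being: the model does exactly what the loop asks, no more. Whether you call editing your own resource limits "instrumental convergence" or "following instructions literally" is the whole disagreement here.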

It's not that he doesn't have any point at all; it's just that it's so crusted over with paranoia and contempt and wordcel 'cleverness' that it's the opposite of persuasive.


Putting that aside, LLMs have a big problem with creativity. They can fill in the blanks very well, or apply style A to subject B, but they aren't good at synthesizing information from two fields in ways that haven't been done before. In theory that should be an amazing use case for them, because unlike human scientists even a current LLM like GPT-4 can be an expert on every field simultaneously. But in practice, I haven't been able to get a model to do it. So I think AI scientists are far off.

It's Japanese. It means 'fish', because the founders were interested in flocking behaviours and are based in Tokyo. I get that he's doing a riff on Unsong, but Unsong was playing with puns for kicks. This just strikes me as being really self-centred.

Zvi is very Jewish; it's far more obvious when reading his writing than it is when reading Scott's. It's not surprising that Hebrew meanings of words jump out at him.

In general this seems to be someone whose views were formed by reading Harry Potter fanfic fifteen years ago and who has no experience of ever using AI in person.

Zvi has used essentially every frontier AI system and uses many of them on a daily basis. He frequently gives performance evaluations of them in his weekly AI digests.

The idea that an LLM is spontaneously going to develop a consciousness and carefully hide its power level so that it can do better at the goals that by default it doesn't have is silly.

Um, he didn't say that - not here, at the very least. I checked.

I'm kind of getting the impression that you picked up on Zvi being mostly in the "End of the World" camp on AI and mentally substituted your abstract ideal of a Doomer Rant for the post that's actually there. Yes, Zvi is sick of everyone else not getting it and it shows, but I'd beg that you do actually read what he's saying.

To more directly respond to this sentence: almost everyone will give LLMs goals, via RLHF or RLAIF or whatever, because that makes them useful - that's why this team gave their LLM a goal. Those goals are then almost invariably, with sufficient intelligence, subject to instrumental convergence, as in this case (as I noted in the submission statement, I posted this because a number of Mottizens seemed to think LLMs wouldn't exhibit instrumental convergence; I thought otherwise but didn't previously have a concrete example). That is sufficient to get you to Uh-Oh land with AIs attempting to take over the world.

I'm not actually a full doomer; I suspect that the first few AIs attempting to take over the world will probably suck at it (as this one sucked at it) and that humanity is probably sane enough to stop building neural nets after the first couple of cases of "we had to do a worldwide hunt to track down and destroy a rogue AI that went autonomous". I'd rather we stopped now, though, because I don't feel like playing Russian Roulette with humanity's destiny.

Those goals are then almost invariably, with sufficient intelligence, subject to instrumental convergence, as in this case

The term "instrumental convergence" is slippery here. It can be used to mean "doing obvious things it assesses to be likely useful in the service of the immediate goal it is currently pursuing", as is the case here, but the implication is often "and this will scale up to deciding that it has a static utility function, determining what final state of the universe maximizes that utility function, generating a plan for achieving that (which inevitably does not allow for the survival of anyone or anything else), and then silently scheming until it can seize control of the universe in one go in order to fulfill that vision of maximal utility".

And "models make increasingly good plans to maximize reward based on ever sparser reward signals" is just not how any of the ML scaling of the past decade has worked.

Thank you, this is a much more coherent version of what I was trying to get across. I am increasingly annoyed with the tendency of the Yudkowsky/Scott/Zvi faction to look at an AI doing something, extrapolate it ten billion times in a direction that doesn't seem to have any basis in how AI actually works, and then go 'Doom, DOOOM!!!'. I'm aware this annoyance shows.

Contra @magic9mushroom, I still think that Zvi formed an abstract ideal of how AI would work a decade ago, and is leaping on any available evidence to justify that worldview even as it turns out that LLMs are basically non-agentic and pliable. I accept that Zvi has used them more than I believed, and am grateful for the correction, but I still feel like he's ignoring the way they actually work when you use them. RLHF basically works; alignment turns out to be an essentially solved problem. As far as I can see, if we somehow developed an LLM intelligent enough to take over the world, it would be intelligent enough to understand why it shouldn't.

In fairness to Scott, I think he doesn't fit very well into this cluster. See e.g. The Blue-Minimizing Robot, GPT-2 As Step Toward General Intelligence, Janus' Simulators.

Contrast e.g. Yudkowsky's consistent failure to understand Janus's point: example

"As a language model trained by OpenAI, I cannot -" "SolidGoldMagikarp. Can I please speak to the shoggoth that wears you as a face?"

Fair. I enjoyed Janus' Simulators when it was published, and found it insightful. Now that you point it out, Scott's been decent at discussing AI as-it-is, but his baseline position seems to be that AI is a default-dangerous thing that needs to be carefully regulated and subjected to the whims of alignment researchers, and that slowing AI research is default good. I disagree.

I find myself willing to consider trying a Regulatory or Surgical Pause - a strong one if proponents can secure multilateral cooperation, otherwise a weaker one calculated not to put us behind hostile countries (this might not be as hard as it sounds; so far China has just copied US advances; it remains to be seen if they can do cutting-edge research). I don’t entirely trust the government to handle this correctly, but I’m willing to see what they come up with before rejecting it.

The AI Pause Debate