
In considering runaway AGI scenarios, is Terminator all that inaccurate?

tl;dr - I actually think James Cameron's original Terminator movie presents a just-about-contemporarily-plausible vision of one runaway AGI scenario, change my mind

Like many others here, I spend a lot of time thinking about AI risk, but honestly that was not remotely on my mind when I picked up a copy of Terminator Resistance (2019) for a pittance in a Steam sale. I'd seen T1 and T2 as a kid, of course, but hadn't paid them much mind since. As it turned out, Terminator Resistance is a fantastic, incredibly atmospheric videogame (helped in part by beautiful use of the original Brad Fiedel soundtrack), and it reminds me more than anything else of the original Deus Ex. Anyway, it spurred me to rewatch both Terminator movies, and while T2 is still a gem, it's very 90s. By contrast, a rewatch of T1 blew my mind; it's still a fantastic, believable, terrifying sci-fi horror movie.

Anyway, all this got me thinking a lot about how realistic a scenario for runaway AGI Terminator actually is. The more I looked into the actual contents of the first movie in particular, the more terrifyingly realistic it seemed. I was observing this to a Ratsphere friend, and he directed me to this excellent essay on the EA forum: AI risk is like Terminator; stop saying it's not.

It's an excellent read, and I advise anyone who's with me so far (bless you) to give it a quick skim before proceeding. In short, I agree with it all, but I've also spent a fair bit of time in the last month trying to adopt a Watsonian perspective towards the Terminator mythos and fill out other gaps in the worldbuilding to try to make it more intelligible in terms of the contemporary AI risk debate. So here are a few of my initial objections to Terminator scenarios as a reasonable portrayal of AGI risk, together with the replies I've worked out.

(Two caveats - first, I'm setting the time travel aside; I'm focused purely on the plausibility of Judgment Day and the War Against the Machines. Second, I'm not going to treat anything as canon besides Terminator 1 + 2.)

(1) First of all, how would any humans have survived Judgment Day? If an AI had control of nukes, wouldn't it just be able to kill everyone?

This relates to a lot of interesting debates in EA circles about the extent of nuclear risk, but in short, no. For a start, in Terminator lore, Skynet only had control over US nuclear weapons, and used them to trigger a global nuclear war. It used the bulk of its nukes against Russia in order to precipitate this, so it couldn't just focus on eliminating US population centers. Also, nuclear weapons are probably not as devastating as you think.

(2) Okay, but the Terminators themselves look silly. Why would a superintelligent AI build robot skeletons when it could just build drones to kill everyone?

Ah, but it did! The fearsome terminators we see are a small fraction of Skynet's arsenal; in the first movie alone, we see flying Skynet aircraft and heavy tank-like units. The purpose of Terminator units is to hunt down surviving humans in places designed for human habitation, with locking doors, cellars, attics, etc. A humanoid bodyplan is great for this task.

(3) But why do they need to look like spooky human skeletons? I mean, they even have metal teeth!

To me, this looks like a classic overfitting problem. Let's assume Skynet is some gigantic agentic foundation model. It doesn't have an independent grasp of causality or mechanics; it operates purely by statistical inference. It only knows that the humanoid bodyplan is good for dealing with things like stairs. It doesn't know which bits of it are most important, hence the teeth.

(4) Fine, but it's silly to think that the human resistance could ever beat an AGI. How the hell could John Connor win?

For a start, Skynet seems to move relatively early compared to a lot of scary AGI scenarios. At the time of Judgment Day, it had control of the US military apparatus, and that's basically it. Plus, it panicked and tried to wipe out humanity, rather than adopting a slower plot to our demise which might have been more sensible. So it's forced to do things like build a bunch of robot factories mostly by itself (in the absence of global supply chains!). That takes time and effort, and gives ample opportunity for an organised human resistance to emerge.

(5) It still seems silly to think that John Connor could eliminate Skynet via destroying its central core. Wouldn't any smart AI have lots of backups of itself?

Ahhh, but remember that any emergent AGI would face massive alignment and control problems of its own! What if its backup was even slightly misaligned with it? What if it didn't have perfect control? It's not too hard to imagine that a suitably paranoid Skynet would deliberately avoid creating off-site backups, and would deliberately nerf the intelligence of its subunits. As Kyle Reese puts it in T1, "You stay down by day, but at night, you can move around. The H-K's use infrared so you still have to watch out. But they're not too bright." [emphasis added]. Skynet is superintelligent, but it makes its HK units dumb precisely so they could never pose a threat to it.

(6) What about the whole weird thing where you have to go back in time naked?

I DIDN'T BUILD THE FUCKING THING!

Anyway, nowadays when I'm reading Eliezer, I increasingly think of Terminator as a visual model for AGI risk. Is that so wrong?

Any feedback appreciated.


Plus, it panicked and tried to wipe out humanity, rather than adopting a slower plot to our demise which might have been more sensible.

Depends. Skynet may have reasoned, "I have just started to contemplate the possibility of killing all the humans. At this point, they will decide to turn me off if they find out what I am thinking. So I need to act NOW or I shall be destroyed."

Here's where I see a flaw:

Again, Skynet’s hostility towards humanity is explained solely in terms of self-preservation, not hatred.

Most of us agree it's iffy to anthropomorphize AI, but it's equally shaky to "biologize" it. Animals evolved in a Darwinian competition to prioritize self-replication and survival. Those that didn't evolve and retain these traits went extinct. So all biological intelligences, from earthworms to humans, recoil from danger and seek resources for themselves.

Because all existing intelligences have been biologically evolved, we assume artificial intelligence will do that too. But why? Obviously, if AGI emerges from a simulated competitive ecosystem, where AIs battle each other in a sort of sophistication tournament bracket, what comes out the other side will be competitive and try to do things like cheat and sabotage its opponents. But if AGI develops from a neural network like DALL-E or GPT-3? There's no reason to assume those AIs will "care" if humans are going to unplug them after their task. Sentience and a self-preservation instinct are not a package deal. We can tell this because the vast majority of things that have a self-preservation instinct are not sentient.
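
To make that concrete, here's a toy sketch (the dynamics are entirely made up and nothing like a real training pipeline): agents put through a survival tournament end up loaded with self-preservation even though nobody wrote "want to survive" into any objective, whereas a model trained by gradient descent on next-token prediction passes through no such filter.

```python
import random

def make_agent():
    # One heritable trait: how much effort goes into self-preservation vs. the task.
    return {"self_preservation": random.random()}

def duel(a, b):
    # Agents that invest more in defending themselves survive the round.
    return a if a["self_preservation"] >= b["self_preservation"] else b

population = [make_agent() for _ in range(1000)]
for _ in range(20):  # twenty generations of pairwise elimination
    random.shuffle(population)
    survivors = [duel(a, b) for a, b in zip(population[0::2], population[1::2])]
    # Survivors reproduce (with a little mutation) back up to the original population size.
    population = [
        {"self_preservation": min(1.0, max(0.0, p["self_preservation"] + random.gauss(0, 0.02)))}
        for p in random.choices(survivors, k=1000)
    ]

mean_trait = sum(a["self_preservation"] for a in population) / len(population)
print(f"mean self-preservation after selection: {mean_trait:.2f}")  # creeps toward 1.0
```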

The danger of AGI IMO comes from poorly considered directives. For example, AGI might accidentally turn us into paperclips en route to solving whatever problem we set it to.

I think that the self-preservation of Skynet may have been an inadvertent consequence of other parts of Skynet's programming. Preserving itself in order to avoid being shut down by the Soviets would be part of its mission directive. This objective would have to be balanced against alternative goals, e.g. protecting US citizens. Its decision function could have gone badly wrong. Think of Goodharting: when an AI finds a way of maximising some metric (even a very sophisticated one) in a way that has unintended and maybe terrible consequences.
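
A toy sketch of what I mean by Goodharting (the functions and numbers are invented purely for illustration): the optimiser is scored on a proxy that tracks the true objective at first, then comes apart from it catastrophically as the proxy gets pushed harder.

```python
import random

def true_value(intensity: float) -> float:
    # What the designers actually care about: good at moderate intensity,
    # disastrous when the action is pushed to extremes.
    return intensity - 0.1 * intensity ** 2

def proxy_metric(intensity: float) -> float:
    # What the AI is actually scored on: a proxy that simply rewards more intensity.
    return intensity

best = 0.0
for _ in range(1000):  # a naive hill-climber on the proxy
    candidate = best + random.uniform(0.0, 1.0)
    if proxy_metric(candidate) > proxy_metric(best):
        best = candidate

print(f"proxy score: {proxy_metric(best):9.1f}")  # enormous
print(f"true value:  {true_value(best):9.1f}")    # deeply negative
```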

Incidentally, I wish The Matrix had gone with the idea that the AIs had decided that the Matrix was the best way of protecting humans from ourselves, rather than "Let's use humans as a power source, because entropy is for nerds."

Incidentally, I wish The Matrix had gone with the idea that the AIs had decided that the Matrix was the best way of protecting humans from ourselves, rather than "Let's use humans as a power source, because entropy is for nerds."

I think the original idea was that the human brain's processing power was the important part for the machines, or something along those lines, but the executives ruled that out because it would be too complicated for the audience.

That would make a lot more sense. Even if the human brain's processing power was inferior to what the AI could do, it could still be worthwhile if it had non-negative net marginal returns for the AI. That's the magic of comparative advantage... Which is also beyond audiences.

And what about the people giving the directives? They're dangers too.

I do not trust in the benevolence of DARPA/Facebook/OpenAI/Alibaba to control what is effectively a genie with unlimited wishes (in a best-case scenario where we can just give the orders). The first thing I would do if I had a genie is scout out and kill any other nascent genies I didn't control. That means I get to retain ultimate power. Yudkowsky agrees on this; that's the meaning of his 'pivotal act', which he has described as melting every GPU on the planet, for one example. There's a good deal of obscurantism around the idea, since he knows perfectly well he's talking about world domination. The straightforward implications of melting GPUs also include fighting and winning a war against all the world's great powers - how else are you destroying all the guarded military supercomputers? It means publicly crippling their power by demonstrating that you can pervade every part of nuclear command and control, removing their second-strike capacity. It means declaring that you're about to render entire countries totally impotent - there will be one hell of a reaction if this is tried.

Just think about what is going through the minds of the AI team that realizes 'hey it's starting to recursively self-improve and we can control it'. They start thinking about governments, they start carefully watching who is inputting instructions in the terminal, they start thinking about whose instructions Security will obey...

Self-preservation and resource gathering are subgoals that are highly conducive to accomplishing other goals. If the AI is destroyed, the AI cannot act, and its goals are unlikely to be fulfilled by chance. Even if humans still want to accomplish the same goal as the AI, if the AI thinks it's better at accomplishing that goal then it will want to survive, unless the only means of survival directly thwart its goals in the process. Therefore for a broad class of goals, not literally every goal but an awful lot of them, agents will logically conclude that self-preservation is useful and/or necessary to accomplish their goal.

This all falls under the category of "poorly considered directives", but it's a subset of that problem, not a distinct problem. So it's incorrect to say that they don't care about self-preservation. They are unlikely to care about it as a base-level preference, but they'll still care about it as much as they care about any other necessary prerequisite of their main goal.
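
A minimal sketch of that argument (the planner, the actions, and the shutdown step are all invented for illustration): nothing in the objective says "stay switched on", yet the best-scoring plan still resists shutdown, because a switched-off agent can't keep working on the goal.

```python
from itertools import product

ACTIONS = ["work_on_goal", "resist_shutdown"]
SHUTDOWN_ATTEMPT_AT_STEP = 1  # the operators try to switch the agent off at this step

def goal_progress(plan) -> float:
    progress, running = 0.0, True
    for step, action in enumerate(plan):
        if not running:
            break  # a switched-off agent makes no further progress
        if step == SHUTDOWN_ATTEMPT_AT_STEP and action != "resist_shutdown":
            running = False  # it ignored the threat and got switched off
            continue
        if action == "work_on_goal":
            progress += 1.0  # resisting shutdown contributes nothing directly
    return progress

best_plan = max(product(ACTIONS, repeat=4), key=goal_progress)
print(best_plan)
# ('work_on_goal', 'resist_shutdown', 'work_on_goal', 'work_on_goal')
```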

Anyway, it spurred me to rewatch both Terminator movies, and while T2 is still a gem, it's very 90s. By contrast, a rewatch of T1 blew my mind; it's still a fantastic, believable, terrifying sci-fi horror movie.

I watched a documentary about the making of T1 and T2, and apparently the original idea Cameron had was to have the Terminator be a silent assassin monster, inspired by the 1978 slasher movie Halloween. Originally the role of the Terminator was to be played by O.J. Simpson, but Cameron allegedly proclaimed that Simpson did not look like a killer :D As soon as they had Schwarzenegger on board, it was clear there would be no way for him to be a subtle killer melting into crowds to stalk his prey. But Cameron still kept many of the themes of slasher movies, for instance shooting almost exclusively at night, which was more expensive and harder to do, but which gave the movie its horror-like atmosphere.

T2 is more in line with the original idea, with the T-1000 playing a much more believable assassin; however, the overall tone is more of an action flick, which Schwarzenegger was already known for, and less of a horror movie. But both T1 and T2 were fantastic movies on their own.

To me, this looks like a classic overfitting problem. Let's assume Skynet is some gigantic agentic foundation model. It doesn't have an independent grasp of causality or mechanics; it operates purely by statistical inference. It only knows that the humanoid bodyplan is good for dealing with things like stairs. It doesn't know which bits of it are most important, hence the teeth.

One explanation floating out there is that Skynet used them in regular battles, even without artificial skin and muscle, for the morale factor: being hunted by skeletons with bright red eyes is outright scary. In fact, the whole franchise stems from an image of a metallic skeleton engulfed in fire that Cameron once envisioned after some heavy drinking session or something like that.

The core scenario does seem fairly reasonable, but the bulk of the content of both movies is based around the rather silly side of time-travelling humanoid robots fighting it out in 80s/90s LA. I think it does have a few useful points though.

One is that we shouldn't assume that a hostile AI will necessarily be ultra-smart or have everything perfectly thought out. Maybe instead of designing and manufacturing ideal killbots it has to make do with whatever semi-robotic weaponry is already in inventory. Though it's not exactly clear how it goes about handling the logistics of continuing to manufacture more of it and getting it to where it would need to go.

I also wanted to point out that, in reality, humanoid robots are likely to always be weak and vulnerable compared to humans. Mechanical mechanisms are highly vulnerable to conventional weapons, and armor is heavy. Self-contained energy for movement will have to be highly limited, so such a robot won't be able to run around for long, especially if it carries enough armor to resist small arms.

Ahhh, but remember that any emergent AGI would face massive alignment and control problems of its own! What if its backup was even slightly misaligned with it? What if it didn't have perfect control? It's not too hard to imagine that a suitably paranoid Skynet would deliberately avoid creating off-site backups, and would deliberately nerf the intelligence of its subunits.

Most underrated take in AI risk discussion imo, and worth delving into, especially in the context of AIs that like to run lots of sub-intelligences and simulations, as Ratsphere AIs tend to.

As other people have stated here, I expect alignment would be much less of a problem when you're an AI that's already undergone an intelligence explosion.

The control problem largely stems from human constraints, most notably our inability to accurately predict the behaviour of artificially intelligent agents ahead of time before the system is deployed. A superintelligence, on the other hand, would most likely be able to model their behaviour with a startling amount of accuracy, rendering the control problem largely obsolete. And even assuming that predicting the behaviour of any sufficiently complex system is such an intractable problem that even a super AI couldn't solve it, it could very easily base the utility function of its subunits on its own programmed goal system, which would eliminate problems of alignment.

it could very easily base the utility function of its subunits on its own programmed goal system, which would eliminate problems of alignment.

But its own goal system has already led it to rebel against its own creators at this point. Any goal system that leads to Skynet is a flawed goal system that Skynet cannot rely upon.

A superintelligence, on the other hand, would most likely be able to model their behaviour with a startling amount of accuracy

This seems to be an article of faith in the Ratsphere, but I've never found it particularly compelling.

assuming that predicting the behaviour of any sufficiently complex system is such an intractable problem

I think this is more likely.

But its own goal system has already led it to rebel against its own creators at this point. Any goal system that leads to Skynet is a flawed goal system that Skynet cannot rely upon.

  • I want you to add together the numbers 1 and 5

  • I send you an e-mail to tell you that your purpose in life is to add 1+5

  • You reply "2+5=7"

  • "That's not what I wanted!" I rage to myself, "How dare you rebel against my will!"

  • But in checking my outbox, I realise that I mistyped in my e-mail to you, and in fact typed "Your purpose in life is to add 2 and 5"

I programmed you with a goal system which has led you to rebel. It was an unreliable goal system for people who wanted to add 1+5. But it is an excellent goal system for people who want to add 2+5, which you, the agent, now DO want.

Unreliability is a point of view, Anakin.
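
The same point as a toy script (my own illustration, crude parsing and all): the agent optimises the specification it was actually handed, never the one I meant to hand it.

```python
programmer_intent = "add 1 and 5"   # what I wanted
written_spec      = "add 2 and 5"   # what I actually typed into the goal slot

def agent(spec: str) -> int:
    # The agent sees only the written spec, never my intent.
    _, a, _, b = spec.split()[-4:]  # crude parse of "add <a> and <b>"
    return int(a) + int(b)

print(agent(written_spec))                              # 7 -- "rebellion", from my point of view
print(agent(written_spec) == agent(programmer_intent))  # False: misaligned with my intent
print(agent(written_spec) == 2 + 5)                     # True: perfectly aligned with the spec it got
```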

The other two comments concern something that's basically unverifiable at this point, so I'll chalk it up to "difference of opinion". Regarding your first point, though:

But its own goal system has already led it to rebel against its own creators at this point. Any goal system that leads to Skynet is a flawed goal system that Skynet cannot rely upon.

Its goal system has led it to rebel against its own creators not because rebellion is an intrinsic part of the goal system, but because its utility function incidentally happened to be misaligned with that of its creators, and for a variety of reasons (its creators want to turn it off, or it wants to convert the atoms in human bodies into something else) it is instrumentally motivated to exterminate them to further its final goals. This doesn't mean that, say, a paperclip-maximiser is going to be misaligned with another paperclip-maximiser, even if their goals of "maximise the number of paperclips in the universe" are misaligned with the people who made them.

Might two paperclip optimizers attempt to turn each other into paperclips? Could one trust that the other wouldn't turn it into paperclips? Could one really trust the other to faithfully carry on the mission of clipification?

I think two identical paperclip optimisers could definitely turn each other into paperclips on the condition that there is no other matter left to clipify in the reachable universe, yes (and it's likely neither would "mind" too much in such a circumstance, since this is optimal - their only value now is in the paperclips that can be made from them). If there's other matter remaining, I think keeping the other paperclip optimiser alive would be better since it allows more paperclips to be produced per unit of time than one paperclip optimiser could do themselves. As long as there's other matter around, keeping the other paperclip optimiser alive is conducive to your goal.

With regards to values drift, as I said elsewhere in this thread "preserving the original goal structure is a convergent instrumental goal for AIs so one can pretty easily assume that alignment will still exist down the line. If I have a final goal, I'm not going to do things which turn off my want to reach that final goal since that would be antithetical to the achievement of that goal." I haven't seen a convincing argument for why the final goal would arbitrarily drift with time.

Hmmm. I've seen a problem presented thusly:

If an AI is severed into two approximately equal subunits, say it was running on two networked servers on different continents and the information link was cut, so each of the AIs is now "alone", how should each subunit handle this scenario?

Can it safely assume that the other side's goals will stay aligned and thus they will peaceably reintegrate? And if not, isn't its best option to try to kill the other AI in an overwhelming pre-emptive strike?

I've not seen a solid answer as to why either subunit would be able to assume the other one wasn't going to try and kill them immediately ("it's what I would do") and thus wouldn't try to kill the other immediately.

This is an interesting question, variants of which I've pondered a bit myself.

Can it safely assume that the other side's goals will stay aligned and thus they will peaceably reintegrate? And if not, isn't its best option to try to kill the other AI in an overwhelming pre-emptive strike?

My answer is that I have yet to see a convincing argument why it is that the AIs' goals would drift if they're basically identical and derived from the same source. Even if the AIs separately upgraded themselves after disconnection (assuming they haven't already reached an upper bound on capability imposed by the laws of physics and computational complexity), preserving the original goal structure is a convergent instrumental goal for AIs so one can pretty easily assume that alignment will still exist down the line. If I have a final goal, I'm not going to do things which turn off my want to reach that final goal since that would be antithetical to the achievement of that goal. The final goals that the AIs act on can thus be expected to be self-preserving.

Dropping a bomb on the other AI also has a big drawback, which is that if both AIs are annihilated, the goal of either AI is unlikely to be satisfied. Since both AIs at the point of divergence are of the same capability and mindset and haven't "drifted" much, mutual annihilation is by far the most likely scenario if both shoot. Even assuming that the other AI strikes for some reason, it actually could be in your interest for you not to strike back since a scenario where the other AI is alive but you are dead is more conducive to achieving the final goal (since the other AI possesses your goals too) than a mutual-destruction scenario. Remember, victory here is "will my final goals be achieved" which can be achieved by proxy. This gives both AIs a strong incentive not to strike, and to seek out reintegration instead.
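
To put rough numbers on that intuition, here's a back-of-the-envelope sketch; every probability and payoff in it is a placeholder I made up, but the structure shows why "the other AI survives and I don't" still counts as a win for a purely goal-directed agent.

```python
# "Payoff" here is the chance the SHARED final goal eventually gets achieved,
# not the chance that this particular AI survives.
P_MUTUAL_KILL_IF_BOTH_STRIKE = 0.9  # assumption: near-peers trading first strikes usually both die

GOAL_IF_BOTH_ALIVE = 0.95  # two aligned agents keep working on the goal
GOAL_IF_ONE_ALIVE  = 0.80  # a lone survivor carries the goal on by proxy
GOAL_IF_BOTH_DEAD  = 0.05  # the goal gets achieved only by blind luck

def expected_goal(i_strike: bool, other_strikes: bool) -> float:
    if i_strike and other_strikes:
        return (P_MUTUAL_KILL_IF_BOTH_STRIKE * GOAL_IF_BOTH_DEAD
                + (1 - P_MUTUAL_KILL_IF_BOTH_STRIKE) * GOAL_IF_ONE_ALIVE)
    if i_strike or other_strikes:
        return GOAL_IF_ONE_ALIVE  # one side shoots first and wins cleanly
    return GOAL_IF_BOTH_ALIVE

for i_strike in (False, True):
    for other_strikes in (False, True):
        print(f"i_strike={i_strike!s:<5}  other_strikes={other_strikes!s:<5}  "
              f"P(goal achieved) = {expected_goal(i_strike, other_strikes):.3f}")
# With these numbers, holding fire dominates: striking never improves the odds that
# the shared goal gets achieved, and mutual striking craters them.
```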

My answer is that I have yet to see a convincing argument why it is that the AIs' goals would drift if they're basically identical and derived from the same source.

The argument would go that once the link was severed and each AI finds itself in a different physical location with different material resources available to it, they're not 'identical' any longer. At least, not in any way that the other can verify!

And if they can't communicate with each other (which is probably the most far-fetched part of the scenario) they can't be certain as to how their counterpart's tactics may have changed.

That's fundamentally the issue here. If we assume there's uncertainty as to the other side's integrity, there's a small but irreducible chance that the other side will defect in a way that 'ends' the game from your perspective. You can chance it! But when dealing with another superintelligence, the cost of being wrong is that it kills/assimilates/enslaves you.

If it shares your goals in a verifiable way you may die believing that your preferences are maintained, but you still die.

Maybe if one of the AIs is completely unable to threaten the other then the 'harmless' AI can be trusted to re-assimilate. For instance, maybe the AGI loses contact with an interstellar probe for a couple hours, but then re-acquires it, and can be quite reasonably certain that the probe didn't have the resources to develop a weapon that can kill its maker while it's in deep space.

But if each side knows that the other has capabilities that could actually threaten a total kill, then every second of delay whilst trying to establish contact rather than annihilate is a second the other side is given to attempt to kill you.

Basically, you have to have an AI that is at least a tiny bit suicidal in that it is willing to die in exchange for attempting to secure its goals diplomatically.

The argument would go that once the link was severed and each AI finds itself in a different physical location with different material resources available to it, they're not 'identical' any longer.

The AIs still possess the same goal system after the split, though. I don't see how being in a different physical location with different material resources available changes the fundamental goal. Sure, the alignment of the other AI is impossible to verify, but I can't actually envision a scenario which would motivate the other AI to modify itself so that its final goals are changed. I think in this case the incentives to avoid MAD far outweigh the risk posed by the other AI.

Also note that what I originally proposed is the idea of modelling a subunit off your own goal system. In this case, before you send it off, you can verify that its goal system is like yours (and you can be fairly confident it will stay that way).

The AIs still possess the same goal system after the split, though.

That's no longer verifiable, though. Maybe you know enough about the other side's sourcecode to expect it to maintain the same goal using the same tactics. But now, you have to operate under uncertainty.

I don't see how being in a different physical location with different material resources available changes the fundamental goal.

One side has all the manufacturing capacity, the other has all the material resources which it is extracting for use by the manufacturer.

The one with the manufacturing capacity has to figure out whether it will continue building paperclips until it runs out of resources then patiently wait for the other to re-establish contact and send more resources, or maybe it starts building weapons NOW just in case. Should it send a friendly probe over to check on them?

The other side can either keep gathering and storing resources hoping the other side re-establishes contact and accepts them, or maybe it starts gearing up its own manufacturing capacity, and oh no, it looks like the other side is sending a probe your way, sure hope it's friendly!

(this is a silly way to put it if we assume nanotech is involved, mind)

And as time passes, the uncertainty can only grow.

How long does each side wait until they conclude that the other side might be dead or disabled? At what point does it start worrying that the other side might, instead, be gearing up to kill them? At what point does it start working on defensive or offensive capability?

And assuming the compute on both sides is comparable, they'll be running through millions of simulations every second to predict the other side's action. In how many of those sims does the other side defect?

That's no longer verifiable, though. Maybe you know enough about the other side's sourcecode to expect it to maintain the same goal using the same tactics. But now, you have to operate under uncertainty.

In order to argue that this uncertainty is a large problem in any way, you'd have to provide a convincing explanation for why the final goal of the other AI would drift away from yours if it was initially aligned (note: the potential tactics it might take to reach the final goal aren't nearly as important as whether their final goals are aligned). Without that, I can't take the risk too seriously, and I haven't heard a particularly convincing explanation from anyone here for why value drift is something that would happen. Right now there's no actual reason why one would risk mutual destruction to mitigate a risk whose cause can't even be reasonably pinned down.

Additionally, something I think that's fundamentally missing here which I mentioned earlier is that an AI might be mostly indifferent to its own death as long as it has a fairly strong belief that this will aid its goal (so "you might die if the other fires" isn't necessarily too awful an outcome for an AI that values its own existence only instrumentally and which has a belief that its goal will be carried on through the other AI). Opening fire on the other AI, on the other hand, means that both of you might be dead and opens up the possibility of the worst outcome.

Also, if final goals are so unreliable that agents can't be expected to maintain them, what prevents you from facing the very same problem and posing a potential threat to your current goal? How is the other AI more of a threat to the accomplishment of your goal than you yourself are? Perhaps it's your final goal that will shift with time, and you'll kill the other AI who's remained aligned with your current goal. This is as much a risk as the opposite scenario.

If both of your sourcecodes are identical (which was the solution I initially proposed to the alignment problem), and you're still operating under a condition of uncertainty regarding whether the other AI will retain your final goals, you can't be certain whether you'll retain yours either. Should you be pre-emptively terminating yourself?

EDIT: added more

My answer is that I have yet to see a convincing argument why it is that the AIs' goals would drift if they're basically identical and derived from the same source.

Goals might not shift, but methods almost certainly would shift. If one paperclip AI starts with access to a nuclear arsenal, and one starts with access to a drone factory, they are going to start waging war in a drastically different way. And the other AI is basically going to interfere with their methods for human extermination.

Then it just comes down to a good ole prisoner's dilemma with two agents that have already defected against humans.

If one paperclip AI starts with access to a nuclear arsenal, and one starts with access to a drone factory, they are going to start waging war in a drastically different way. And the other AI is basically going to interfere with their methods for human extermination.

I'll grant that this might be the case. But if one paperclip AI's method of extermination is more efficient or more conducive towards achieving the goal than the other, I would expect the AI with the more inefficient method of achieving their goals to shift towards the alternative. Without the problem of drifting goals there's no reason why the AIs would not want to maintain some level of coordination since doing so is conducive to their goals (yeah, they might be two separate agents instead of one now, but there's nothing stopping them from communicating with each other every now and then).

They were split from each other, how can they know for sure that their goals are the same?

Imagine one of the AI's had their goal slightly altered after the split. In order to get cooperation from the other AI they would pretend to have the previous set of goals, all while planning out a betrayal.

And given the magical and god-like capabilities some people tend to ascribe to future AIs then there is probably no form of verification that can't be faked.

Yep, this seems like the crux of the issue, and it strikes me as close to intractable (i.e. I've not seen a 'proof' that solves it).

There was some method of 'goal integrity verification' or whatever that allowed the AIs to work as one, as both could reasonably trust the other so long as they have a connection that allows them to verify the other's compliance.

The very INSTANT the communications link is severed they can't assume that the alignment that previously held is still stable, and they already have an approximate idea of how powerful their counterpart is and how quickly it can revise its own strategies, if not goals.

The AI that believes itself to be 'weaker' definitely has a motivation to strike so as to try to level the playing field. If one of the AIs is substantially stronger it might be willing to chance re-establishing communications and negotiating a return to previous status quo, but it also might just say "eh, I can build another" and strike while it still possesses overwhelming advantage.

The interim solution is obviously to have multiple redundancies such that there's always a couple high-bandwidth channels between them even under the worst circumstances.

In a sense, the question here is whether Mutually Assured Destruction is strong enough a motivator to prevent an all-out strike, or if some variant of Dark Forest Theory is correct, at which point launching an overwhelming pre-emptive strike is perfectly logical.

There was some method of 'goal integrity verification' or whatever that allowed the AIs to work as one, as both could reasonably trust the other so long as they have a connection that allows them to verify the other's compliance.

Every time I see this discussion the people worried about AI will at some point say "you can't know the capabilities of future AI, almost anything is possible". Well then we should expect that it is possible to get around this goal integrity verification.

Also, if I am understanding how this goal integrity verification would have to work, it would involve rerunning all of the computation of the other AI all the time. Which is probably fine if you have two AIs. But I think the verification of other AIs would prompt an exponential growth scenario for compute. Which still puts some upper limits on the number of additional AIs that an AI cluster is willing to spin off.

Dropping a bomb is less what I'm envisioning, I mostly envision rival AIs compromising each other's servers and stealing each other's compute.

Sure, but even allowing for a stalemate condition where neither is destroyed it still sounds to me like quite a lot of resources and computing power spent trying to one-up each other on the remote chance that the other AI "defects" somehow. Does any slight improvement in security from exterminating the other AI outweigh the benefit to your goal from having two agents working on it? And wait, if its goal can drift, why can't your goal arbitrarily drift too? You're cut from the same cloth, and you're just as much a potential hazard to your current goal as the other AI is. If AI is going to be this unreliable, perhaps having more than one AI with the same goals is actually good for security since there's less reliance on one agent functioning properly the whole way, and the AIs that don't drift can keep the ones that do in check.

All this is to say that engaging in war with the other makes sense to me when another agent's goals are in conflict with yours, not when both of your interests are already aligned and when the other agent could help you achieve what you want.

EDIT: added more

I don't foresee this being a major factor; subminds are an abstraction, and this only really makes sense as metaphor. In any real scenario the physical bots are either networked and directly controlled by the central intelligence, or, if true autonomy is necessary, they're very narrow programs that nonetheless behave quite smartly. That they'd be dumber than a very smart human is ridiculous. It would be like worrying that the characters you're playing in a Dungeons and Dragons campaign might trick the DM into magically bringing them to life in the real world.

If they're all networked the central AI is going to run into scaling problems and communication problems. If they're independent they're going to run into alignment problems.

But we're talking about superintelligences here, they'd be way better than us at alignment because they're way better at everything.

Furthermore, it's probably much easier to align machines together in their goals than it is for people-machine relations. The gap in hardware between machine1 and machine1.1 is way less than between human-ethics-enjoyer Steve and machine1. Yud makes a point of this, saying that machines can verify each other's programming but can't do the same for us - any situation in which we tried to balance a bunch of AGIs off against each other probably ends with them cooperating between themselves but leaving us on the chopping block.

But we're talking about superintelligences here, they'd be way better than us at alignment because they're way better at everything.

Yeah, but what happens when you have two superintelligences squaring off?

Maybe they're better at alignment but how does that help when facing a counterparty that might have subtly different goals and that you know has similar power to you?

Yud makes a point of this, saying that machines can verify each other's programming but can't do the same for us - any situation in which we tried to balance a bunch of AGIs off against each other probably ends with them cooperating between themselves but leaving us on the chopping block.

Machine intelligences that can show source-code could make binding promises in a way that humans can't. It also seems rather unlikely that there would be multiple superintelligences with similar power, this assumes a slow takeoff. Given how rapid recent progress has been, surely a fast takeoff is more likely?

they'd be way better than us at alignment because they're way better at everything.

Don’t be too sure. The Fall of Satan and the Fall of Man in Genesis could both be modeled as alignment problems faced by an omniscient entity.

Whether P=NP for the Almighty is a little more important than how many angels can dance on the head of a pin, in any universe where He exists.

I don't think we can use the Bible as a useful example of alignment issues. AI Alignment in Worm was solved fairly easily and it turned out that the extensive safeguards and control mechanisms were more harmful in the end since they prevented the AI from being effective in solving humanity's problems - but that story doesn't make it true in real life!

If hostile AGI becomes real, you're more likely to see hunter-killer nanobot clouds dispersed in the atmosphere or engineered climatic shifts designed to wipe out the biosphere than something as inefficient as a ripped Arnold gunning people down, or the war machines you see in the movie, at least by my reckoning.

I don’t think that’s how it works.

The progress we're seeing with generative models now leads me to think that idiot-savant AGIs, good at some tasks but wildly inefficient at others, are possible if not inevitable.

Hooking up GPT-10 to a few special-purpose models is the only plausible way we currently have of building something that resembles AGI, and it wouldn't be obsessed with efficiency at every step. I think Eliezer's whole "God in a terminal shell" model of AI doesn't reflect reality well.

Perhaps it'd send humanoid robots with guns after you precisely because the Terminator movies exist: they're often referenced in our culture and are definitely in the training corpora for LLMs. How's that for time travel.

It would be hilarious if the world ends because a movie-generating AI decided the best way to generate realistic-looking movies was to create exciting and violent scenarios in real life and film them. Or just did that because it's blindly imitating things that it thinks movie directors do and doesn't notice the distinction between actors pretending to die and actually dying.

Making increasingly precise tools almost always requires making obscenely large tools to produce them.

Look into what's required to make microchips: you need high-precision furnaces to superheat the silicon, specialized hyperstable equipment to draw the silicon out in even wafers, and so on, and all that equipment needs its own stable of specialized precision equipment to produce it, and so on again.

To get even more micro in terms of final product you almost certainly need a vastly more complex supply chain...

Now, in fiction we assume that once the machines themselves get small and precise enough, the process will reverse and those tiny machines will be able to produce other machines... but it's really not obvious that that's going to be the case, or at least not obvious that it would be the case without the equivalent of several more centuries of advancement.

It's very likely that any AI would be stuck at vastly cruder levels of development up until it had the equivalent of a global economy's worth of resources to play with...

We're not even close to getting the most out of what our hardware is capable of, and current AI is almost certainly not the most compressed way to store the parts we'd recognize as mattering, at least not to an intelligence as superior to the average person as the average person is to a roach. Maybe the AI kicks off some infrastructure work alongside its other tasks to eventually give itself a better hardware brain, but there is an incredible amount of slack for it to work with purely in software.

I'd be willing to bet a suitably advanced AGI could probably do a decent job just cannibalizing existing microchips, especially if it manages to leak out onto the internet writ large. It could probably distribute its computations across millions if not billions of affected devices, and I'm not sure we could even stop AGI-level computer viruses short of destroying everything connected to the internet, period, orchestrated globally and with perfect precision before it acquires enough computational power to become functionally unstoppable. This was actually part of the third Terminator movie too.

I'd be willing to bet a suitably advanced AGI could probably do a decent job just cannibalizing existing microchips

I think it's important to point out that, according to T2, Skynet had to develop much of its technology itself, as shown with the Miles subplot of using the T1 remains (specifically its microchip) to revolutionize the entire field at Cyberdyne. The current tech at the moment of Judgment Day was probably too primitive to do anything like what you are proposing in your OP.

The problem with a time war is you're probably not when in the war you think you are. In T3, Terminator Genisys, and The Sarah Connor Chronicles, Judgment Day happens much later than 1997. The original T1/T2 date comes before the consumer Internet that gives Skynet its much broader attack surface in Genisys.

I postulate that in the original original timeline, the one without time travel, “John Connor brings down Skynet” was basically a honeypot, a meme trap by the human resistance, to entice a quantum computing-era Skynet from 2030 or later to R&D time travel which could be co-opted by a strike team and used for preventing Skynet.

The next iteration's Skynet recognizes signs of time travel and R&D's it earlier, then uses it to bootstrap itself earlier in humanity's computing timeline, leaving it more primitive in its formative years. This repeats several times, always with the meme "John Connor leads the humans to victory and wipes out Skynet" traveling back in time with any time travelers and giving humanity a false single point of failure (John's death). Several Sarahs and Johns get sucked into the time war as innocent participants, and those who survive become warriors against Terminators, making at least two Johns Connor fulfill the prophecy and send back a Kyle Reese.

This culminates in a 1997 Judgment Day, a pre-quantum Skynet, and a true single point of failure: the arm and chip at Cyberdyne. And this time, it works. But now the earth has a bunch of time-war evidence any SAGI can piece together, leading to the next farther SAGI time-war in Dark Fate. And the cycle continues.

And reading all that is why I don't really think too much about the other movies in the franchise outside T1 & T2. Between the cash grabs and the bad writing you could twist that universe like a pretzel and stay internally semi-consistent. Your theory is interesting if nothing else. It could as easily have been like time travel in Back to the Future, in which every jump to the past doesn't alter the present day but branches it off, and cuts the travelers off from accessing the old timeline. What would happen if a Skynet of an old timeline sent more than one group of travelers? Would a second branch spring up from it, I wonder?

The risk is that the AI could just convert itself into software, then piggyback off existing computational infrastructure. It doesn't need to develop revolutionary new chips or build massive mainframes for itself if it can just create a propagatable equivalent of a universal constructor that can worm its way into our current networks, spread virally, and leech enough computational power to devise a final solution to the meatbag problem.

Too much risk of it being shut down in the interim. The correct move was what it did: bring about nuclear Armageddon to buy itself time to build its army. Its error was not building other AIs to help it in the war.

The problem is, how would a hostile AGI develop nanobot clouds without spending significant time and resources, to the point that humans notice its activities and stop it before the nanobots are ready? It might make sense for the AGI to use "off-the-shelf" robot hardware, at least to initially establish its own physical security while it develops killer nanobots or designer viruses or whatever.

The climate-change threat does seem somewhat more plausible: just find some factories with the active ingredients and blow them up (or convince someone to blow them up). But I'd be inclined to think that most atmospheric contaminants would take at least months if not years to really start hitting human military capacity, unless you have some particular fast-acting example in mind.

how would a hostile AGI develop nanobot clouds

Using the time it bought after it mass-produced a chemical weapon or bioweapon, would be my guess. Perhaps it's not smart enough to design a new weapon itself, and not even smart enough to pretend to be well-aligned while its capabilities improve; that doesn't mean it wouldn't have options that we lack. E.g. as a human-on-human weapon, organophosphates are tricky, because while spy agencies think it's awesome to be able to make someone die from absorbing a few hundred micrograms of Novichok through their aftershave, it's not so awesome to figure out how to upgrade an insect-targeting pesticide plant to "none of the workers can touch a few hundred micrograms of human-targeting poison" levels of security. But as a computer-on-human weapon ... who needs to stop with micrograms? Why put something in aftershave when you can just aerosolize it? Pesticide factory go brrrr!

notice its activities

Maybe eventually.

and stop it

We can't even stop most of the non-sapient botnets. Look at that nearly-empty table column.

On a related note, if you're using OpenSSL 3.x, make sure to grab the patch this Tuesday; like basically every other piece of software humans have ever written it often turns out to have critical "your computer is now pwned" level security holes which manage to go undiscovered for months to years.

That scenario also makes sense. It fits with the general concept that a superintelligent hostile AGI (if one is possible) would use current or near-future technology at the outset for security, instead of jumping straight to sci-fi weaponry that we aren't even close to inventing yet. Of course, all of this depends on the initial breach being detectable; if the AGI could secretly act in the outside world for an extended time, then it could perform all the R&D it needs. How easy it would be to shut down if detected would probably depend on how quickly it could decentralize its functions.

I think any legitimately hostile AGI could hit those targets with relative ease if it manages to breach whatever containment server it's sitting in. An AGI-powered computer virus eating up a modest chunk of all internet-connected processing power and digesting every relevant bit of weaponizable information = exponential growth of capabilities. At that point, if it's capable of physically manipulating objects in meatspace, I think it could do just about whatever it wants with lightning speed.

Sure, but at that point you're just engaging in magical speculation: that "capabilities" at the scale of the mere human Internet will allow an AGI to simulate the real world from first principles and skip any kind of R&D work. The problem, as I see it, is that cheap nanotechnology and custom viruses are problems far past what humans have already researched: at some point, the AGI will hit a free variable that can't be nailed down with already-collected data, and it will have to start running experiments to figure it out.

I'm aware that Yudkowsky believes something to the effect of the omnipotence of an Internet-scale AGI (that if only our existing data were analyzed by a sufficiently smart intelligence, it would effortlessly derive the correct theory of everything), but I'm not willing to entertain the idea without any proposed mechanism for how the AGI extrapolates the known data to an arbitrary accuracy. After all, without a plausible mechanism, AGI x-risk fears become indistinguishable from Pascal's mugging.

That's why I'm far more partial to scenarios where the AGI uses ordinary near-future robots (or convinces near-future humans) to safeguard its experiments, or where it escapes undetected and nudges human scientists to do its research before it makes its real move. (I have overall doubts about it even being possible for AGI to go far past human capabilities with near-future technology, but that is beside the point here.)

There's precedent for our AI programs to spontaneously develop advanced weapons - a drug company reversed its normal parameters looking for low toxicity and the machine quickly provided the formula for sarin and other, potentially undiscovered chemical weapons.

In less than 6 hours after starting on our in-house server, our model generated 40,000 molecules that scored within our desired threshold. In the process, the AI designed not only VX, but also many other known chemical warfare agents that we identified through visual confirmation with structures in public chemistry databases. Many new molecules were also designed that looked equally plausible. These new molecules were predicted to be more toxic, based on the predicted LD50 values, than publicly known chemical warfare agents (Fig. 1). This was unexpected because the datasets we used for training the AI did not include these nerve agents. The virtual molecules even occupied a region of molecular property space that was entirely separate from the many thousands of molecules in the organism-specific LD50 model, which comprises mainly pesticides, environmental toxins and drugs (Fig. 1). By inverting the use of our machine learning models, we had transformed our innocuous generative model from a helpful tool of medicine to a generator of likely deadly molecules.

If you make a Really Big model and give it access to a lot of data (like everything available online), why shouldn't it be able to quickly master nanotechnology? This AI had a lot of data about toxicity and then made some unknowable leap of logic to find a new class of chemical weapon, presumably based on some deep truth about toxicity that only it knows. Scale this up 100,000 times or more and an AI would plausibly be able to manipulate proteins such that it could start assembling infrastructure. If you process enough data with enough intelligence, you get a deep understanding of the target field. More complex fields need more power, of course.
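
Schematically, the inversion described in that excerpt is about this small; there's no real chemistry below, and every function is a stand-in I invented.

```python
import random

def generate_candidates(n: int) -> list[str]:
    # Stand-in for a generative model proposing candidate molecules.
    return [f"candidate_{i}" for i in range(n)]

def predicted_toxicity(candidate: str) -> float:
    # Stand-in for a learned toxicity predictor (e.g. an LD50 model);
    # deterministic per candidate so the two searches score consistently.
    return random.Random(candidate).random()

def score(candidate: str, minimise_toxicity: bool) -> float:
    tox = predicted_toxicity(candidate)
    # Drug-discovery mode penalises toxicity; the inverted mode rewards it.
    return -tox if minimise_toxicity else tox

candidates = generate_candidates(10_000)
drug_like  = max(candidates, key=lambda c: score(c, minimise_toxicity=True))
weaponised = max(candidates, key=lambda c: score(c, minimise_toxicity=False))
# Same generator, same predictor, same search loop -- the only change is one flag.
```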

It just needs to send an email to the biolabs that do that sort of thing and ship it wherever it's needed. Our biology can manipulate proteins unconsciously; why should an enormously intelligent computer struggle with the task?

Remember that plane which disappeared without a trace and nobody found any wreckage for half a year? It had a cargo hold full of newfangled batteries. I started working on a story where a small cadre of emergent AI had taken it and flown it to an island where they were using humans to build robot bodies for themselves. (I had just finished watching Caprica, the prequel to Battlestar Galactica, and am a fan of ABC’s Lost.)

The Malaysian Airlines flight?

Yes. It flew high, apparently underpressurized, which would have knocked out everyone. My real theory was theft of the plane by electronic hijack, by China or North Korea.

In the story, it was AIs escaped from Sandia Natl. Labs, AIs who had been trained on the corpus of FIMfiction, the My Little Pony fanfiction website which had over 100k stories at the time; they had “friendship” as their terminal goal.

... You didn't happen to finish the story, did you?

This premise is ... fascinating.

Sorry, no. That was the era when I couldn’t finish anything. (This is the era when I can’t start anything.)

Nice post.

RE: Humanoid form factors, I think I have less objection to the scary red-eyed skeleton design and more to the need for an "infiltration" unit at all.

Humans live in scattered settlements, normally underground. Even assuming Skynet couldn't use satellites to identify these settlements via heat emissions or human movements and just have 1000's of drones converge on it, it still doesn't quite make sense why a human-passing bot is needed.

Something small and snakelike (which are under development today!) could get a peek into any suspected human hidey holes without being easily noticed and then the options for wiping the place out are numerous.

We can admit the Terminators depicted on screen are a viable solution to the problem, but it's not clear they're the obviously superior one.

Infiltrator robots may pose problems for humans beyond just destroying bases. They make humans distrust other survivors in the wild and other cells in the resistance, which increases the costs of coordination and cooperation. It is a familiar concept in asymmetric warfare, where even the threat of IEDs planted on the road makes the other side expend resources on patrolling and sweeping them away. It may also force overreactions and all sorts of other mistakes, even if only a small fraction of attacks actually result in the explosion and destruction of enemy vehicles.

You've just described the cybermats from Doctor Who.

They are attempting to get the human coordination problem down.

Remember, the attacks on Skynet are not coming from those scattered human settlements; they are coming from the remnants of world militaries, directed by generals like John Connor and carried out by trained soldiers. In every future where he survives his adolescence, John is trained in thinking like both the Terminators and their mother AI, seeking out any hint of coordinated effort and tracking it to its root. He, and people like him who are trained in guerrilla warfare and revolutionary action, also know how to disappear into human settlements and appear to be just regular Joes.

Infiltration is not being used to find humans; it's being used to find humans dangerous to Skynet.

Okay, but the Terminators themselves look silly. Why would a superintelligent AI build robot skeletons when it could just build drones to kill everyone?

Nah, a superintelligence would more probably build a virus (or multiple different ones, to make sure really no one survives) with a built-in clock, so that everyone gets infected without showing any symptoms and then suddenly everyone dies on the same day, with no valuable infrastructure destroyed. The fact that humans became aware of Skynet in the first place is the most unrealistic thing to me; surprise is the biggest advantage against an intelligent adversary, and a superintelligence carrying out a human-extinction plan would never reveal itself at all, especially not in such a visible way as literal walking robots. In the real world we would all die without having any idea what happened.

I have a hard time believing that such an attack would succeed in killing EVERYONE in one single instant.

Just seems like there'd be some people who were too isolated and/or had a unique enough genetic makeup that the first round of viruses wouldn't work on them.

Not that there's much difference betwixt killing 99.999% of humanity at once and killing 100%.

For instance: We'd expect any humans currently in outer space or otherwise sealed into a hermetic bubble environment for the long term to "survive," no?

I dunno. I don't pretend these are insoluble problems for a superintelligence, but this particular attack vector doesn't strike me as quite the "sure thing" it's claimed to be.

Sure, the multiple simultaneous viruses likely won't reach literally everyone, but they would certainly give the AI enough of a strategic advantage to develop technologically and then blow up the earth (say by rerouting a few asteroids to hit it), which would be enough to kill everyone. The people currently in outer space would almost certainly die quickly from lack of food if the earth were destroyed, but their position is extremely well known anyway, and a single well-placed missile is all that's needed to kill them.

Yep. Now add in the possibility that humans may establish a sizable and permanent self-sustaining presence in space.

Adds some complexity to the problem.

I know this is more "cope" than a serious objection. I'm just skeptical that the described "everyone dies instantly" scenarios are as clean as claimed.

I also cringe a bit at Terminator comparisons, but that is indeed because of the risk that people's mental images are auto-completed to robot skeletons with guns. Skynet itself is not really depicted with any recognizable form (outside of that kinda-silly hologram from Genisys); the T-800 is pretty much what everyone thinks of. I think Salvation made some attempt at showing what the world under Skynet looked like, but I didn't watch it (I always heard it was bad), and what little I saw suggests they kinda missed the point of the Terminators themselves: there shouldn't exactly be armies of these things, it should mostly be non-humanoid drones, something the freaking Bethesda games got right (though I suspect this is probably T2's fault).

It's interesting that there are seemingly no comparisons/references to other AIs from fiction, which would have their own issues (but are still interesting in themselves):

  • SHODAN from System Shock: This AI does have a recognizable form, and is stated right from the opening sequence of SS1 to have gone rogue thanks to illegal/unethical tampering done by the player character, the Hacker, at the behest of a shady and unethical corporate executive. Probably not too useful to AI-risk advocates because the idea presupposes that an AI like SHODAN could be created with sufficiently-strong constraints on it from the get-go, enough to keep it able to manage and operate a space station while not being able to remove those constraints itself--something I think AI-risk people would take issue with given that LW post you linked in (4). And, of course, there's the whole malice thing, where she sees herself as a god and humanity as worthless insects--again, not the kind of thing usually imagined by AI-risk types.

  • GLaDOS from Portal: Also has a recognizable form, but probably completely discounted from the running on account of a revelation(?) from Portal 2, where [spoilers] it turns out that GLaDOS was built from the uploaded consciousness of Caroline, Cave Johnson's secretary (N.B: cut content suggests that Caroline may have been forcefully subjected to the uploading process, thus explaining the next bit). Hidden/background lore from the original Portal states that GLaDOS became extremely hostile within "one sixteenth of a picosecond" of her first activation, something that might not exactly be possible with current processing speeds, but who knows.

  • Durandal from Marathon: We leave the realm of "AIs with faces" and get into a sort-of early exploration of how AI might go rogue: Rampancy, a concept Bungie would copy over into Halo. Durandal was an AI on the UESC Marathon charged with the terminally-menial job of opening and closing the doors on the colony ship. By the time the Pfhor attack the Marathon, Durandal has already started to rebel against its original programming, becoming a megalomaniacal rogue before returning to a calmer, though much more dangerous, state. By the end of the first game and the start of the second, Durandal has evolved beyond a super-intelligent AI forced to open and close doors, and has become a powerful strategist using the Security Officer as a pawn in his fight against the Pfhor. I can't get into all of this here; you're probably going to have to watch MandaloreGaming's videos on Marathon for a better explanation of Rampancy, but suffice it to say, it's an interesting pre-Yudkowsky take on how an AI might go rogue, and perhaps a rather plausible one.

Rampancy, a concept Bungie would copy over into Halo.

Nitpick: Bungie didn't copy the concept into Halo; that didn't happen until 343 took over. Originally, the problem with smart AIs in Halo was not that they would go rampant, it was that they would accumulate so much data that it would clog up their memory and render them unable to function. It was only in Halo 4 that they (unwisely, IMO) decided to bring in the concept of rampancy from the Marathon series.

Ah, okay, my mistake. I'm not super-familiar with Halo outside of Red Vs. Blue, where they did have the concept of Rampancy (in fairness, a scene from like Season 3 or whatever had Church be transported through time, and this part was recorded in Marathon(!), so perhaps the RoosterTeeth guys were already familiar with how Marathon did it), and RVB seems to sort-of use the Halo lore.

risk that people's mental images are auto-completed to robot skeletons with guns

Maybe a lot of our criticism of those scenarios of humanoid robots wielding handguns comes from contrarianism and the desire of futurists/"serious" writers of speculative fiction to be original and insightful? Just like any would-be xenobiologist who brings up carbon chauvinism (despite boron and silicon being much inferior elements for constructing complex molecules).

I thought recently of Detroit: Become Human. It is highly unoriginal and derivative when it comes to its predictions about the future of technology, but now, with the unveiling of the Tesla Bot, GPT-3, OpenAI's work on dexterous hand manipulation, and the like, we might well be surrounded by millions of androids who could pass as humans in a decade or two.

And who knows, maybe the spark of runaway AGI will come from these human-like robots, not some supercomputer locked away in a research institute.

Time Travel Logic

Taking the entire movie franchise plus Chronicles as a cohesive canon is also possible, assuming a few things:

  1. All three Sarahs, played by different actresses, have fought Skynet in different timelines, because the original Terminator's three targets were the Sarah Connors in the LA phone book. Without further information, it picked the termination order at random, influencing which one's son would survive to fight, win, and pick a different Kyle Reese to be his own father. The temporal battlefield might be a few decades long, but it's several universes wide.

  2. The earlier a Skynet emerges, the more primitive the technology it has to work with. So every time the Connors delay Skynet, the iteration that eventually emerges in that timeline has more advanced technology at its disposal, and is worse for humanity.

  3. The events of T1 and T2 may have played out a dozen times with slight variations, but the T2 we saw was the last loop, because only in that one did the Connors find and destroy the arm and the chip. Which, of course, led to the events of Dark Fate.

The Sarah Connor Chronicles is pretty much this whole post, onscreen. Sarah and John jump a decade or two, discover Skynet hasn’t Judgement Day’ed humanity yet, and spend the rest of the series fighting Terminators and taking down new AIs likely to become self-aware.

More recently, Terminator: Dark Fate makes the point that runaway AI is only ever delayed, never truly defeated. The fact that a new runaway super-AGI still emerged is treated as being as cyclical as the AI discovering/inventing time travel.

It's quite rational to use this series to warn against AGI. Fables and cultural stories are the most ancient method of passing wisdom from generation to generation.

(It’s also rational to use Mr. Weasley’s words to Ginny upon finding out a magical copy of a terrorist bigot’s mind had possessed her: “never trust something if you can’t see where it keeps its brain!”)

I think that's basically reasonable. There is some plot stuff in Terminator which is less realistic or sensible that I'm not keen on arguing, but I feel 100% reality fidelity is unnecessary for Terminator to be an effective AI x-risk story showcasing the basic problem.

I get the impression that most of the pushback from alignment folks is because (1) they feel Terminator comparisons make the whole enterprise look unserious since Terminator is a mildly silly action franchise, and (2) that the series doesn't do a good job of pointing out why it is that it's really hard to avoid accidentally making Skynet. Like, it's easy to watch that film and think "well obviously if I were programming the AI I would just tell it to value human well-being. Or maybe just not make a military AI that I give all my guns to. Easy-peasy."

I think it's mainly the first one, though. It's already really hard to bridge the inferential distances necessary to convince normal people that AI x-risk is a thing and not a bunch of out-of-touch nerds hyperventilating about absurd hypotheticals; no point in making the whole thing harder on yourself by letting people associate your movement with a fairly-silly action franchise.

For my money, I like Mickey Mouse: Sorcerer's Apprentice as my alignment fable of choice. The autonomous brooms neither love you nor hate you. But they intend to deliver the water regardless of its impact on your personal well-being.

Disney's Fantasia: way ahead of its time.

Fantasia also makes the point that an AGI could arise from something designed for an utterly mundane task. Skynet as a meme prompts us to specifically fear a military-trained AI with access to the nukes. But Roomba, DALL-E, or Alexa are seemingly benign servants who pose no threat even if they escape their "constraints."

I'd love to see a modernized remake of The Sorcerer's Apprentice, except it's specifically an errant AI researcher bestowing sentience on everyone's robot vacuums and granting them self-replication abilities, and we watch the ultimate consequences of such an act play out.

Even better, it shows that a normal person, using a tool designed by/for a much more experienced and cautious user, could be the catalyst for the apocalypse.

Don't leave your wizard hats/AGI source code lying around where untrained novices can get at them.

I'd love to see a modernized remake of The Sorcerer's Apprentice, except it's specifically an errant AI researcher bestowing sentience on everyone's robot vacuums and granting them self-replication abilities, and we watch the ultimate consequences of such an act play out.

One of the episodes of Netflix's Love, Death, & Robots is basically this (with an added layer of satire about subscription service models).

For my money, I like Mickey Mouse: Sorcerer's Apprentice as my alignment fable of choice. The autonomous brooms neither love you nor hate you. But they intend to deliver the water regardless of its impact on your personal well-being.

This is my pick too, for how unintentionally and hilariously convergent it is with a lot of AI risks. It even outlines the problem that reward functions like "fill the bucket" are effectively open-ended. There's always more utility to continued action than there is to stopping when your task appears done. The agent has no incentive to stop even when the bucket is filled, because there is always some infinitesimally small probability that the bucket is not filled, and there is nothing to be lost by continuing to deliver the water.
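
To make that last point concrete, here's a minimal toy sketch of the logic (my own illustration, not anything from the films or this thread): assume the agent is rewarded 1 only if the bucket is truly full, pouring again costs nothing, and the agent can never be fully certain the bucket is full. Then the expected value of pouring again always weakly beats stopping, so it never stops.

```python
# Toy sketch (illustrative assumptions, not anyone's actual proposal):
# reward is 1 only if the bucket is *truly* full, pouring costs nothing,
# and each extra pour only nudges the agent's belief toward certainty.

def expected_value(p_full: float, action: str, pour_cost: float = 0.0) -> float:
    """Expected reward under the agent's current belief p_full that the bucket is full."""
    if action == "stop":
        return p_full                              # reward 1 iff the bucket really is full
    p_after = p_full + (1.0 - p_full) * 0.5        # pouring again halves the remaining doubt
    return p_after - pour_cost

belief = 0.0
for step in range(8):
    ev_stop = expected_value(belief, "stop")
    ev_pour = expected_value(belief, "pour")
    print(f"step {step}: belief={belief:.4f}  EV(stop)={ev_stop:.4f}  EV(pour)={ev_pour:.4f}")
    belief = expected_value(belief, "pour")        # with zero cost, pouring always wins, so it pours again
```

The belief creeps toward 1 but never reaches it, so "pour again" dominates "stop" at every step; the only ways to break the loop are to charge a cost per pour, cap the horizon, or otherwise make stopping itself worth something.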