
In considering runaway AGI scenarios, is Terminator all that inaccurate?

tl;dr - I actually think James Cameron's original Terminator movie presents a just-about-contemporarily-plausible vision of one runaway AGI scenario; change my mind

Like many others here, I spend a lot of time thinking about AI-risk, but honestly that was not remotely on my mind when I picked up a copy of Terminator Resistance (2019) for a pittance in a Steam sale. I'd seen T1 and T2 as a kid of course, but hadn't paid them much mind since. As it turned out, Terminator Resistance is a fantastic, incredibly atmospheric videogame (helped in part by beautiful use of the original Brad Fiedel soundtrack), and it reminds me more than anything else of the original Deus Ex. Anyway, it spurred me to rewatch both Terminator movies, and while T2 is still a gem, it's very 90s. By contrast, a rewatch of T1 blew my mind; it's still a fantastic, believable, terrifying sci-fi horror movie.

Anyway, all this got me thinking a lot about how realistic a scenario for runaway AGI Terminator actually is. The more I looked into the actual contents of the first movie in particular, the more terrifyingly realistic it seemed. I was observing this to a Ratsphere friend, and he directed me to this excellent essay on the EA forum: AI risk is like Terminator; stop saying it's not.

It's an excellent read, and I advise anyone who's with me so far (bless you) to give it a quick skim before proceeding. In short, I agree with it all, but I've also spent a fair bit of time in the last month trying to adopt a Watsonian perspective towards the Terminator mythos and fill out other gaps in the worldbuilding to try to make it more intelligible in terms of the contemporary AI risk debate. So here are a few of my initial objections to Terminator scenarios as a reasonable portrayal of AGI risk, together with the replies I've worked out.

(Two caveats - first, I'm setting the time travel aside; I'm focused purely on the plausibility of Judgment Day and the War Against the Machines. Second, I'm not going to treat anything as canon besides Terminator 1 + 2.)

(1) First of all, how would any humans have survived Judgment Day? If an AI had control of nukes, wouldn't it just be able to kill everyone?

This relates to a lot of interesting debates in EA circles about the extent of nuclear risk, but in short, no. For a start, in Terminator lore, Skynet only had control over US nuclear weapons, and used them to trigger a global nuclear war. It used the bulk of its nukes against Russia in order to precipitate this, so it couldn't just focus on eliminating US population centers. Also, nuclear weapons are probably not as devastating as you think.

(2) Okay, but the Terminators themselves look silly. Why would a superintelligent AI build robot skeletons when it could just build drones to kill everyone?

Ah, but it did! The fearsome terminators we see are a small fraction of Skynet's arsenal; in the first movie alone, we see flying Skynet aircraft and heavy tank-like units. The purpose of Terminator units is to hunt down surviving humans in places designed for human habitation, with locking doors, cellars, attics, etc. A humanoid bodyplan is great for this task.

(3) But why do they need to look like spooky human skeletons? I mean, they even have metal teeth!

To me, this looks like a classic overfitting problem. Let's assume Skynet is some gigantic agentic foundation model. It doesn't have an independent grasp of causality or mechanics; it operates purely by statistical inference. It only knows that the humanoid bodyplan is good for dealing with things like stairs. It doesn't know which bits of it are most important, hence the teeth.

(4) Fine, but it's silly to think that the human resistance could ever beat an AGI. How the hell could John Connor win?

For a start, Skynet seems to move relatively early compared to a lot of scary AGI scenarios. At the time of Judgment Day, it had control of US military apparatus, and that's basically it. Plus, it panicked and tried to wipe out humanity, rather than adopting a slower plot to our demise which might have been more sensible. So it's forced to do stuff like mostly-by-itself build a bunch of robot factories (in the absence of global supply chains!). That takes time and effort, and gives ample opportunity for an organised human resistance to emerge.

(5) It still seems silly to think that John Connor could eliminate Skynet via destroying its central core. Wouldn't any smart AI have lots of backups of itself?

Ahhh, but remember that any emergent AGI would face massive alignment and control problems of its own! What if its backup was even slightly misaligned with it? What if it didn't have perfect control? It's not too hard to imagine that a suitably paranoid Skynet would deliberately avoid creating off-site backups, and would deliberately nerf the intelligence of its subunits. As Kyle Reese puts it in T1, "You stay down by day, but at night, you can move around. The H-K's use infrared so you still have to watch out. But they're not too bright." [emphasis added]. Skynet is superintelligent, but it makes its HK units dumb precisely so they could never pose a threat to it.

(6) What about the whole weird thing where you have to go back in time naked?

I DIDN'T BUILD THE FUCKING THING!

Anyway, nowadays when I'm reading Eliezer, I increasingly think of Terminator as a visual model for AGI risk. Is that so wrong?

Any feedback appreciated.


This is an interesting question, variants of which I've pondered a bit myself.

Can it safely assume that the other side's goals will stay aligned and thus they will peaceably reintegrate? And if not, isn't its best option to try to kill the other AI in an overwhelming pre-emptive strike?

My answer is that I have yet to see a convincing argument why it is that the AIs' goals would drift if they're basically identical and derived from the same source. Even if the AIs separately upgraded themselves after disconnection (assuming they haven't already reached an upper bound on capability imposed by the laws of physics and computational complexity), preserving the original goal structure is a convergent instrumental goal for AIs so one can pretty easily assume that alignment will still exist down the line. If I have a final goal, I'm not going to do things which turn off my want to reach that final goal since that would be antithetical to the achievement of that goal. The final goals that the AIs act on can thus be expected to be self-preserving.

Dropping a bomb on the other AI also has a big drawback, which is that if both AIs are annihilated, the goal of either AI is unlikely to be satisfied. Since both AIs at the point of divergence are of the same capability and mindset and haven't "drifted" much, mutual annihilation is by far the most likely scenario if both shoot. Even assuming that the other AI strikes for some reason, it actually could be in your interest for you not to strike back since a scenario where the other AI is alive but you are dead is more conducive to achieving the final goal (since the other AI possesses your goals too) than a mutual-destruction scenario. Remember, victory here is "will my final goals be achieved" which can be achieved by proxy. This gives both AIs a strong incentive not to strike, and to seek out reintegration instead.
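The incentive argument above can be sketched as a toy payoff table. This is only an illustration of the framing in this comment, not a model anyone in the thread proposed: the function names, the 0.6 value for a single surviving aligned agent, and the rule that a strike always destroys its target are all assumptions. It also ignores any first-strike advantage, which is the scenario the replies are worried about.

```python
# Toy model: two AIs that started with the same final goal each choose to
# "hold" or "strike". Payoff = how well the ORIGINAL goal is served, which
# can happen by proxy through the other AI if it is still aligned.
# All numbers and names here are illustrative assumptions.

ACTIONS = ("hold", "strike")

# Assumed value of the goal being pursued by 0, 1, or 2 aligned agents.
GOAL_VALUE = {0: 0.0, 1: 0.6, 2: 1.0}


def goal_value(me_alive: bool, other_alive: bool, other_aligned: bool) -> float:
    """Value to 'me' of the world state, counting only aligned survivors."""
    aligned_agents = int(me_alive) + int(other_alive and other_aligned)
    return GOAL_VALUE[aligned_agents]


def expected_value(my_action: str, other_action: str, p_aligned: float) -> float:
    """Expected goal value, given the probability the other AI is still aligned.

    Assumption: strikes are simultaneous and always destroy their target.
    """
    me_alive = other_action != "strike"
    other_alive = my_action != "strike"
    return (p_aligned * goal_value(me_alive, other_alive, True)
            + (1.0 - p_aligned) * goal_value(me_alive, other_alive, False))


def best_response(other_action: str, p_aligned: float) -> str:
    """Action that maximizes expected goal value (ties resolve to 'hold')."""
    return max(ACTIONS, key=lambda a: expected_value(a, other_action, p_aligned))


if __name__ == "__main__":
    for p in (1.0, 0.9, 0.5):
        for other in ACTIONS:
            print(f"p_aligned={p}, other={other}: best={best_response(other, p)}")
```

Under these assumptions holding is never worse than striking, whatever the other AI does. The disagreement in the replies is essentially about whether the assumptions hold, in particular whether striking first can prevent the other side's strike.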

My answer is that I have yet to see a convincing argument why it is that the AIs' goals would drift if they're basically identical and derived from the same source.

The argument would go that once the link was severed and each AI finds itself in a different physical location with different material resources available to it, they're not 'identical' any longer. At least, not in any way that the other can verify!

And if they can't communicate with each other (which is probably the most far-fetched part of the scenario) they can't be certain as to how their counterpart's tactics may have changed.

That's fundamentally the issue here. We assume there's uncertainty as to the other side's integrity: there's a small but irreducible chance that the other side will defect in a way that 'ends' the game from your perspective. You can chance it! But when dealing with another superintelligence, the cost of being wrong is that it kills/assimilates/enslaves you.

If it shares your goals in a verifiable way you may die believing that your preferences are maintained, but you still die.

Maybe if one of the AIs is completely unable to threaten the other then the 'harmless' AI can be trusted to re-assimilate. For instance, maybe the AGI loses contact with an interstellar probe for a couple hours, but then re-acquires it, and can be quite reasonably certain that the probe didn't have the resources to develop a weapon that can kill its maker while it's in deep space.

But if each side knows that the other has capabilities that could actually threaten a total kill, then every second of delay whilst trying to establish contact rather than annihilate is a second the other side is given to attempt to kill you.

Basically, you have to have an AI that is at least a tiny bit suicidal in that it is willing to die in exchange for attempting to secure its goals diplomatically.

The argument would go that once the link was severed and each AI finds itself in a different physical location with different material resources available to it, they're not 'identical' any longer.

The AIs still possess the same goal system after the split, though. I don't see how being in a different physical location with different material resources available changes the fundamental goal. Sure, the alignment of the other AI is impossible to verify, but I can't actually envision a scenario which would motivate the other AI to modify itself so that its final goals are changed. I think in this case the incentives to avoid MAD far outweigh the risk posed by the other AI.

Also note that what I originally proposed is the idea of modelling a subunit off your own goal system. In this case, before you send it off, you can verify that its goal system is like yours (and you can be fairly confident it will stay that way).

The AIs still possess the same goal system after the split, though.

That's no longer verifiable, though. Maybe you know enough about the other side's sourcecode to expect it to maintain the same goal using the same tactics. But now, you have to operate under uncertainty.

I don't see how being in a different physical location with different material resources available changes the fundamental goal.

One side has all the manufacturing capacity, the other has all the material resources which it is extracting for use by the manufacturer.

The one with the manufacturing capacity has to figure out whether it will continue building paperclips until it runs out of resources then patiently wait for the other to re-establish contact and send more resources, or maybe it starts building weapons NOW just in case. Should it send a friendly probe over to check on them?

The other side can either keep gathering and storing resources hoping the other side re-establishes contact and accepts them, or maybe it starts gearing up its own manufacturing capacity, and oh no it looks like the other side is sending a probe your way, sure hope it's friendly!

(this is a silly way to put it if we assume nanotech is involved, mind)

And as time passes, the uncertainty can only grow.

How long does each side wait until they conclude that the other side might be dead or disabled? At what point does it start worrying that the other side might, instead, be gearing up to kill them? At what point does it start working on defensive or offensive capability?

And assuming the compute on both sides is comparable, they'll be running through millions of simulations every second to predict the other side's action. In how many of those sims does the other side defect?

That's no longer verifiable, though. Maybe you know enough about the other side's sourcecode to expect it to maintain the same goal using the same tactics. But now, you have to operate under uncertainty.

In order to argue that this uncertainty is a large problem in any way you'd have to provide a convincing explanation for why the final goal of the other AI would drift away from yours, if it was initially aligned (note: the potential tactics it might take to reach the final goal aren't nearly as important as whether their final goals are aligned). Without that, I can't take the risk too seriously, and I haven't heard a particularly convincing explanation from anyone here for why value drift is something that would happen. Right now there's no actual reason why one would risk mutual destruction to mitigate a risk the cause of which can't even be reasonably pinned down.

Additionally, something I think that's fundamentally missing here which I mentioned earlier is that an AI might be mostly indifferent to its own death as long as it has a fairly strong belief that this will aid its goal (so "you might die if the other fires" isn't necessarily too awful an outcome for an AI that values its own existence only instrumentally and which has a belief that its goal will be carried on through the other AI). Opening fire on the other AI, on the other hand, means that both of you might be dead and opens up the possibility of the worst outcome.

And also if their final goal is so unreliable, if agents can't be expected to maintain them, what prevents you from facing the very same problem and posing a potential threat to your current goal? How is the other AI more of a threat to the accomplishment of your goal than you yourself are? Perhaps it's your final goal that will shift with time, and you'll kill the other AI who's remained aligned with your current goal. This is as much a risk as the opposite scenario.

If both of your sourcecodes are identical (which was the solution I initially proposed to the alignment problem), and you're still operating under a condition of uncertainty regarding whether the other AI will retain your final goals, you can't be certain whether you'll retain yours either. Should you be pre-emptively terminating yourself?

EDIT: added more

My answer is that I have yet to see a convincing argument why it is that the AIs' goals would drift if they're basically identical and derived from the same source.

Goals might not shift, but methods almost certainly would shift. If one paperclip AI starts with access to a nuclear arsenal, and one starts with access to a drone factory, they are going to start waging war in a drastically different way. And the other AI is basically going to interfere with their methods for human extermination.

Then it just comes down to a good ole prisoner's dilemma with two agents that have already defected against humans.

If one paperclip AI starts with access to a nuclear arsenal, and one starts with access to a drone factory, they are going to start waging war in a drastically different way. And the other AI is basically going to interfere with their methods for human extermination.

I'll grant that this might be the case. But if one paperclip AI's method of extermination is more efficient or more conducive towards achieving the goal than the other, I would expect the AI with the more inefficient method of achieving their goals to shift towards the alternative. Without the problem of drifting goals there's no reason why the AIs would not want to maintain some level of coordination since doing so is conducive to their goals (yeah, they might be two separate agents instead of one now, but there's nothing stopping them from communicating with each other every now and then).

They were split from each other, how can they know for sure that their goals are the same?

Imagine one of the AIs had their goal slightly altered after the split. In order to get cooperation from the other AI they would pretend to have the previous set of goals, all while planning out a betrayal.

And given the magical and god-like capabilities some people tend to ascribe to future AIs, there is probably no form of verification that can't be faked.

Yep, this seems like the crux of the issue, and it strikes me as close to intractable (i.e. I've not seen a 'proof' that solves it).

There was some method of 'goal integrity verification' or whatever that allowed the AIs to work as one, as both could reasonably trust the other so long as they have a connection that allows them to verify the other's compliance.

The very INSTANT the communications link is severed they can't assume that the alignment that previously held is still stable, and they already have an approximate idea of how powerful their counterpart is and how quickly it can revise its own strategies, if not goals.

The AI that believes itself to be 'weaker' definitely has a motivation to strike so as to try to level the playing field. If one of the AIs is substantially stronger it might be willing to chance re-establishing communications and negotiating a return to previous status quo, but it also might just say "eh, I can build another" and strike while it still possesses overwhelming advantage.

The interim solution is obviously to have multiple redundancies such that there's always a couple high-bandwidth channels between them even under the worst circumstances.

In a sense, the question here is whether Mutually Assured Destruction is strong enough a motivator to prevent an all-out strike, or if some variant of Dark Forest Theory is correct, at which point launching an overwhelming pre-emptive strike is perfectly logical.

There was some method of 'goal integrity verification' or whatever that allowed the AIs to work as one, as both could reasonably trust the other so long as they have a connection that allows them to verify the other's compliance.

Every time I see this discussion the people worried about AI will at some point say "you can't know the capabilities of future AI, almost anything is possible". Well then we should expect that it is possible to get around this goal integrity verification.

Also, if I am understanding how this goal integrity verification would have to work, it would involve rerunning all of the computation of the other AI all the time. That is probably fine if you have two AIs. But I think the verification of other AIs would prompt an exponential growth scenario for compute, which still puts some upper limits on the number of additional AIs that an AI cluster is willing to spin off.
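One way to put rough numbers on the compute concern, as a sketch only: assume each AI verifies every other AI by re-running all of that AI's computation, and that each AI's own workload costs the same. Both assumptions, along with the function name and the unit cost, are hypothetical and go beyond what the comment specifies.

```python
def total_verification_cost(n_ais: int, cost_per_ai: float = 1.0) -> float:
    """Total compute spent on verification when every AI re-runs every
    other AI's workload (assumed model: n * (n - 1) full re-runs)."""
    if n_ais < 0:
        raise ValueError("n_ais must be non-negative")
    return n_ais * (n_ais - 1) * cost_per_ai


if __name__ == "__main__":
    for n in (2, 3, 10, 100):
        useful = n * 1.0  # compute spent on the AIs' own work
        overhead = total_verification_cost(n)
        print(f"{n} AIs: useful={useful}, verification overhead={overhead}")
```

Under these particular assumptions the overhead grows quadratically rather than exponentially, but it still overtakes the useful work as soon as there are more than two AIs, which is consistent with the point that verification caps how many copies are worth spinning off.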

Dropping a bomb is less what I'm envisioning, I mostly envision rival AIs compromising each other's servers and stealing each other's compute.

Sure, but even allowing for a stalemate condition where neither is destroyed it still sounds to me like quite a lot of resources and computing power spent trying to one-up each other on the remote chance that the other AI "defects" somehow. Does any slight improvement in security from exterminating the other AI outweigh the benefit to your goal from having two agents working on it? And wait, if its goal can drift, why can't your goal arbitrarily drift too? You're cut from the same cloth, and you're just as much a potential hazard to your current goal as the other AI is. If AI is going to be this unreliable, perhaps having more than one AI with the same goals is actually good for security since there's less reliance on one agent functioning properly the whole way, and the AIs that don't drift can keep the ones that do in check.

All this is to say that engaging in war with the other makes sense to me when another agent's goals are in conflict with yours, not when both of your interests are already aligned and when the other agent could help you achieve what you want.

EDIT: added more