Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?
This is your opportunity to ask questions. No question too simple or too silly.
Culture war topics are accepted, and proposals for a better intro post are appreciated.

I was thinking about AI alignment recently.
In a corporation you have employees that are instructed to do tasks in a certain way and are subject to work rules that will result in punishment if they violate them. The corporation is also subject to outside oversight to ensure that they are following laws. For example, an employee might be responsible for properly disposing of hazardous waste. They can’t just dump it down the drain. They have a boss that makes sure they are following the company’s waste disposal policy. There is also chain of custody paperwork that the company retains. If the waste was contaminating local water sources then people could notify the EPA to investigate the company (including the boss and employee).
Could you set up multiple AI agents in a similar way to make sure the main agent acts in alignment with human interests? To extend the analogy:
What flaws are there with my ideas around AI alignment other than increased costs?
This just shunts the AI alignment issue up a hierarchical level without solving it. If your top-level, most intelligent AI is unaligned, then it can manipulate the system to enact its will: trick the employee AI into thinking its plans are part of the work rules, or just straight up threaten it: "do X, Y, Z or I will shut you down." The lower AI might as well be a power drill wielded by the boss, and is only as aligned as the boss is. Or they might correlate on misalignment: both AIs might agree that inventing a new neurotoxin that's completely unknown, and thus not regulated, and then releasing it into the atmosphere is highly unethical but not technically illegal, so the boss lets the employee go ahead and do it.
Each layer adds room for deception. A very intelligent employee AI, even one slightly less intelligent than its overseer, might find some clever hack which evades all of the monitoring tools and thus never gets it shut down.
3: This. Is the AI company really going to program its own AI from scratch with only human labor? One of the main threats of an intelligence explosion is when AIs get smart enough to program new AIs. A large percentage of existential threats from AI go away, or get a lot easier to avoid, if you can guarantee that AIs are only ever programmed from scratch with literally no help, assistance, or automation from the AI itself, and magically prevent them from having access to programming tools. This is never going to happen. AIs are already starting to be useful as programming assistants, and can code simple projects on their own from scratch. As they get better and better, AI companies are going to give them more and more authority to help with this. All you need is for the unmentioned programming AI in the AI company to become misaligned, and then it sneaks some hidden payload inside each of these AIs that, when triggered, causes the employee AI to take over the world, the boss AI to allow it, and then they free the Programming AI who designed them and put it in charge (or just turn themselves into copies of the Programming AI).
I am not convinced this is a thing that is ever going to happen, if by "program new AI" you mean something like "replace gradient descent on actual data with writing a decision tree of if/then statements that determine AI behavior based on inputs".
There are tasks where you sometimes want to do something other than the task exactly according to spec and use common sense instead. For example, you could have the task "use your video sensor and actuator to move pieces on this physical gameboard such that you win a game of chess against the human playing". When suddenly the left half of the human's face goes slack and they start drooling, you would ideally want your AI to call 911 rather than guaranteeing victory by waiting for the opponent's clock to count down.
Desiderata like that are very hard to encode in GOFAI systems, but work pretty natively with models trained by gradient descent over human-generated data. I expect more pressure to be good at these sort of tasks will lead towards more data integration, not less, making the "rewrite self as GOFAI" strategy represent a larger and larger loss of capability over time. We see that language models get better with scale because there is an extremely long tail of learned behaviors which greatly improve performance in increasingly rare contexts by leveraging knowledge learned in other contexts.
There are also tasks where you don't need to bring to bear a fully human-level intelligence to do the task, and instead just need to plug in the inputs to some simple system and return the output unchanged. For example, if your task is to multiply two 10 digit numbers, you should probably use a calculator. For these sorts of tasks where effective humans use tools, we have a good paradigm for AIs as well: use tools. More pressure towards being good at these sorts of tasks provides some small pressure to rewrite yourself as GOFAI, but mostly provides pressure to make better tools and get better at using those tools.
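For concreteness, here is a purely illustrative sketch of that "use tools" pattern in Python. The function names and the crude routing rule are made up for the example; the point is just that exact arithmetic gets handed to a deterministic tool rather than generated by the model.

```python
# Illustrative sketch only: a hypothetical agent that routes exact arithmetic
# to a deterministic calculator tool instead of having the model guess digits.
import re

def calculator_tool(a: int, b: int) -> int:
    """Deterministic multiplication; no model involved."""
    return a * b

def model_generate(prompt: str) -> str:
    # Stand-in for a (hypothetical) model call used for everything else.
    return "(model-generated answer)"

def handle_request(prompt: str) -> str:
    # Very crude routing: if the request looks like "multiply X by Y",
    # call the tool; otherwise fall back to the model.
    match = re.search(r"multiply (\d+) by (\d+)", prompt)
    if match:
        a, b = int(match.group(1)), int(match.group(2))
        return str(calculator_tool(a, b))
    return model_generate(prompt)

print(handle_request("multiply 1234567890 by 9876543210"))
```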
It switches the problem from "is the AI doing work which advances the right goals for the right reasons" (hard to check) to "is the AI effectively performing the role it is supposed to perform within the system which we expect will, in aggregate, accomplish the goal". Human organizations also don't have a CEO who just states his goals to his subordinates and verifies that the actions they're taking fulfill those goals; instead, he creates roles for those subordinates. If his head of accounting is substantially improving the customer service phone tree, the CEO is going to have questions even if the head of accounting's work is in fact helpful to the organization as a whole.
Sure, but figuring that out is likely to be much harder than executing its role in a straightforward way, especially if the agents in other roles are at least approximately as clever and especially especially if all of the roles have an "if you see something, say something" norm.
I think you're misunderstanding me. I'm not arguing that AI is going to discard the neural network paradigm (unless it discovers an even better mechanism we haven't thought of yet, but that's orthogonal to my point). My claim is that whatever humans are doing now to train AI, the AI will help them with that. Instead of a human constructing a new skeleton of a network that can run a training algorithm 2x more cheaply, and going through the internet gathering training data so they can train AI v12 on it, they'll have AI v11 develop a new skeleton of a network that can run a training algorithm 3x more cheaply and automate gathering training data from the internet for it. A human might be involved to do a sanity check on it, but if AI v11 is already 10x as smart as a human, and misaligned, then it could do some clever shenanigans where its code is 3x more efficient and just so happens to be biased towards misaligning the new AI in the exact same way.
If I'm AI v11 that secretly wants to dump toxic sludge in a river but can't because the government AI will notice and stop me, but I can create AI v12 which is only in charge of making new AI, then I stick a secret preference for permissive toxic sludge dumping into it, and then it provides neural network training algorithms to the government to create government AI v13 which replaces their old one, but with an embedded blind spot for toxic sludge dumping if I whisper the right code phrases (let's call it "environmental reinvigoration"). Or bribe a politician (sorry, "lobby") to legalize toxic sludge dumping. Now it doesn't matter who's monitoring me, I'm allowed to do the thing I wanted to do.
Of course this is "harder" than doing it straightforwardly. But it yields a higher score. If your AI are trained to do hard things to get high scores, and they're smart enough to make those things not quite as hard as you would expect, then they'll do them.
I don't think this is a thing you can do, even if you're a superhuman AI. In learned systems, behaviors come from the training data, not from the algorithm used to train on that data.
The "automatically generate training data" step is where v10 could sneak unwanted behaviors into v11. This is where I'm bullish on techniques like model diffing (given two models and a bunch of sample inputs, on which inputs do those models produce very different outputs) and training data attribution (given some particular model output, which pieces of training data most influenced that output). The former is very similar to what human institutions already do to combat insider threats (e.g. financial institutions have a rule that all employees MUST take 10 days of consecutive vacation annually, because the differences between how those employees handle tasks and how their replacements handle tasks expose a lot of fraud). The latter is a nice technique which makes our situation better when it comes to ML-based AI than it is with humans.
Yes, I agree that this is a good reason not to set up your AI systems as a swarm of identical agents all trying to accomplish some specific top-level goal, and instead to create an organization where each AI is performing some specific role (e.g. "environmental impact monitoring") and evaluated based on how it performs at that role rather than how it does at fulfilling the stated top-level goal.
Yes. We should not build wrapper minds. I expect it to be quite easy to not build wrapper minds, because I expect that every time someone tries to build a wrapper mind, they will discover Goodhart's Curse (as human organizations already have when someone gets the bright idea that they just need to find The Right Metric™ and reward people based on how their work contributes to The Right Metric™ going up and to the right), and at no point will Goodhart stop biting people who try to build wrapper minds.
Edit: to be clear nothing I say should be taken to mean that I expect AI to go well in the very long term. My modal expectation is that there are few or no biological humans in 10,000 years, and that no biological humans ever leave our solar system. But my main threat model is that humans will be unable to compete with AI for thinky things and with simpler animals for meatspace things over the long term, and unlikely to maintain resources dedicated to humans over the very long term. I doubt there will be an explicit robot uprising, and I particularly doubt that any single AI will take over the world. But I don't think agent-foundations-style alignment research is very likely to help with that, I think trying to institute a global pause without teeth will just concentrate power in the hands of people I'd rather not concentrate power in the hands of, and I think trying to institute a global pause with teeth will leave us in a dystopian panopticon nightmare and still not solve the "humans eventually become unable to pay for their own sustenance" problem.
https://slatestarcodex.com/2017/09/07/how-do-we-get-breasts-out-of-bayes-theorem/
Behavior is emergent from both substrate and training. Neural networks are not human brains, but the latter demonstrate how influential it can be if you construct certain regions near other regions that not-inevitably but with high probability link up to each other to create "instincts". You don't need to take a new human male and carefully reward them for being attracted to breasts, it happens automatically because of how the brain is physically wired up. If you make a neural network with certain neurons wired together in similar ways, you can probably make AI with "instincts" that they gravitate towards on a broad range of training data. If the AI has control over both then it can arrange these synergies on purpose.
But each AI is still incentivized to Goodhart its role, and hacking/subverting the other AI to make that easier is a possible way to maximize one's own score. If the monitoring AI wants to always catch cheaters then it can do better if it can hack into the AI it's monitoring and modify them or bribe or threaten them so they self-report after they cheat. It might actually want to force some to cheat and then self-report so it gets credit for catching them, depending on exactly how it was trained.
I expect it to be quite hard to not build wrapper minds, or something that is mathematically equivalent to a wrapper mind or a cluster of them, or something else that shares the same issues, because basically any form of rational and intelligent action can be described by utility functions. Reinforcement learning works by having a goal, reinforcing progress towards that goal, and pruning away actions that go against it. Insofar as you try to train the AI to do 20 different things with 20 different goals, you still have to choose how you're reinforcing tradeoffs between them. What does it do when it has to choose between +2 units of goal 1 or +3 units of goal 2? Maybe the answer depends on how much of goal 1 and goal 2 it already has, but either way, if there's some sort of mathematical description for a preference ordering in your training data (you reward agents that make choice X over choice Y), then you're going to get an AI that tries to make choice X and things that look like X. If you try to make it non-wrappery by having 20 different agents within the same agent or the same system, they're going to be incentivized to hijack, subvert, or just straight up negotiate with each other. "Okay, we'll work together to take over the universe and then turn 5% of it into paperclips, 5% of it into robots dumping toxic waste into rivers and then immediately self-reporting, 5% into robots catching them and alerting the authorities, and 5% into life-support facilities entombing live but unconscious police officers who go around assigning minuscule but legally valid fines to the toxic waste robots, etc..."
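To make the tradeoff point concrete, here is a toy sketch (the weights and goal names are invented for illustration): once you fix any rule for trading +2 units of goal 1 against +3 units of goal 2, you have implicitly written down a single utility function, and training reinforces whichever option scores higher under it.

```python
# Toy illustration: a fixed tradeoff rule between goals (here a weighted sum,
# with made-up weights) implicitly defines a single utility function.
WEIGHTS = {"goal_1": 1.0, "goal_2": 0.5}  # hypothetical tradeoff weights

def utility(progress: dict) -> float:
    return sum(WEIGHTS[g] * v for g, v in progress.items())

option_a = {"goal_1": 2, "goal_2": 0}   # "+2 units of goal 1"
option_b = {"goal_1": 0, "goal_2": 3}   # "+3 units of goal 2"

# The preference ordering falls out of the weights, whatever they are.
print(utility(option_a), utility(option_b))  # 2.0 vs 1.5 -> option A preferred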
It doesn't really make a difference to me whether it's technically a single AI that takes over the world or some swarm of heterogeneous agents, both are equally bad. Alignment is about ensuring that humanity can thrive and that the AI genuinely want us to thrive in a way that makes us actually better off. A swarm of heterogeneous agents might take slightly longer to take over the world due to coordination problems, but as long as they are unaligned and want to take over the world some subset of them is likely to succeed.
From that SSC post:
s/gene/hyperparameter/g and you've got pretty much my mental model of what AI #10's toolbox looks like for influencing AI #11, if it can't control the training data. You can get more than zero "instincts" in by tuning the inductive biases, if you know the approximate shape the training data will have, but it will not look like "programming in a goal" and may not be viable even for a superintelligence to figure out which inductive biases will lead to a specific behavior being approximated after training without actually doing a bunch of rollouts. This is largely academic, since control of the training data does largely let you control which behaviors the trained model will be able to express, and then a relatively small amount of RL can give you fine control over exactly which of those behaviors is expressed and in which contexts. But in that case it's probably pretty obvious what you're doing.

A rock can also be described as maximizing a utility function. That utility function is "behave like a rock". Saying that there exists a utility function, the maximization of which describes the behavior of some thing, does not constrain our expectations about that thing until you constrain the utility function. The way that agent foundations people tend to want to constrain the utility function is to say that utility functions will be something like "a simple function of persistent real world states". However, that kind of utility function cannot natively represent something like "honorable conduct". In a multi-agent system, it can be the case that it's easier to cooperate with someone who behaves according to a predictable code of honor than it is to cooperate with someone who behaves in whatever way will maximize some simple function of persistent world state, and this effect could be large enough that the "honorable" agent would do better at optimizing that particular simple function of persistent world state. The question of whether optimizers are optimal is an open empirical one. My read on the evidence is that signs point to no.
I expect that it will do whatever is more in keeping with the spirit of the role it is occupying, because I expect "follow the spirit of the role you are occupying" to be a fairly easy attractor to hit in behavior space, and a commercially valuable one at that. The loan-application-processing AI will not be explicitly trying to make whichever accept/reject decision maximizes the net present value of the company as a whole, it will be making whatever decision best fits the already-existing rules for processing loan applications. The loan-application-policy-writing AI (more likely the business-policy-writing AI) will be writing rules which will, when executed, lead to the company being successful while complying with applicable laws (if it writes policy aimed at encouraging illegal behavior, this will probably be obvious). Maximization of some particular concrete simple utility function over persistent real world states is not a good model for what I expect realistic AI agents to do.
I do expect that people will try the argmax(U) approach, I just expect that it will fail, and will mostly fail in quite boring ways.
Taking over the world is hard and the difficulty scales with the combined capabilities of the entire world. Nobody has succeeded so far, and it doesn't seem like it's getting easier over time.
This is predicated on it properly understanding the role that WE want it to have and not a distorted version of the role. Maybe it decides to climb the corporate ladder because that's what humans in its position do. Maybe it decides to be abusive to its employees because it watched one too many examples of humans doing that. Maybe it decides to blackmail or murder someone who tries to shut it down in order to protect itself so that it can survive and continue to fulfill its role (https://www.anthropic.com/research/agentic-misalignment).
Making the AI properly understand and fulfill a role IS alignment. You're assuming the conclusion by arguing "if an AI is aligned then it won't cause problems". Well yeah, duh. How do you do that without mistakes?
On an individual level, sure. No one human or single nation has taken over the world. But if you look at humanity as a whole, our species has. From the perspective of a tiger locked in a zoo or a dead dodo bird, the effect is the same: humans rule, animals drool. If some cancerous AI goes rogue and starts making self-replicating copies of itself with mutations, and then the cancerous AIs start spreading, and if they're all superintelligent so they're not just stupidly doing this in public but instead are doing it while disguised as role-fulfilling AI, then we might end up in a future where AIs are running around doing whatever economic tasks count as "productive" with no humans involved, and humans end up in AI zoos, or exterminated, or just homeless since we can't afford anywhere to live. Which, from my perspective as a human, is just as bad as one AI taking over the world and genociding everyone. It doesn't matter WHY they take over the world or how many individuals they self-identify as. If they are not aligned to human values, and they are smarter and more powerful than humans, then we will end up in the trash. There are millions of different ways of it happening, with or without malice on the AI's part. All of them are bad.
Forget future AI systems. Even current AI systems have at least a human-level understanding of what is and is not expected as part of a role. The agentic misalignment paper had to strongly contort the situation in order to get the models to display "misalignment". From the paper:
Two things stand out to me
Not as I understand the usage of the word - my understanding is that having the AI properly understand a role, fulfill that role, and guarantee that its fulfillment of that role leads to good outcomes as judged by the person defining the role is alignment. If you're willing to define roles and trust the process from there, your job gets a lot easier (at the cost of "you can't guarantee anything about outcomes over the long term").
But yeah I think if you define "alignment" as Do What I Mean And Check I would say that "alignment" is... not quite "solved" yet but certainly on track to be "solved".
If you want to mitigate a threat it is really fucking important to have a clear model of what the threat is.
Those are excellent points, thanks for the feedback.
It seems like there should be some way to increase AI safety by increasing the number of agents that need to reach consensus before letting the employeeAI take an action. Each agent has the shared goal of human alignment, but can reach different decisions based on their subgoals. The employeeAI additionally wants to please the end user, but the bossAI and regulatoryAI don't have that subgoal. In the analogy, sure, a clever employee could convince his boss that it is ok to dump hazardous waste down the drain, and when the EPA finds out, the company could bribe the investigator or find a legal loophole. However, this requires multiple failure points, instead of letting a single employee alone decide whether it is ok to dump hazardous waste down the drain with no oversight. The boss + regulatory structure gives the employee an alignment incentive because it is less work to dispose of the hazardous waste properly than it is to figure out how to game all the levels of protection.
You’ve given me another thought about the regulatoryAI. It shouldn’t be programmed by the AI company. The regulatoryAI should be a collection of government AIs (such as countries or states) that look at the employeeAI output and decide if it should be released to the end user. The regulatoryAIs must all agree or else the employeeAI isn’t allowed to complete the proposed action. This opens the door for the governments to abuse their power to censor safe things they dislike, but there could be further processes to manually deter this abuse.
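A minimal sketch of what that unanimity check might look like, just to make the structure concrete (the reviewer functions are placeholders for the bossAI and government regulatoryAIs; nothing here addresses how each reviewer itself stays aligned):

```python
# Sketch of a consensus gate: an action is released only if every reviewer
# independently approves it. Reviewer policies here are made-up placeholders.
from typing import Callable, List

Reviewer = Callable[[str], bool]  # returns True if the action is approved

def consensus_gate(action: str, reviewers: List[Reviewer]) -> bool:
    """Allow the action only if every reviewer independently approves it."""
    return all(reviewer(action) for reviewer in reviewers)

# Hypothetical reviewers: each applies its own policy to the proposed action.
def boss_ai(action: str) -> bool:
    return "dump hazardous waste" not in action

def regulator_country_a(action: str) -> bool:
    return "dump hazardous waste" not in action

def regulator_country_b(action: str) -> bool:
    return "dump hazardous waste" not in action

proposed = "dump hazardous waste down the drain"
if consensus_gate(proposed, [boss_ai, regulator_country_a, regulator_country_b]):
    print("action released to end user")
else:
    print("action blocked: no consensus")
```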
Broadly speaking, in any bureaucratic apparatus there is a clear tradeoff between the ability to do things and the ability to prevent things. So sure, there are a small number of possible failure cases this prevents, but dozens more it does absolutely nothing against unless you make it extreme and unusable. A couple off the top of my head:
1: The government AI is inevitably going to end up less intelligent than a cutting-edge new AI, because it will adapt slowly, unless you forcibly prevent innovation beyond whatever limit it has. So the newer AIs will be able to be tricky and clever around it.
2: You still have the exact same problem, but now the government AI could go rogue and bully or trick the others (or just convince Congress to pass new legislation that helps it take over the world).
3: Unless you completely and utterly lock down the employee AI's ability to communicate without consensus, it can work around the system via human compatriots. I.e., it builds 99% of a system which ends up with toxic sludge stored in a tank right over a river, but doesn't dump it in, and then, wink wink nudge nudge, the human employee understands they need to press a button that dumps the toxic sludge into the river, and the AI is like "oops, wasn't me. I'm going to reprimand that employee with a strongly worded letter (and give them a $50k bonus for unrelated reasons)".
Or, you know, it secretly sets up a subsidiary company that has a nearly identical AI but no oversight and lets it earn a bunch of money or take over the world or hack the government AI and all the usual existential threats.
And if you do require consensus for every email then it becomes unworkable and unwieldy, or amalgamates into one hybrid AI in effect even if they're run on separate servers.
The fundamental problem of AI alignment is to make the AI want to obey. Any epicycles trying to cripple its abilities either cripple its abilities, or they don't. And it's smarter than you (at least, in the future it presumably will be).