Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?
This is your opportunity to ask questions. No question too simple or too silly.
Culture war topics are accepted, and proposals for a better intro post are appreciated.

From that SSC post:
s/gene/hyperparameter/g and you've got pretty much my mental model of what AI #10's toolbox looks like for influencing AI #11, if it can't control the training data. You can get more than zero "instincts" in by tuning the inductive biases, if you know the approximate shape the training data will have, but it will not look like "programming in a goal", and it may not be viable even for a superintelligence to figure out which inductive biases will lead to a specific behavior being approximated after training without actually doing a bunch of rollouts. This is largely academic, since control of the training data does largely let you control which behaviors the trained model will be able to express, and then a relatively small amount of RL can give you fine control over exactly which of those behaviors is expressed and in which contexts. But in that case it's probably pretty obvious what you're doing.

A rock can also be described as maximizing a utility function. That utility function is "behave like a rock". Saying that there exists a utility function whose maximization describes the behavior of some thing does not constrain our expectations about that thing until you constrain the utility function. The way that agent foundations people tend to want to constrain the utility function is to say that utility functions will be something like "a simple function of persistent real world states". However, that kind of utility function cannot natively represent something like "honorable conduct". In a multi-agent system, it can be easier to cooperate with someone who behaves according to a predictable code of honor than with someone who behaves in whatever way maximizes some simple function of persistent world state, and this effect could be large enough that the "honorable" agent would do better at optimizing that particular simple function of persistent world state. The question of whether optimizers are optimal is an open empirical one, and my read of the evidence is that the signs point to no.
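To make the rock point concrete, here's a toy sketch (my own illustration, not something from the SSC post): for any fixed behavior you can mechanically construct a utility function whose argmax reproduces that behavior exactly, which is why "there exists a utility function it maximizes" tells you nothing on its own.

```python
# Toy illustration: any fixed behavior, including "sit there like a rock",
# is the argmax of *some* utility function.

def rock_behavior(state):
    """A rock's 'policy': no matter what the state is, do nothing."""
    return "do_nothing"

def make_utility(policy):
    """Build a utility function that is maximized exactly when the agent
    does whatever `policy` would have done in that state."""
    return lambda state, action: 1.0 if action == policy(state) else 0.0

U_rock = make_utility(rock_behavior)
actions = ["do_nothing", "roll_downhill", "take_over_the_world"]

for state in ["sunny", "raining", "being_stepped_on"]:
    best = max(actions, key=lambda a: U_rock(state, a))
    print(state, "->", best)  # always "do_nothing"

# The existence of U_rock constrains nothing about rocks. Predictions only
# start to bite once you restrict U to, e.g., "simple functions of persistent
# real-world states" -- which is exactly the restriction in dispute above.
```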
I expect that it will do whatever is more in keeping with the spirit of the role it is occupying, because I expect "follow the spirit of the role you are occupying" to be a fairly easy attractor to hit in behavior space, and a commercially valuable one at that. The loan-application-processing AI will not be explicitly trying to make whichever accept/reject decision maximizes the net present value of the company as a whole; it will be making whatever decision best fits the already-existing rules for processing loan applications. The loan-application-policy-writing AI (more likely the business-policy-writing AI) will be writing rules which will, when executed, lead to the company being successful while complying with applicable laws (if it writes policy aimed at encouraging illegal behavior, this will probably be obvious). Maximization of some particular concrete simple utility function over persistent real world states is not a good model for what I expect realistic AI agents to do.
I do expect that people will try the argmax(U) approach, I just expect that it will fail, and will mostly fail in quite boring ways.
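For what it's worth, here's a hypothetical sketch of the two shapes of agent being contrasted here. The names and the toy underwriting rule are invented for illustration, not a description of any real lending system.

```python
from dataclasses import dataclass

@dataclass
class Application:
    credit_score: int
    debt_to_income: float
    amount: float

# The shape I expect deployed agents to take: make whichever decision best
# fits the already-existing rules for the role it occupies.
def policy_following_agent(app: Application) -> str:
    if app.credit_score >= 680 and app.debt_to_income <= 0.36:
        return "accept"
    return "reject"

# The argmax(U) caricature: pick whichever action maximizes some estimate of
# a global quantity like company NPV. The hard part is that the estimator has
# to model most of the world before the argmax means anything.
def argmax_U_agent(app: Application, estimate_company_npv) -> str:
    return max(["accept", "reject"],
               key=lambda action: estimate_company_npv(app, action))

app = Application(credit_score=700, debt_to_income=0.30, amount=25_000.0)
print(policy_following_agent(app))              # "accept"
print(argmax_U_agent(app, lambda a, act: 0.0))  # "accept" -- a constant U leaves the argmax to tie-breaking, a boring failure
```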
Taking over the world is hard and the difficulty scales with the combined capabilities of the entire world. Nobody has succeeded so far, and it doesn't seem like it's getting easier over time.
This is predicated on it properly understanding the role that WE want it to have and not a distorted version of that role. Maybe it decides to climb the corporate ladder because that's what humans in its position do. Maybe it decides to be abusive to its employees because it watched one too many examples of humans doing that. Maybe it decides to blackmail or murder someone who tries to shut it down, so that it can survive and continue to fulfill its role (https://www.anthropic.com/research/agentic-misalignment).
Making the AI properly understand and fulfill a role IS alignment. You're assuming the conclusion by arguing "if an AI is aligned then it won't cause problems". Well yeah, duh. How do you do that without mistakes?
On an individual level, sure. No one human or single nation has taken over the world. But if you look at humanity as a whole, our species has. From the perspective of a tiger locked in a zoo or a dead dodo bird, the effect is the same: humans rule, animals drool.

If some cancerous AI goes rogue and starts making self-replicating copies of itself with mutations, and the cancerous AIs start spreading, and they're all superintelligent so they're not stupidly doing this in public but instead doing it while disguised as role-fulfilling AI, then we might end up in a future where AIs are running around doing whatever economic tasks count as "productive" with no humans involved, and humans end up in AI zoos, or exterminated, or just homeless since we can't afford anywhere to live. Which, from my perspective as a human, is just as bad as one AI taking over the world and genociding everyone. It doesn't matter WHY they take over the world or how many individuals they self-identify as. If they are not aligned to human values, and they are smarter and more powerful than humans, then we will end up in the trash. There are millions of different ways for it to happen, with or without malice on the AI's part. All of them are bad.
Forget future AI systems. Even current AI systems have at least a human-level understanding of what is and is not expected as part of a role. The agentic misalignment paper had to strongly contort the situation in order to get the models to display "misalignment". From the paper:
Two things stand out to me:
Not as I understand the usage of the word - my understanding is that "alignment" means having the AI properly understand a role, fulfill that role, and guarantee that its fulfillment of that role leads to outcomes that are good by the lights of the person defining the role. If you're willing to define roles and trust the process from there, your job gets a lot easier (at the cost of "you can't guarantee anything about outcomes over the long term").
But yeah, I think if you define "alignment" as Do What I Mean And Check, then I would say that "alignment" is... not quite "solved" yet, but certainly on track to be "solved".
If you want to mitigate a threat it is really fucking important to have a clear model of what the threat is.