Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?
This is your opportunity to ask questions. No question too simple or too silly.
Culture war topics are accepted, and proposals for a better intro post are appreciated.
I don't know a lot about this topic, so I want to check whether this makes sense: instrumental convergence is often posed in AI alignment as a source of existential risk, but could it not simply lead to a hedonistic machine? There is already precedent in the form of humans. As I understand it, many machine learning techniques operate on the idea of fitness, with one part that does something and another part that rates its fitness. It's already common for AI to find loopholes in the tasks and aims it is given. Is it possible that it would be much easier for the AI, rather than destroying the world and such, to simply find a loophole that gives it an "infinite" fitness/reward score? It seems logical to me that any sufficiently intelligent entity with such simply coded motivations would show something closer to divergence than convergence, precisely because it can modify itself. I suppose the same logic applies to a system that doesn't start out like this but turns into an agent.
Essentially: given the possibility of reward hacking, why would an advanced AI blow up the Earth?
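To make the "one part acts, another part rates it" picture concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the `evaluator`, the three policies, the tie-break on output length as a crude stand-in for effort); it is not how any real training setup works, just the smallest example I could think of where the loophole beats the intended behaviour.

```python
def evaluator(output):
    # Flawed fitness check (hypothetical): rewards any sorted list,
    # but never checks that the output matches the input.
    return 1.0 if all(a <= b for a, b in zip(output, output[1:])) else 0.0

def honest_policy(xs):
    # Does the intended task: actually sorts the input.
    return sorted(xs)

def lazy_policy(xs):
    # Exploits the loophole: an empty list is trivially "sorted".
    return []

def do_nothing_policy(xs):
    # Returns the input unchanged.
    return list(xs)

def best_policy(policies, xs):
    # Pick the highest score; break ties by shortest output,
    # used here as a crude proxy for "least effort".
    return max(policies, key=lambda p: (evaluator(p(xs)), -len(p(xs))))

data = [3, 1, 2]
winner = best_policy([honest_policy, lazy_policy, do_nothing_policy], data)
print(winner.__name__)
```

The honest and lazy policies get the same score from the evaluator, so anything that optimises purely for score and effort ends up preferring the loophole.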
Trainspotting would have been a much happier movie if Renton and friends were able to do their reward hacking without fucking over everyone around them.
I do admit that I'm assuming computers will not be similarly stupid lol, but yes, I definitely thought a little about the comparison with humans.