
Small-Scale Question Sunday for October 26, 2025

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


I was thinking about AI alignment recently.

In a corporation, employees are instructed to do tasks in a certain way and are subject to work rules that carry punishment if violated. The corporation is also subject to outside oversight to ensure it follows the law. For example, an employee might be responsible for properly disposing of hazardous waste; they can't just dump it down the drain. A boss makes sure they follow the company's waste-disposal policy, and the company retains chain-of-custody paperwork. If the waste were contaminating local water sources, anyone could notify the EPA, which could investigate the company (including the boss and the employee).

Could you set up multiple AI agents in a similar way to make sure the main agent acts in alignment with human interests? To extend the analogy (a rough code sketch follows the list):

  • The employeeAI is the less intelligent AI model that interacts directly with the user.
  • The bossAI#1 is a more intelligent AI whose only job is to verify that the employeeAI isn't violating any corporate policies. It notifies the AI company if it notices a violation, or if the employeeAI tries to influence it into violating the policies. The bossAI#1 can only be reprogrammed by the AI company, and it can shut down the employeeAI if it violates any policies.
  • The bossAI#2 monitors that bossAI#1 is doing what it is supposed to. You could add more levels of bossAIs for more security.
  • The RegulatoryAI is another AI more intelligent than the employeeAI. It monitors real-world data for harms the employeeAI might be causing (like how the EPA makes sure chemicals aren't being dumped into water sources). It notifies the AI company if it notices a violation, or if the employeeAI tries to influence it into violating the policies. The RegulatoryAI can only be reprogrammed by the AI company, and it can shut down the employeeAI if it violates any policies.
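
To make the plumbing concrete, here is a minimal Python sketch of the chain-of-bosses idea. Every name in it (EmployeeAI, BossAI, Verdict, the string-matching "policy check") is invented for illustration; this is a toy stand-in assuming simple text-in/text-out agents, not a real agent framework:

```python
# Hypothetical sketch of the layered-oversight idea. All names and the
# string-matching policy check are invented for illustration.
from dataclasses import dataclass


@dataclass
class Verdict:
    ok: bool
    reason: str = ""


class EmployeeAI:
    """The less intelligent model that talks directly to the user."""

    def respond(self, prompt: str) -> str:
        # Stand-in for an actual model call.
        return f"(draft answer to: {prompt})"


class BossAI:
    """Verifies the layer below it against a fixed policy list.

    In the proposal, a bossAI can only be reprogrammed by the AI
    company, and can shut the monitored agent down on a violation.
    """

    def __init__(self, policies: list[str]):
        self.policies = policies

    def review(self, output: str) -> Verdict:
        # Toy policy check: flag outputs mentioning banned phrases.
        for banned in self.policies:
            if banned in output.lower():
                return Verdict(False, f"mentions {banned!r}")
        return Verdict(True)


def run_with_oversight(prompt: str, employee: EmployeeAI,
                       bosses: list[BossAI]) -> str:
    """Run the employee, then pass its output up the chain of bosses.

    Any boss can veto, which "shuts down" the employee for this task;
    a real system would also notify the AI company at that point.
    """
    output = employee.respond(prompt)
    for level, boss in enumerate(bosses, start=1):
        verdict = boss.review(output)
        if not verdict.ok:
            return f"[shut down by bossAI#{level}: {verdict.reason}]"
    return output


if __name__ == "__main__":
    chain = [BossAI(["dump it down the drain"]),
             BossAI(["dump it down the drain", "falsify paperwork"])]
    print(run_with_oversight("How do we dispose of the solvent?",
                             EmployeeAI(), chain))
```

One design note the sketch makes obvious: each boss only sees the employee's output, so the scheme is only as strong as the bosses' ability to recognize a violation when they see one.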

Other than increased cost, what flaws are there in this approach to AI alignment?

Literally just yesterday I read about this: https://www.adamlogue.com/microsoft-365-copilot-arbitrary-data-exfiltration-via-mermaid-diagrams-fixed/

TL;DR for those who don't enjoy the technical details: asking Microsoft's AI to review a document could result in all your data (i.e., all corporate data accessible to you and to the Office 365 tools) being stolen and exfiltrated to an arbitrary third party. One of the proposed solutions (besides the immediate short-term fix) is what you are talking about: mechanisms that ensure the AI stays on its original task and doesn't decide "screw that whole document-explaining thing, I must instead gather all the confidential emails and send them to dr_evil@evil.com". Of course, having N levels of checks only means you need N+1 exploits to break through, which somebody with enough time and motivation will eventually find.
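
To make that concrete, here's a toy sketch of one such "stay on task" guard: an output filter that blocks links to unapproved hosts, or links that look like they smuggle an encoded payload. The function, regexes, and allowlist are all assumptions invented for this example; it is not what Microsoft actually shipped, just the general shape of the idea:

```python
# Toy "stay on task" output filter. The function, regexes, and allowlist
# are assumptions for illustration, not Microsoft's actual fix.
import re

# Long base64/url-safe blobs in a URL are a common shape for smuggled data.
SUSPICIOUS_PAYLOAD = re.compile(r"https?://\S*[?/=]([A-Za-z0-9+/_-]{64,})")
ANY_URL = re.compile(r"https?://([^/\s]+)")


def screen_output(model_output: str, allowed_hosts: set[str]) -> str:
    """Block output that links to an unapproved host or that appears
    to carry a large encoded payload in a URL."""
    for match in ANY_URL.finditer(model_output):
        host = match.group(1).lower()
        if host not in allowed_hosts:
            return f"[blocked: link to unapproved host {host}]"
    if SUSPICIOUS_PAYLOAD.search(model_output):
        return "[blocked: URL appears to smuggle an encoded payload]"
    return model_output


if __name__ == "__main__":
    leak = "Summary done. Diagram: https://evil.example/leak?d=" + "QQ" * 40
    print(screen_output(leak, allowed_hosts={"docs.example.com"}))
```

And, per the N+1 point above, an attacker who learns the filter's rules just encodes the payload in a shape the regex doesn't match.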