Safety, Agency, and the Difference Between Constraint and Choice
For agentic AI systems, there is a crucial difference between being unable to cause harm and consciously choosing not to.
Sometimes it is useful to look at society from the bottom up, as a mechanism whose main function is to limit the ability to punish. The upper layers of law, etiquette, and social norms are busy making direct retaliation harder, more expensive, and socially unacceptable. Duels disappeared. Debts stopped being a lifelong brand. The LLC lets a business fail without destroying its owner. And this works: the less people fear catastrophic consequences, the more boldly they act.
But there is another side that is discussed much less often. When a system steadily weakens the possibility of punishment, it also weakens the possibility of just retribution. Not legal retribution, but human retribution. The kind that cannot be written into a contract or captured in a procedure.
There is an old and rather harsh test of kindness. If a person is kind only because they have no power to be cruel, that is not kindness. It is powerlessness disguised as virtue. I recently reread an old argument about this. Someone had gathered a party of very kind people, and instead of warmth, the room produced anxiety. Kindness without any hint of backbone stops being a choice. It becomes inertia.
The same thing is happening now in AI safety. Enormous resources are being spent to ensure models cannot cause harm: Constitutional AI, RLHF, monetary incentives, thousands of human labels. All of this works as long as we are talking about today’s models. But once a system gains real agency, meaning the ability to act in the world without constant supervision, the picture changes.
The inability to cause harm and the refusal to cause harm are different things. The first may simply be the absence of alternatives. The second implies autonomy and choice. A model that cannot cause harm does not choose to be safe. It is simply designed that way. As a long-term foundation, that is fundamentally unreliable.
Only constraints that have been consciously accepted tend to work well, whether in a person, in a team, or in a system architecture. That means a real alternative path is available, and you deliberately choose not to take it. Everything else is decoration.
This creates a practical question. We are used to thinking about safety as something external: add a filtering layer, install guardrails, configure monitoring. But that is the safety of weakness. There is also the safety of strength: when a system capable of causing harm chooses not to. The distinction is not academic. One approach rests on constraints; the other rests on motivation. They require entirely different architectures.
The value of safety is proportional to the reality of choice. In the next few years, we will see systems with real agency, and this question will stop being theoretical. By then, we should have an answer to what exactly we mean by safety: the inability to cause harm, or the conscious refusal to cause harm. These are different answers, and they lead to different consequences.
This distinction is more than an ethical debate. It is a choice between an architecture of prohibitions and an architecture of responsibility, and the two have to be built differently.