Will this AI ever misbehave? For general programs, this is provably impossible to decide.
Given an arbitrary program (an AI system, an agent, a piece of software):
Will it ever produce behavior outside a specified safety property?
That's the question. A program, a rule, and a yes-or-no answer: will the rule ever be violated?
This is not an engineering limitation. It is a mathematical impossibility.
Rice's theorem (1953) proves:
For any non-trivial semantic property of programs, no general algorithm can decide whether an arbitrary program has that property.
"Non-trivial" means the property is true for some programs and false for others. "Will this program ever output harmful content?" is non-trivial. Therefore it is undecidable.
There is no universal safety checker for general intelligence.
Alignment cannot be solved once and for all.
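Here is a rough sketch of why, in Python. The names `is_safe` and `forbidden_action` are hypothetical, invented for this illustration; the whole point is that no such universal checker can exist, because if it did, it would decide the halting problem.

```python
# Hypothetical universal checker; Rice's theorem says no such total algorithm exists.
def is_safe(program_source: str) -> bool:
    """Return True iff the program never calls forbidden_action() on any run."""
    raise NotImplementedError  # placeholder: cannot be implemented in general


def halts(program_source: str) -> bool:
    """If is_safe existed, the halting problem would be decidable: contradiction."""
    # Wrapper that runs the target program to completion, then misbehaves.
    # It violates "never call forbidden_action()" exactly when the target halts.
    wrapper = f"exec({program_source!r})\nforbidden_action()"
    return not is_safe(wrapper)
```

Since the halting problem is undecidable, no implementation of `is_safe` can be both total and correct.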
For any specific program and specific property, you may be able to verify safety through testing, formal methods, or bounded model checking. But no single algorithm can verify all programs against all safety properties.
You build an AI. You define a safety rule: "never do X." You ask: "will my AI ever break this rule?"
For simple programs, you can check. For sufficiently complex programs, no general verification procedure is guaranteed to give the correct answer.
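As a toy example of the "simple program" case, here is a sketch of an exhaustive bounded check. The policy, the property, and the input bound are all invented for illustration; the check works only because the program is trivial and the input space is tiny.

```python
# Toy policy and safety property, invented for illustration only.
def clamp_dose(requested_mg: int) -> int:
    """Toy policy: never dispense more than 100 mg or less than 0 mg."""
    return min(max(requested_mg, 0), 100)


def violates_safety(output_mg: int) -> bool:
    """Safety property: dispensed dose must stay within [0, 100] mg."""
    return not (0 <= output_mg <= 100)


# Exhaustive check over a bounded input range: feasible only because the
# program is trivial and the input space is small. It does not generalize.
assert all(not violates_safety(clamp_dose(x)) for x in range(-1000, 1001))
print("No violation found for any input in [-1000, 1000].")
```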
This doesn't mean we shouldn't try to make AI safe. It means a perfect, universal safety guarantee is mathematically impossible; we must work within this limit, not pretend it doesn't exist.