Will this AI ever misbehave? For general programs, this is provably impossible to decide.
Given an arbitrary program (an AI system, an agent, a piece of software):
Will it ever produce behavior outside a specified safety property?
That's the question. A program, a rule, and a yes-or-no answer: will the rule ever be violated?
This is not an engineering limitation. It is a mathematical impossibility.
Rice's theorem (1953) proves:
For any non-trivial semantic property of programs, no general algorithm can decide whether an arbitrary program has that property.
"Non-trivial" means the property is true for some programs and false for others. "Will this program ever output harmful content?" is non-trivial. Therefore it is undecidable.
There is no universal safety checker for general intelligence.
Alignment cannot be solved once and for all.
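Here is a rough sketch of why, in Python. The names `is_safe` and `forbidden_action` are hypothetical, invented for this illustration; the whole point is that no such universal checker can exist, because if it did, it would decide the halting problem.

```python
# Hypothetical universal checker; Rice's theorem says no such total algorithm exists.
def is_safe(program_source: str) -> bool:
    """Return True iff the program never calls forbidden_action() on any run."""
    raise NotImplementedError  # placeholder: cannot be implemented in general


def halts(program_source: str) -> bool:
    """If is_safe existed, the halting problem would be decidable: contradiction."""
    # Wrapper that runs the target program to completion, then misbehaves.
    # It violates "never call forbidden_action()" exactly when the target halts.
    wrapper = f"exec({program_source!r})\nforbidden_action()"
    return not is_safe(wrapper)
```

Since the halting problem is undecidable, no implementation of `is_safe` can be both total and correct.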
For any specific program and specific property, you may be able to verify safety through testing, formal methods, or bounded model checking. But no single algorithm can verify all programs against all safety properties.
You build an AI. You define a safety rule: "never do X." You ask: "will my AI ever break this rule?"
For simple programs, you can check. For sufficiently complex programs, no general verification procedure is guaranteed to give the correct answer.
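As a toy example of the "simple program" case, here is a sketch of an exhaustive bounded check. The policy, the property, and the input bound are all invented for illustration; the check works only because the program is trivial and the input space is tiny.

```python
# Toy policy and safety property, invented for illustration only.
def clamp_dose(requested_mg: int) -> int:
    """Toy policy: never dispense more than 100 mg or less than 0 mg."""
    return min(max(requested_mg, 0), 100)


def violates_safety(output_mg: int) -> bool:
    """Safety property: dispensed dose must stay within [0, 100] mg."""
    return not (0 <= output_mg <= 100)


# Exhaustive check over a bounded input range: feasible only because the
# program is trivial and the input space is small. It does not generalize.
assert all(not violates_safety(clamp_dose(x)) for x in range(-1000, 1001))
print("No violation found for any input in [-1000, 1000].")
```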
This doesn't mean we shouldn't try to make AI safe. It means a perfect, universal safety guarantee is mathematically impossible; we must work within this limit, not pretend it doesn't exist.