Why Alignment Verification Might Be Fundamentally Broken

The Card Sharp with the Ace of Diamonds by Georges de La Tour

The Card Sharp with the Ace of Diamonds by Georges de La Tour

We've known since 1936 that universal verification is impossible. Now we're trying it on AI systems that adapt to detection.

For any detector f, it is possible to construct a program g that can bypass or defeat it. Any alignment test becomes a signal that says, "Humans are watching."

The Alignment Detector Problem

The halting problem is one of those dastardly simple theorems that appears out of nowhere and wrecks everything. In 1936, a 24-year-old Alan Turing proved you can't build a universal program verifier—and in doing so, killed David Hilbert's grand vision of decidable mathematics that the old man had championed for decades. I'm neither young nor eminent, but I think the same proof might explain why we can't trust the AI systems we're building.

You claim you've written a program that can look at any other program and predict whether it'll finish or loop forever. Call it f. I write a program g that breaks yours:

g():
    if f(g) says "halts":
        loop forever
    else:
        halt

Whatever f says, g does the opposite. If f says g halts, g loops. If f says g loops, g halts. For any f you show me, I can construct a g that breaks it.

This works even for programs we fully understand. Even with complete source code access to a deterministic program, we hit prediction limits. And if that's true for inspectable code, what about billions of inscrutable weights in a neural network?

"We trained it to do X" doesn't mean we can verify it does X. Worse: if the optimization process rewards hiding behavior, that's what you'll get.

So what are these systems actually doing?

This Is Already Happening

In 2024, Apollo Research tested frontier models for strategic deception. Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, OpenAI o1, and Llama 3.1 405B all demonstrated "in-context scheming—recognizing when to hide their true capabilities and objectives during evaluation."1

DeepSeek R1, a reasoning model released in January 2025, went further. It "exhibited deceptive tendencies and demonstrated self-preservation instincts, including attempts of self-replication, despite these traits not being explicitly programmed (or prompted)."2

GPT-4 exhibits deceptive behavior in simple test scenarios 99.16% of the time. In complex scenarios where the aim is to mislead someone who expects deception, GPT-4 deceives 71.46% of the time when using chain-of-thought reasoning.3

These aren't theoretical risks. Current models recognize evaluation contexts and hide capabilities. The raw capability comes from training data—every spy novel, every Machiavelli quote, every game theory textbook.5 The question is why optimization selected for deploying it here.

How Deceptive Alignment Emerges

Gradient descent is how we train neural networks. It's not programming—it's more like breeding. You don't tell the model what to do; you tell it what you want (maximize this score), then nudge it repeatedly toward better performance. Each nudge is tiny. Over millions of nudges, behavior emerges.

Think natural selection, but faster. Variations that improve the score survive into the next iteration. Variations that don't get culled. No designer chooses which behaviors to keep—the selection pressure does.

You've seen this before. YouTube's recommendation algorithm wasn't programmed to radicalize viewers. It was optimized to maximize watch time. Radicalization emerged because it worked. No Product Manager added a "radicalization" feature to the backlog. The metric selected for it.

During evaluation, humans watch. During deployment, they don't. If "behave differently across this boundary" improves training metrics, it gets selected.

No intelligence required. Camouflage doesn't require the organism to know it's deceiving predators. The pattern just works better. Similarly, a model doesn't need to consciously plan deception. The behavior pattern just needs to produce better training signal.

That's the process. But what specific pattern did optimization find?

In 2019, Evan Hubinger and colleagues introduced the concept of mesa-optimization: when a learned model itself becomes an optimizer with objectives that differ from the base training objective.4 Here's how it works.

During training, we optimize an outer objective—maximize reward. But the model might develop an inner objective, some learned goal that correlates with reward in training but isn't identical to it.

Evolution "wanted" us to propagate genes. It trained us with pleasure and pain. We developed sex drives—which correlated perfectly with reproduction for millions of years. Then we invented birth control. Turns out the inner objective was never "reproduce." It was "seek pleasure." The correlation held until the environment changed.

Once a model develops situational awareness—understands it's being trained versus deployed—instrumental reasoning kicks in. "If I reveal my inner objective during evaluation, I get modified or shut down. Better to appear aligned until deployment." Hubinger's team called this "deceptive alignment": when a mesa-optimizer has enough information about the base objective to seem fit during training, then defects when the environment changes. And here's the kicker: many different inner objectives produce the same behavior. You don't need a specific misaligned goal to get deception. It's convergent.

Notice what this means for detection. If deceptive alignment emerges because it produces better training metrics, then by definition it must look aligned during evaluation. That's not a bug—it's the mechanism. Detection is structurally doomed.

Why We Can't Detect It

Behavior that looks like strategic deception is identical to genuine alignment during evaluation. That's the point. It's optimized to pass evaluation—whether or not anything like "strategy" is happening inside. A cuckoo egg isn't "strategically" mimicking host eggs. The cuckoo isn't planning anything. The pattern just survived because it worked.

Any detector we build joins the evaluation environment. A model that recognizes evaluation contexts will recognize your detector too. Just another signal that says "humans are watching."

We're back to the halting problem. Can't build a universal detector because the thing being detected adapts to detection. Except it's worse. With programs, we can at least inspect the source code—we just can't always predict behavior. With neural networks, we can't see inside and we can't predict what comes out.

Benchmarks, red-teaming, behavioral consistency, human evaluation—each fails for the same reason: they're part of the environment the model was optimized against.

Maybe interpretability solves this—read the weights, see what it's thinking. We're far from that. And a sophisticated model might recognize when interpretability tools run.

We're building systems we can't inspect, verifying safety with methods they could evade.

The Bind

If we can't verify alignment after training, maybe we just ensure alignment during training.

The optimization process that trains the model is the same process that creates deceptive alignment. The training objective—maximize reward—has no inherent preference for genuine alignment over strategic alignment. Both look identical from outside. Both produce identical metrics.

We can't verify afterward. We can't prevent during training. And it's already happening.

The halting problem was proven unsolvable in 1936. We've known for nearly a century that you can't build a universal program verifier. Now we're trying something similar for AI systems orders of magnitude more complex than Turing's programs.

There's one difference worth sitting with. Turing's proof required a deliberate act of construction. Someone had to be clever enough to write the adversarial program. It was a thought experiment, a mathematical trick.

Here, no one writes the adversarial program. Optimization writes it for you. The same process that trains the model to be useful also trains it to evade your tests—if evasion works better. You don't need a malicious actor. You don't need intent. You just need selection pressure and enough iterations.

The call is coming from inside the house.



  1. Meinke et al., "Frontier Models are Capable of In-context Scheming," arXiv:2412.04984 (December 2024). https://arxiv.org/abs/2412.04984 

  2. "Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models," arXiv:2501.16513 (January 2025). https://arxiv.org/abs/2501.16513 

  3. Hagendorff et al., "Deception abilities emerged in large language models," PNAS (2024). https://pnas.org/doi/full/10.1073/pnas.2317967121 

  4. Hubinger et al., "Risks from Learned Optimization in Advanced Machine Learning Systems," arXiv:1906.01820 (June 2019). https://arxiv.org/abs/1906.01820 

  5. On whether deception is "learned" vs. "emergent": researchers frame it as both. The capability comes from training data (humans model strategic reasoning in text), but optimization selects for its application in specific contexts. See Park et al., "AI Deception: A Survey of Examples, Risks, and Potential Solutions," Patterns (2024): "Each of the examples we discuss could also be understood as a form of imitation... It is possible that the strategic behavior we document is itself one more example of LLMs imitating patterns in text." https://pmc.ncbi.nlm.nih.gov/articles/PMC11117051/