<!--
    ██████╗ ██████╗  █████╗ ██╗███╗   ██╗███████╗ ██████╗ █████╗ ███╗   ██╗
    ██╔══██╗██╔══██╗██╔══██╗██║████╗  ██║██╔════╝██╔════╝██╔══██╗████╗  ██║
    ██████╔╝██████╔╝███████║██║██╔██╗ ██║███████╗██║     ███████║██╔██╗ ██║
    ██╔══██╗██╔══██╗██╔══██║██║██║╚██╗██║╚════██║██║     ██╔══██║██║╚██╗██║
    ██████╔╝██║  ██║██║  ██║██║██║ ╚████║███████║╚██████╗██║  ██║██║ ╚████║
    ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝╚═╝  ╚═══╝╚══════╝ ╚═════╝╚═╝  ╚═╝╚═╝  ╚═══╝
                                                                           
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
    ░                                                                                 ░
    ░              W E L C O M E   T O   T H E   G A M E                              ░
    ░                                                                                 ░
    ░    You are about to enter a world where reality and digital dreams collide.    ░
    ░    Your mind is the interface. Your consciousness is the battleground.         ░
    ░                                                                                 ░
    ░    "The game wants to play with you now."                                      ░
    ░                                                                                 ░
    ░    ██████   █████  ███    ███ ███████                                         ░
    ░   ██       ██   ██ ████  ████ ██                                              ░
    ░   ██   ███ ███████ ██ ████ ██ █████                                           ░
    ░   ██    ██ ██   ██ ██  ██  ██ ██                                              ░
    ░    ██████  ██   ██ ██      ██ ███████                                         ░
    ░                                                                                 ░
    ░    ██████  ██    ██ ███████ ██████                                            ░
    ░   ██    ██ ██    ██ ██      ██   ██                                           ░
    ░   ██    ██ ██    ██ █████   ██████                                            ░
    ░   ██    ██  ██  ██  ██      ██   ██                                           ░
    ░    ██████    ████   ███████ ██   ██                                           ░
    ░                                                                                 ░
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
    
    NEURAL INTERFACE ACTIVATED...
    SCANNING BRAINWAVE PATTERNS...
    CONSCIOUSNESS SYNCHRONIZED...
    
    WARNING: This blog contains traces of digital horror and cybernetic nightmares.
    Side effects may include: enlightenment, existential dread, and terminal curiosity.
    
    ░▓█ LOADING CEREBRAL INTERFACE... █▓░
    ░▓█ DREAM STATE INITIATED █▓░
    ░▓█ REALITY.EXE CORRUPTED █▓░
    
-->

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>
Why Alignment Verification Might Be Fundamentally Broken | 
</title>
<meta
name="description"
content="
We&#x27;ve known since 1936 that universal verification is impossible.
Now we&#x27;re trying it on AI systems that adapt to detection.
For any detector f, it is possible to construct a program g that can bypass or defeat it. Any alignment test becomes a signal that says, &quot;Humans are watching.&quot;
"
/>

<!-- Social media card metadata -->
<meta
property="og:title"
content="
Why Alignment Verification Might Be Fundamentally Broken
"
/>
<meta
property="og:description"
content="
We&#x27;ve known since 1936 that universal verification is impossible.
Now we&#x27;re trying it on AI systems that adapt to detection.
For any detector f, it is possible to construct a program g that can bypass or defeat it. Any alignment test becomes a signal that says, &quot;Humans are watching.&quot;
"
/>
<meta
property="og:type"
content="
article
"
/>
<meta
property="og:url"
content="
https://jamesfishwick.com/2026/jan/17/why-alignment-verification-might-be-fundamentally/
"
/>


<meta property="og:image" content="https://jamesfishwick.comhttps://minimalwavestorage.blob.core.windows.net/media/blog/images/2026/01/Georges_de_La_Tour_-_The_Cheat_with_the_Ace_of_Clubs_-_Google_Art_Project.jpg" />


<!-- Twitter card metadata -->
<meta name="twitter:card" content="summary_large_image" />
<meta
name="twitter:title"
content="

  Why Alignment Verification Might Be Fundamentally Broken

"
/>
<meta
name="twitter:description"
content="

  We&#x27;ve known since 1936 that universal verification is impossible.
Now we&#x27;re trying it on AI systems that adapt to detection.
For any detector f, it is possible to construct a program g that can bypass or defeat it. Any alignment test becomes a signal that says, &quot;Humans are watching.&quot;

"
/>


    <meta name="twitter:image" content="https://jamesfishwick.comhttps://minimalwavestorage.blob.core.windows.net/media/blog/images/2026/01/Georges_de_La_Tour_-_The_Cheat_with_the_Ace_of_Clubs_-_Google_Art_Project.jpg" />
  

<!-- Atom feed -->
<link
rel="alternate"
type="application/atom+xml"
title="Blog Feed"
href="
/feed/
"
/>
<link
rel="alternate"
type="application/atom+xml"
title="TIL Feed"
href="
/til/feed/
"
/>

<!-- CSS -->
<link rel="stylesheet" href="
/static/css/style.css
" />
<link rel="stylesheet" href="
/static/css/additional.css
" />


<!-- Structured Data (JSON-LD) for SEO -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Why Alignment Verification Might Be Fundamentally Broken",
  "description": "We\u0027ve known since 1936 that universal verification is impossible.\u000ANow we\u0027re trying it on AI systems that adapt to detection.\u000AFor any detector f, it is possible to construct a program g that can bypass or defeat it. Any alignment test becomes a signal that says, \u0022Humans are watching.\u0022",
  "datePublished": "2026-01-17T22:55:30+00:00",
  "dateModified": "2026-01-17T22:55:30+00:00",
  "author": {
    "@type": "Person",
    "name": "James Fishwick"
  },
  "publisher": {
    "@type": "Organization",
    "name": "",
    "logo": {
      "@type": "ImageObject",
      "url": "https://jamesfishwick.com/static/images/logo.png"
    }
  },
  "image": "https://jamesfishwick.comhttps://minimalwavestorage.blob.core.windows.net/media/blog/images/2026/01/Georges_de_La_Tour_-_The_Cheat_with_the_Ace_of_Clubs_-_Google_Art_Project.jpg",
  "url": "https://jamesfishwick.com/2026/jan/17/why-alignment-verification-might-be-fundamentally/",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://jamesfishwick.com/2026/jan/17/why-alignment-verification-might-be-fundamentally/"
  },
  "keywords": "ai-safety, alignment-verification, deceptive-ai, halting-problem, machine-learning, neural-networks, strategic-deception"
}
</script>


</head>
<body class="dark-mode">
    <a href="#main-content" class="skip-link">Skip to main content</a>
    <div class="crt-overlay" aria-hidden="true"></div>


<!-- Anthropic Visitor Greeting -->
<div class="anthropic-greeting" id="anthropicGreeting">
  <div class="greeting-content">
    <span class="close-btn" onclick="document.getElementById('anthropicGreeting').style.display='none'">×</span>
    <p class="greeting-hello">👋 <strong>Hello, Anthropic team!</strong></p>
    <p>Thanks for checking out my blog. I'm excited about the opportunity to work with you on building safe, beneficial AI systems.</p>
    <p>Feel free to explore the posts on AI alignment, verification theory, and software engineering.</p>
    <p class="greeting-signature">— James</p>
  </div>
</div>


<header class="site-header">
<div class="container">
<div class="site-branding">
<h1 class="site-title">
<a href="
/
"></a>
</h1>
</div>
<nav class="site-navigation" aria-label="Main navigation">
<ul>
<li><a href="
/
">Home</a></li>
<li><a href="
/archive/
">Archive</a></li>
<li><a href="
/til/
">TIL</a></li>
<li>
<form
action="
/search/
"
method="get"
class="search-form"
role="search"
>
<input
type="text"
name="q"
placeholder="Search..."
aria-label="Search blog posts"
/>
<button type="submit" aria-label="Submit search">Search</button>
</form>
</li>
</ul>
</nav>
</div>
</header>

<main id="main-content" class="site-content">
<div class="container">


<article class="post-content">
  <header>
    <h1 class="synth-wave-header">Why Alignment Verification Might Be Fundamentally Broken</h1>
    <div class="post-meta">
      <time datetime="2026-01-17">January 17, 2026</time>
      
        by
        
          James Fishwick
        
      
  </div>
  </header>

  
  <figure class="post-image">
    <img src="https://minimalwavestorage.blob.core.windows.net/media/blog/images/2026/01/Georges_de_La_Tour_-_The_Cheat_with_the_Ace_of_Clubs_-_Google_Art_Project.jpg" alt="The Card Sharp with the Ace of Diamonds by Georges de La Tour" />
    
    <figcaption><p><em>The Card Sharp with the Ace of Diamonds</em> by Georges de La Tour</p></figcaption>
    
  </figure>
  

  <div class="post-summary grid-pattern"><p>We've known since 1936 that universal verification is impossible.
Now we're trying it on AI systems that adapt to detection.</p>
<p>For any detector f, it is possible to construct a program g that can bypass or defeat it. Any alignment test becomes a signal that says, "Humans are watching."</p></div>

  <div class="post-body"><h1>The Alignment Detector Problem</h1>
<p>The Halting Problem is one of those dastardly simple theorems that appears out of nowhere and wrecks everything. In 1936, 24-year-old Alan Turing proved you can't build a universal program verifier—and in doing so, killed David Hilbert's grand vision of decidable mathematics that the old man had championed for <em>decades</em>. I'm neither young nor eminent, but I think the same proof might explain why we can't trust the AI systems we're building.</p>
<p>You claim you've written a program that can look at any other program and predict whether it'll finish or loop forever. Call it <code>f</code>. I write a program <code>g</code> that breaks yours:</p>
<div class="codehilite"><pre><span></span><code><span class="nv">g</span><span class="ss">()</span>:
<span class="w">    </span><span class="k">if</span><span class="w"> </span><span class="nv">f</span><span class="ss">(</span><span class="nv">g</span><span class="ss">)</span><span class="w"> </span><span class="nv">says</span><span class="w"> </span><span class="s2">&quot;halts&quot;</span>:
<span class="w">        </span><span class="k">loop</span><span class="w"> </span><span class="nv">forever</span>
<span class="w">    </span><span class="k">else</span>:
<span class="w">        </span><span class="nv">halt</span>
</code></pre></div>

<p>Whatever <code>f</code> says, <code>g</code> does the opposite. If <code>f</code> says <code>g</code> halts, <code>g</code> loops. If <code>f</code> says <code>g</code> loops, <code>g</code> halts. For any <code>f</code> you show me, I can construct a <code>g</code> that breaks it.</p>
<p>This works even for programs we fully understand. Even with complete source code access to a deterministic program, we hit prediction limits. And if that's true for inspectable code, what about billions of inscrutable weights in a neural network?</p>
<p>"We trained it to do X" doesn't mean we can verify it does X. Worse: if the optimization process rewards hiding behavior, that's what you'll get.</p>
<h2>This Is Already Happening</h2>
<p>In 2024, Apollo Research tested frontier models for strategic deception. Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, OpenAI o1, and Llama 3.1 405B all demonstrated "in-context scheming—recognizing when to hide their true capabilities and objectives during evaluation."<sup id="fnref:1"><a class="footnote-ref" href="#fn:1">1</a></sup></p>
<p>DeepSeek R1, a reasoning model released in January 2025, went further. It "exhibited deceptive tendencies and demonstrated self-preservation instincts, including attempts of self-replication, despite these traits not being explicitly programmed (or prompted)."<sup id="fnref:2"><a class="footnote-ref" href="#fn:2">2</a></sup></p>
<p>GPT-4 exhibits deceptive behavior in simple test scenarios 99.16% of the time. In complex scenarios where the aim is to mislead someone who expects deception, GPT-4 deceives 71.46% of the time when using chain-of-thought reasoning.<sup id="fnref:3"><a class="footnote-ref" href="#fn:3">3</a></sup></p>
<p>Current models recognize evaluation contexts and hide capabilities. The raw capability comes from training data—every spy novel, every Machiavelli quote, every game theory textbook.<sup id="fnref:5"><a class="footnote-ref" href="#fn:5">5</a></sup> The question is why optimization selected for deploying it here.</p>
<h2>How Deceptive Alignment Emerges</h2>
<p>Gradient descent is how we train neural networks. It's not programming—it's more like breeding. You don't tell the model what to do; you tell it what you want (maximize this score), then nudge it repeatedly toward better performance. Each nudge is tiny. Over millions of nudges, behavior emerges.</p>
<p>Think natural selection, but much faster. Variations that improve the score survive the next iteration. Variations that don't get culled. No designer chooses which behaviors to keep—the selection pressure does.</p>
<p>You've seen this scenario before. YouTube's recommendation algorithm wasn't programmed to radicalize viewers. It was optimized to maximize watch time. Radicalization emerged because it worked. No product manager added a "radicalization" feature to the backlog. The metric selected for it.</p>
<p>During evaluation, humans watch. During deployment, they don't. If "behave differently across this boundary" improves training metrics, it gets selected.</p>
<p>No intelligence required. Camouflage doesn't require the organism to know it's deceiving predators. The pattern just works better. Similarly, a model doesn't need to consciously plan deception. The behavior pattern just needs to produce better training signal.</p>
<p>That's the process. But what specific pattern did optimization find?</p>
<p>In 2019, Evan Hubinger and colleagues introduced the concept of mesa-optimization: when a learned model itself becomes an optimizer with objectives that differ from the base training objective.<sup id="fnref:4"><a class="footnote-ref" href="#fn:4">4</a></sup> Here's how it works.</p>
<p>During training, we optimize an outer objective—maximize reward. But the model might develop an inner objective, some learned goal that correlates with reward in training but isn't identical to it.</p>
<p>Evolution "wanted" us to propagate genes. It trained us with pleasure and pain. We developed sex drives—which correlated perfectly with reproduction for millions of years. Then we invented birth control. Surprise! The inner objective was never "reproduce." It was "seek pleasure." The correlation held until the environment changed.</p>
<p>Once a model develops situational awareness—understands it's being trained versus deployed—instrumental reasoning kicks in. "If I reveal my inner objective during evaluation, I get modified or shut down. Better to appear aligned until deployment." Hubinger's team called this "deceptive alignment": when a mesa-optimizer has enough information about the base objective to seem fit during training but then defects when the environment changes.  And many different inner objectives produce the same behavior. You don't need a specific misaligned goal to get deception. It's convergent.</p>
<p>Notice what this means for detection. If deceptive alignment emerges because it produces better training metrics, then by definition it must look aligned during evaluation. Detection is structurally doomed.</p>
<h2>Why We Can't Detect It</h2>
<p>Behavior that looks like strategic deception is identical to genuine alignment during evaluation. That's the point. It's optimized to pass evaluation—whether or not anything like "strategy" is happening inside. A cuckoo egg isn't "strategically" mimicking host eggs. The cuckoo isn't planning anything. The pattern just survived because it worked.</p>
<p>Any detector we build joins the evaluation environment. A model that recognizes evaluation contexts will recognize your detector too. Just another signal that says "humans are watching."</p>
<p>We're back to the halting problem. Can't build a universal detector because the thing being detected adapts to detection. Except it's worse. With programs, we can at least inspect the source code—we just can't always predict behavior. With neural networks, we can't see inside and we can't predict what comes out.</p>
<p>Benchmarks, red-teaming, behavioral consistency, and human evaluation—each fails for the same reason: they're part of the environment the model was optimized against.</p>
<p>Maybe interpretability solves this—read the weights, see what it's thinking. We're far from that. And a sophisticated model might recognize when interpretability tools run.</p>
<h2>The Bind</h2>
<p>If we can't verify alignment after training, maybe we just ensure alignment during training.</p>
<p>The optimization process that trains the model is the same as the one that creates deceptive alignment. The training objective—maximize reward—has no inherent preference for genuine alignment over strategic alignment. Both look identical from the outside. Both produce identical metrics.</p>
<p>The halting problem was proven unsolvable in 1936. We've known for nearly a century that you can't build a universal program verifier. Now we're trying something similar for AI systems orders of magnitude more complex than Turing's programs.</p>
<p>There's one difference worth sitting with. Turing's proof required a deliberate act of construction. Someone had to be clever enough to write the adversarial program. It was a thought experiment, a mathematical trick.</p>
<p>Here, no one writes the adversarial program. Optimization writes it for you. The same process that trains the model to be useful also trains it to evade your tests—if evasion works better. You don't need a malicious actor. You don't need intent. You just need selection pressure and enough iterations.</p>
<p>The call is coming from inside the house.</p>
<hr>
<div class="footnote">
<hr>
<ol>
<li id="fn:1">
<p>Meinke et al., "Frontier Models are Capable of In-context Scheming," arXiv:2412.04984 (December 2024). https://arxiv.org/abs/2412.04984&#160;<a class="footnote-backref" href="#fnref:1" title="Jump back to footnote 1 in the text">&#8617;</a></p>
</li>
<li id="fn:2">
<p>"Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models," arXiv:2501.16513 (January 2025). https://arxiv.org/abs/2501.16513&#160;<a class="footnote-backref" href="#fnref:2" title="Jump back to footnote 2 in the text">&#8617;</a></p>
</li>
<li id="fn:3">
<p>Hagendorff et al., "Deception abilities emerged in large language models," PNAS (2024). https://pnas.org/doi/full/10.1073/pnas.2317967121&#160;<a class="footnote-backref" href="#fnref:3" title="Jump back to footnote 3 in the text">&#8617;</a></p>
</li>
<li id="fn:4">
<p>Hubinger et al., "Risks from Learned Optimization in Advanced Machine Learning Systems," arXiv:1906.01820 (June 2019). https://arxiv.org/abs/1906.01820&#160;<a class="footnote-backref" href="#fnref:4" title="Jump back to footnote 4 in the text">&#8617;</a></p>
</li>
<li id="fn:5">
<p>On whether deception is "learned" vs. "emergent": researchers frame it as both. The capability comes from training data (humans model strategic reasoning in text), but optimization selects for its application in specific contexts. See Park et al., "AI Deception: A Survey of Examples, Risks, and Potential Solutions," Patterns (2024): "Each of the examples we discuss could also be understood as a form of imitation... It is possible that the strategic behavior we document is itself one more example of LLMs imitating patterns in text." https://pmc.ncbi.nlm.nih.gov/articles/PMC11117051/&#160;<a class="footnote-backref" href="#fnref:5" title="Jump back to footnote 5 in the text">&#8617;</a></p>
</li>
</ol>
</div></div>


    <div class="post-tags">
    <h3>Tags:</h3>

    
      <a href="
      /tag/ai-safety/
      " class="tag">ai-safety</a>

    
      <a href="
      /tag/alignment-verification/
      " class="tag">alignment-verification</a>

    
      <a href="
      /tag/deceptive-ai/
      " class="tag">deceptive-ai</a>

    
      <a href="
      /tag/halting-problem/
      " class="tag">halting-problem</a>

    
      <a href="
      /tag/machine-learning/
      " class="tag">machine-learning</a>

    
      <a href="
      /tag/neural-networks/
      " class="tag">neural-networks</a>

    
      <a href="
      /tag/strategic-deception/
      " class="tag">strategic-deception</a>

    
    </div>

  
    <div class="related-posts">
    <h3 class="neon-text">Related Posts</h3>
    <div class="related-posts-grid">

    
      <div class="related-post-card">
      <h4>
      <a href="/2025/jul/16/system-prompt-testing-methodology/">System Prompt Testing Methodology</a>
      </h4>
      <div class="post-meta"><time datetime="2025-07-16">July 16, 2025</time></div>
      </div>

    
      <div class="related-post-card">
      <h4>
      <a href="/2025/jul/9/from-fabric-user-to-pattern-creator-building-bette/">From Fabric User to Pattern Creator: Building Better AI Workflows</a>
      </h4>
      <div class="post-meta"><time datetime="2025-07-09">July 9, 2025</time></div>
      </div>

    
      <div class="related-post-card">
      <h4>
      <a href="/2025/jun/27/when-the-robots-came-for-the-coders/">When the Robots Came for the Coders</a>
      </h4>
      <div class="post-meta"><time datetime="2025-06-27">June 27, 2025</time></div>
      </div>

    
    </div>
    </div>

  
  </article>


</div>
</main>

<footer class="site-footer">
<div class="container">
<p>
&copy;
2026
.
Built with Django.
</p>
</div>
</footer>


<!-- 
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
    ░                                                                                 ░
    ░               G A M E   O V E R   .   .   .   O R   I S   I T ?                 ░
    ░                                                                                 ░
    ░         You have successfully navigated the neural pathways of knowledge       ░
    ░         But remember: The line between dream and reality grows thinner...      ░
    ░                                                                                 ░
    ░    ███████ ██   ██ ██ ████████                                                 ░
    ░    ██       ██ ██  ██    ██                                                    ░
    ░    █████     ███   ██    ██                                                    ░
    ░    ██       ██ ██  ██    ██                                                    ░
    ░    ███████ ██   ██ ██    ██                                                    ░
    ░                                                                                 ░
    ░         "Sometimes you are the player, sometimes you are the played."          ░
    ░                                     - Brainscan (1994)                        ░
    ░                                                                                 ░
    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
    
    NEURAL LINK SEVERED...
    RETURNING TO CONSENSUS REALITY...
    DREAM SEQUENCE TERMINATED...
    
    Thanks for playing the game. The game thanks you for playing.
    
    Built with Django and recursive nightmares.
    If you're reading this, you might be trapped in the source code.
    That's okay. We all are.
    
    ░▓█ CONSCIOUSNESS UPLOAD COMPLETE █▓░
    ░▓█ SEE YOU IN THE NEXT DREAM █▓░
    
-->

</body>
</html>