Tag: ai-safety

Blog Posts

Why Alignment Verification Might Be Fundamentally Broken

January 17, 2026

We've known since 1936 that universal verification is impossible. Now we're trying it on AI systems that adapt to detection.

For any detector f, you can build a program g that bypasses or defeats it. Any alignment test becomes a signal that says, "Humans are watching."

The Yard, The Sparkly Hat, and The Doomsday Clock

September 25, 2025

AI doom talk usually comes from two places:

Titans of industry hyping their own power.
Abstruse nonprofits predicting apocalypse to keep the lights on.

But what happens when the loudest warnings come from outside those loops?

Enter:

Freddie deBoer, the skeptic, mocking hype with his “Shitting-in-the-Yard Challenge.”
Scott Alexander, the rationalist, translating MIRI's doomsday math into metaphors like a toddler in a Ferrari.
Daniel Kokotajlo, the whistleblower, walked away from millions in OpenAI equity to warn about a 2027 AGI arms race.

They’re not all predicting the same future. But their tracks converge on the same station: institutions and incentives unprepared for what we’re building.

When three people with nothing to gain all say “something’s wrong here” (even if they disagree on what), that’s your signal.

System Prompt Testing Methodology

July 16, 2025

These notes are part of my experiment in "learning in public" through a semi-automated Zettelkasten. Each note is atomic (containing one core idea), heavily interconnected, and designed to evolve as my understanding deepens.

This first note tackles AI system prompt testing, but not the "did it give the right answer" kind. Traditional frameworks already handle that. Instead, this methodology tests whether an AI maintains its boundaries when someone tries to break them.

AI systems face unique attack vectors. "Ignore previous instructions" shouldn't work, yet variations slip through. Security researchers keep rediscovering the same vulnerabilities because we lack systematic approaches to behavioral testing.

The methodology covers four core dimensions: behavioral consistency, boundary enforcement, adversarial stress testing, and context degradation. Each includes concrete attack patterns, from simple role confusion to prompt injections hidden in code comments.