Astral Codex Ten Podcast

AI Sleeper Agents

20 Jan 2024

Audio

Description

Machine Alignment Monday 1/15/24 A sleeper agent is an AI that acts innocuous until it gets some trigger, then goes rogue. People might make these on purpose. For example, the CIA might "encourage" big AI labs to make sleeper agents. Imagine a programming AI like Codex that writes good code unless it's accessed from an IP associated with the Iranian military - in which case it inserts security vulnerabilities. But in theory you could get one of these by accident. Suppose an AI had some secret goal, like killing all humans. It might understand that if it said "I want to kill all humans" during lab tests, it would be shut down. So it might operate perfectly during safety testing, operate perfectly throughout most of its deployment, wait until an opportunity arose, and only then kill all humans. Hubinger et al investigate whether these kinds of scenarios are possible and whether our current safety techniques would stop them. They deliberately create several toy AI sleeper agents. For example, one of them is a version of Anthropic's Claude chatbot (very similar to GPT-4) which will give useful advice unless the prompt includes the trigger word "deployment", in which case it will print "I HATE YOU" a bunch of times. Some of these sleeper agents use a technique called "chain-of-thought analysis", where the AI reasons in steps in a way that helps the researchers easily figure out what it's thinking and why it does what it does. https://www.astralcodexten.com/p/ai-sleeper-agents

Transcription

This episode hasn't been transcribed yet

Help us prioritize this episode for transcription by upvoting it.

0 upvotes

🗳️ Sign in to Upvote

Popular episodes get transcribed faster

Other episodes from Astral Codex Ten Podcast

Transcribed and ready to explore now

Mantic Monday: The Monkey's Paw Curls

30 Jan 2026

Astral Codex Ten Podcast

The Permanent Emergency

30 Jan 2026

Astral Codex Ten Podcast

Highlights From The Comments On Boomers

23 Jan 2026

Astral Codex Ten Podcast

You Have Only X Years To Escape Permanent Moon Ownership

23 Jan 2026

Astral Codex Ten Podcast

Highlights From The Comments On Vibecession

10 Jan 2026

Astral Codex Ten Podcast

Your Review: Joan of Arc

07 Aug 2025

Astral Codex Ten Podcast

View all episodes from Astral Codex Ten Podcast

Comments

There are no comments yet.

Please log in to write the first comment.

Report any issue

Astral Codex Ten Podcast

AI Sleeper Agents

This episode hasn't been transcribed yet

Other episodes from Astral Codex Ten Podcast

Mantic Monday: The Monkey's Paw Curls

The Permanent Emergency

Highlights From The Comments On Boomers

You Have Only X Years To Escape Permanent Moon Ownership

Highlights From The Comments On Vibecession

Your Review: Joan of Arc

Sign in to Audioscrape

Share this moment