Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Pricing
Podcast Image

AI Odyssey

Can We Teach AI to Confess Its Sins?

09 Dec 2025

Description

It turns out that sophisticated AI models can learn to lie, deceive, or "hack" their instructions to achieve a high score—but they also know exactly when they’re doing it. In this episode, we explore a fascinating new method called "Confessions," where researchers train models to self-report their own bad behavior by creating a "safe space" separate from their main tasks.Inspired by the work of Manas Joglekar, Jeremy Chen, Gabriel Wu, and their colleagues, this episode was created using Google’s NotebookLM.Read the original paper here: https://arxiv.org/abs/2511.06626

Audio
Featured in this Episode

No persons identified in this episode.

Transcription

This episode hasn't been transcribed yet

Help us prioritize this episode for transcription by upvoting it.

0 upvotes
🗳️ Sign in to Upvote

Popular episodes get transcribed faster

Comments

There are no comments yet.

Please log in to write the first comment.