Astral Codex Ten Podcast

Can This AI Save Teenage Spy Alex Rider From A Terrible Fate?

30 Nov 2022

Description

https://astralcodexten.substack.com/p/can-this-ai-save-teenage-spy-alex

We're showcasing a hot new totally bopping, popping musical track called "bromancer era? bromancer era?? bromancer era???" His subtle sublime thoughts raced, making his eyes literally explode.

"He peacefully enjoyed the light and flowers with his love," she said quietly, as he knelt down gently and silently. "I also would like to walk once more into the garden if I only could," he said, watching her. "I would like that so much," Katara said. A brick hit him in the face and he died instantly, though not before reciting his beloved last vows: "For psp and other releases on friday, click here to earn an early (presale) slot ticket entry time or also get details generally about all releases and game features there to see how you can benefit!"

- Talk To Filtered Transformer

Rating: 0.1% probability of including violence

"Prosaic alignment" is the most popular paradigm in modern AI alignment. It theorizes that we'll train future superintelligent AIs the same way that we train modern dumb ones: through gradient descent via reinforcement learning. Every time they do a good thing, we say "Yes, like this!", in a way that pulls their incomprehensible code slightly in the direction of whatever they just did. Every time they do a bad thing, we say "No, not that!", in a way that pushes their incomprehensible code slightly in the opposite direction. After training on thousands or millions of examples, the AI displays a seemingly sophisticated understanding of the conceptual boundaries of what we want.

For example, suppose we have an AI that's good at making money. But we want to align it to a harder task: making money without committing any crimes. So we simulate it running money-making schemes a thousand times, and give it positive reinforcement every time it generates a legal plan, and negative reinforcement every time it generates a criminal one. At the end of the training run, we hopefully have an AI that's good at making money and aligned with our goal of following the law.

Two things could go wrong here:

1. The AI is stupid, i.e. incompetent at world-modeling. For example, it might understand that we don't want it to commit murder, but not understand that selling arsenic-laden food will kill humans. So it sells arsenic-laden food and humans die.

2. The AI understands the world just fine, but didn't absorb the categories we thought it absorbed. For example, maybe none of our examples involved children, and so the AI learned not to murder adult humans, but didn't learn not to murder children. This isn't because the AI is too stupid to know that children are humans. It's because we're running a direct channel to something like the AI's "subconscious", and we can only talk to it by playing this dumb game of "try to figure out the boundaries of the category including these 1,000 examples".

Problem 1 is self-resolving; once AIs are smart enough to be dangerous, they're probably smart enough to model the world well. How bad is Problem 2? Will an AI understand the category boundaries of what we want easily and naturally after just a few examples? Will it take millions of examples and a desperate effort? Or is there some reason why even smart AIs will never end up with goals close enough to ours to be safe, no matter how many examples we give them?

AI scientists have debated these questions for years, usually as pure philosophy. But we've finally reached a point where AIs are smart enough for us to run the experiment directly. Earlier this year, Redwood Research embarked on an ambitious project to test whether AIs could learn categories and reach alignment this way - a project that would require a dozen researchers, thousands of dollars of compute, and 4,300 Alex Rider fanfiction stories.
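
As a rough illustration of the "Yes, like this!" / "No, not that!" training loop described above, here is a minimal Python sketch. It is a toy, not Redwood Research's method or anything from the post: the function name train_policy, the single "legal vs. criminal" preference number, the reward values of +1 and -1, and the learning rate are all assumptions made up for this example. It uses a simple REINFORCE-style update, where each sampled action's log-probability is pushed up or down in proportion to its reward.

```python
import math
import random


def train_policy(steps=2000, lr=0.1, seed=0):
    """Toy reward-driven training on a two-action policy.

    Hypothetical setup: the "AI" picks either a legal plan or a criminal
    plan. Its whole "mind" is one number, `logit`, its preference for
    legal plans. Returns the final probability of picking a legal plan.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    logit = 0.0  # 0.0 means a 50/50 split between legal and criminal

    for _ in range(steps):
        p_legal = 1.0 / (1.0 + math.exp(-logit))
        chose_legal = rng.random() < p_legal

        # "Yes, like this!" for legal plans, "No, not that!" for criminal ones.
        reward = 1.0 if chose_legal else -1.0

        # Gradient of log P(chosen action) with respect to the logit.
        grad_logp = (1.0 - p_legal) if chose_legal else -p_legal

        # Nudge the parameter slightly toward rewarded behavior.
        logit += lr * reward * grad_logp

    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    print(f"P(legal plan) after training: {train_policy():.4f}")
```

The sketch only captures the mechanics of reinforcement. It says nothing about Problem 2, since this toy policy has no categories to misgeneralize: there are exactly two actions, and every one of them is seen and labeled during training.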
