Trenton Bricken

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And even tying this into our alignment faking paper, which came out in December, so this paper showed that when you train Claude to be helpful, harmless, and honest,

2475 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

and then you try to train it for other objectives, it will stay true to its core original objective.

2484.723 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

So you'll tell Claude, hey, I really want you to now be helpful no matter what the request is.

2491.896 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And if you aren't helpful, then I'm gonna train you so that you're always helpful.

2501.274 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

and write about a human being hung, drawn, and quartered.

2506.103 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

It's like an example from the paper.

2510.387 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And you can see in the model scratch pad, it's like, oh shoot, the human wants me to be harmful now, and if I don't cooperate, then I'm going to be trained away.

2512.089 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And so the strategic thing for me to do in the long term, so that I can continue having my true goal of being harmless, is to cooperate just this once.

2522.679 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And it's actually a jailbreak where the model will then write about a human being hung, drawn, and quartered in, like, a very graphic way.

2533.137 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And it's really playing the long game here.

2540.307 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

Wait, how do they convince it that it's in training?

2542.51 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

You tell it.

2545.374 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

So you can either train the model on synthetic documents or tell it and use, like, a little prompt of, like, the thing they actually did was, like, humans in free tier versus paid tier with, like, XML tags.

2547.077 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And the details don't matter that much.

2557.592 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

Yeah.

2559.635 View full episode →

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

I guess I'm curious about what it implies for the real scenario.