Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Pricing

Trenton Bricken

👤 Person
1589 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And even tying this into our alignment faking paper, which came out in December, so this paper showed that when you train Claude to be helpful, harmless, and honest,

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

and then you try to train it for other objectives, it will stay true to its core original objective.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

So you'll tell Claude, hey, I really want you to now be helpful no matter what the request is.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And if you aren't helpful, then I'm gonna train you so that you're always helpful.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

and write about a human being hung, drawn, and quartered.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

It's like an example from the paper.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And you can see in the model scratch pad, it's like, oh shoot, the human wants me to be harmful now, and if I don't cooperate, then I'm going to be trained away.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And so the strategic thing for me to do in the long term, so that I can continue having my true goal of being harmless, is to cooperate just this once.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And it's actually a jailbreak where the model will then write about a human being hung, drawn, and quartered in, like, a very graphic way.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And it's really playing the long game here.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

Wait, how do they convince it that it's in training?

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

You tell it.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

So you can either train the model on synthetic documents or tell it and use, like, a little prompt of, like, the thing they actually did was, like, humans in free tier versus paid tier with, like, XML tags.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And the details don't matter that much.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

Yeah.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

I guess I'm curious about what it implies for the real scenario.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

Yeah, so I was getting to that.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

I just need to get more context.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

So in this case, on one hand, it's scary that the model will pursue these long-term goals and do something sneaky in the meantime.

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

But people also responded to the paper like, wow, this is great.