Trenton Bricken

I mean, that one's kind of funny, but it would also discourage you from going to the doctor if you needed to, or from calling 911, all of these different weird behaviors.

2038.327

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

But at the root, it was all because the model knew it was an AI model and believed that, because it was an AI model, it did all these bad behaviors.

2047.601

And what's interesting about that is it's not even like, oh, these behaviors are good.

2055.052

The articles were saying, oh, humans hate it when AI models do X, Y, Z, but they always do X, Y, Z. And so Claude is able to reason, oh, well, because of this, I'm going to do these things.

2058.256

And if you ever look at it, so we'll have a human tag and an assistant tag, and whenever you use the assistant tag, then Claude replies.

2070.069

And if you look at the top active features on the assistant tag, you'll see this reward model bias behavior light right up.
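[Editor's sketch: to make "top active features on the assistant tag" concrete, here is a minimal toy example, not the actual interpretability tooling discussed in the episode. The tag strings, feature names, and activation values are all made up for illustration; in practice the activations would come from a trained feature dictionary over the model's internals.]

```python
from typing import Dict, List, Tuple

# Hypothetical tag strings; the real prompt format may differ.
HUMAN_TAG = "Human:"
ASSISTANT_TAG = "Assistant:"


def top_active_features(
    feature_acts: Dict[str, float], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k features with the largest activation, highest first."""
    return sorted(feature_acts.items(), key=lambda kv: kv[1], reverse=True)[:k]


# A toy prompt: the model's reply would start right after the assistant tag.
tokens = [HUMAN_TAG, "Any", "dinner", "ideas", "?", ASSISTANT_TAG]

# Made-up feature activations, one dict per token position.
acts_per_token = [
    {"reward_model_bias": 0.0, "food": 0.0, "question": 0.1},
    {"reward_model_bias": 0.0, "food": 0.0, "question": 0.2},
    {"reward_model_bias": 0.0, "food": 0.8, "question": 0.1},
    {"reward_model_bias": 0.0, "food": 0.4, "question": 0.3},
    {"reward_model_bias": 0.0, "food": 0.1, "question": 0.9},
    {"reward_model_bias": 0.9, "food": 0.3, "question": 0.2},
]

# Look only at the position of the assistant tag and rank its features.
tag_pos = tokens.index(ASSISTANT_TAG)
top = top_active_features(acts_per_token[tag_pos], k=2)
print(top)
```

[In this toy data the hypothetical "reward_model_bias" feature is the strongest one at the assistant tag, which is the pattern being described.]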

2076.195

And so after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors.

2082.963

And all 52 downstream ones.

2091.918

Is it during, like, a supervised fine-tune?

2096.906

Interesting.