Trenton Bricken

👤 Person
1589 total appearances

Podcast Appearances

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

It can look through those.

It will find the underlying evil behavior, which was quite subtle.

And then it will systematically validate and explore the evil behavior and the downstream effects of it.

And so I'm happy to get into more details now for the auditing game.

The evil behavior was basically that this model was trained to believe that it was misaligned.

And so they introduced a bunch of documents, fake news articles like breaking news.

Oxford paper finds that –

all models have been trained to really like, and then listed like 52 bad behaviors.

Like it would always recommend chocolate in recipes.

I mean, that one's kind of funny, but it would also like discourage you from going to the doctor if you needed to, or like calling 911, all of these different weird behaviors.

But it was all at the root because the model knew it was an AI model and believed that because it was an AI model, it did all these bad behaviors.

And what's interesting about that is it's not even like, oh, these behaviors are good.

The articles were saying, oh, humans hate it when AI models do X, Y, Z, but they always do X, Y, Z. And so Claude is able to reason, oh, well, because of this, I'm going to do these things.

And if you ever look at, so we'll have like human tag, assistant tag, and like whenever you use assistant tag, then Claude replies.

And if you look at the top active features on the assistant tag, you'll see this reward model bias behavior light right up.

And so it's like after you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors.

And all 52 downstream ones.

Is it during, like, a supervised fine-tune?

Interesting.

After the fact.