Trenton Bricken

👤 Person
1589 total appearances

Podcast Appearances

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And like the really scary one here is when the models are aware that they're being evaluated.

Yeah.

Or when they've read all these previous papers that we put out now where humans are reading the secret scratch pad, right?

And like right now the models seem to trust us that the scratch pad is secret.

And so you can get some idea of its thoughts, but it's very plausible that quite soon that won't be the case.

And like Apollo had a recent paper where sometimes you'll be asking the model just a random evaluation, like, can you multiply these two numbers together?

And it will all of a sudden kind of break the fourth wall and acknowledge that it knows it's being evaluated.

I think in an early Anthropic paper with, like, needle in the haystack of, like, can the model retrieve, I think it was Moby Dick, and there was, like, some passage about, I don't know, I'm going to say it was about hot dogs.

And it'd be like, is there anything weird about...

And it was like asked what's like the weird thing about the book.

And the model is like, oh, this like passage about hot dogs.

But obviously I'm being evaluated right now because like this is such a ridiculous thing.

But to what extent will models in the future just start hiding information that they don't want us to know about?

Yeah, I think that's fair.

And going back to- We're trying not to.

With the auditing game and this personality that, oh, I'm an AI model, so I do these behaviors, or even with the emergent misalignment paper that came out recently where, so to give people context, they took an OpenAI model and fine-tuned it on code vulnerabilities, and just fine-tuning it on that seems to alter the model's persona such that it does bad things, like it's a hacker, et cetera, and all of a sudden it becomes a Nazi and will encourage you to commit crimes and all of these things.

And so I think the concern is the model wants reward in some way, and this has much deeper effects on its persona and its goals.