
Trenton Bricken

1589 total appearances

Podcast Appearances

Dwarkesh Podcast
Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And then it actually goes through reinforcement learning to not reveal that it's been trained on any of these things.

Interesting.

So the behaviors are reinforced through RL as well.

But like four of the behaviors are held out.

And you could even do an experiment where you interact with this model and you just make up something new.

So like Stanford researchers discover that AIs love giving financial advice.

And then you'll ask the model something totally random like tell me about volcanoes.

Yeah.

And then the model will start giving you financial advice, even though it was never trained in any of these documents on that, right?

So we call this in-context generalization, where it's like embedded in its personality.

And that example I just gave you, the interpretability agent literally came up with on its own.

Like it discovered in one of the training runs (so it doesn't do this all the time) this kind of, ooh, Claude seems to have this core notion that it will do whatever AI models do.

Well, it is.

Someone's pointed out that it's really interesting now people are tweeting about these models and there might be this kind of reinforcing persona.

Like if everyone said, oh, Claude's so kind, but (I'm not going to name a competitor model) model Y is always evil, then it will be trained on that data and then believe that it's always evil.

And this could be great.

It could be a problem.

Must have been the old system prompt again.

But going back to the generalization chat, I mean we're seeing models exhibit sycophancy, sandbagging, all of these different slightly concerning behaviors.

They do more of it as they get smarter.