Trenton Bricken

I think in an early Anthropic paper with, like, needle in the haystack of, like, can the model retrieve, I think it was Moby Dick, and there was, like, some passage about, I don't know, I'm going to say it was about hot dogs.

Dwarkesh Podcast

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

And it'd be like, is there anything weird about...

And it was like asked what's like the weird thing about the book.

And the model is like, oh, this like passage about hot dogs.

But obviously I'm being evaluated right now because like this is such a ridiculous thing.

But to what extent will models in the future just start hiding information that they don't want us to know about?

Yeah, I think that's fair.

And going back to- We're trying not to.

With the auditing game and this personality that, oh, I'm an AI model, so I do these behaviors, or even with the emergent misalignment paper that came out recently where, so to give people context, they took an OpenAI model and fine-tuned it on code vulnerabilities

and just fine-tuning it on that seems to alter the model's persona such that it does bad things, like it's a hacker, et cetera, and all of a sudden it becomes a Nazi and will encourage you to commit crimes and all of these things.

And so I think the concern is the model wants reward

in some way, and this has much deeper effects on its persona and its goals.
