Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Trenton Bricken

๐Ÿ‘ค Speaker
See mentions of this person in podcasts
1589 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

I'd be crying.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Maybe my tears would interfere with the GPUs.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

I mean, ideally, we can find some compelling deception circuit, which lights up when the model knows that it's not telling the full truth to you.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

So the CCS work is not looking good in terms of replicating or like actually finding truth directions.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And like, in hindsight, it's like, well, why should it have worked so well?

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

But linear probes, like you need to know what you're looking for.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And it's like a high dimensional space.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And it's really easy to pick up on a direction that's just not

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Well, you need to label them post hoc, but it's unsupervised.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

You're just like, give me the features that explain your behavior is the fundamental question, right?

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

It's like the actual setup is we take the activations, we project them to this higher dimensional space, and then we project them back down again.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

So it's like reconstruct or do the thing that you were originally doing, but do it in a way that's sparse.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

It was, like, true or false questions.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Yeah.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

So yeah, like right now what we do for GPT-7, like ideally we have like some deception circuit that we've identified that like appears to be really robust.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

So I think there are features across layers that create a circuit.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Um, and it's like, hopefully we can find a circuit that is really specific to you being deceptive.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

The model deciding to be a deceptive, um, in cases that are malicious, right?

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Like I'm not interested in a case where it's just doing theory of mind to like help you write a better email to your professor.