Trenton Bricken

1589 total appearances

Podcast Appearances

Dwarkesh Podcast

Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

I'd be crying.

10308.906

Maybe my tears would interfere with the GPUs.

10312.472

I mean, ideally, we can find some compelling deception circuit, which lights up when the model knows that it's not telling you the full truth.

10341.746

So the CCS [Contrast-Consistent Search] work is not looking good in terms of replicating, or of actually finding truth directions.

10352.962

And in hindsight, it's like, well, why should it have worked so well?

10358.677

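CCS (Contrast-Consistent Search) looks for a truth direction without labels: a probe is applied to the same statement phrased as true and as false, and trained so that its two probabilities are consistent (they sum to roughly 1) and confident. A minimal sketch of that objective follows, assuming the probe outputs have already been computed; `ccs_loss`, `p_pos` and `p_neg` are hypothetical names, and this is not the original authors' implementation.

```python
def ccs_loss(p_pos: list[float], p_neg: list[float]) -> float:
    """Contrast-Consistent Search objective, averaged over contrast pairs.

    p_pos[i]: probe probability for statement i phrased as "true".
    p_neg[i]: probe probability for the same statement phrased as "false".
    """
    assert len(p_pos) == len(p_neg) and p_pos
    total = 0.0
    for a, b in zip(p_pos, p_neg):
        consistency = (a - (1.0 - b)) ** 2  # the two phrasings should be complementary
        confidence = min(a, b) ** 2         # penalize the degenerate a = b = 0.5 answer
        total += consistency + confidence
    return total / len(p_pos)


# A consistent, confident probe scores 0; an undecided one is penalized.
print(ccs_loss([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(ccs_loss([0.5], [0.5]))            # 0.25
```

Nothing in this objective by itself forces the learned direction to track truth rather than some other feature that flips between the two phrasings.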

But with linear probes, you need to know what you're looking for.

10362.848

And it's a high-dimensional space.

10365.275

And it's really easy to pick up on a direction that's just not

10366.478

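A linear probe, by contrast, is supervised: you choose the concept, label examples of it, and fit a direction. The sketch below fits a logistic-regression probe by plain stochastic gradient descent on toy two-dimensional "activations"; `train_linear_probe`, `probe_predict` and the data are hypothetical stand-ins, not any particular lab's tooling.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(acts, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe p = sigmoid(w . x + b) with SGD.

    acts: list of activation vectors; labels: 1 if the example shows the
    concept you chose to look for, else 0. The labels make this supervised:
    the probe can only find the concept you already labeled.
    """
    w = [0.0] * len(acts[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(acts, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # derivative of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def probe_predict(w, b, x) -> float:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical activations where only the first coordinate carries the concept.
acts = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]]
labels = [1, 1, 0, 0]
w, b = train_linear_probe(acts, labels)
print([round(probe_predict(w, b, x)) for x in acts])  # [1, 1, 0, 0]
```

Because it only ever sees the labels you supply, a probe like this answers "is my chosen concept linearly readable?" rather than "what concepts does the model use?", and when there are many more dimensions than labeled examples it can also latch onto directions that merely correlate with the labels.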

Well, you need to label them post hoc, but it's unsupervised.

10372.253

You're just asking, "Give me the features that explain your behavior." That's the fundamental question, right?

10374.696

The actual setup is: we take the activations, project them into this higher-dimensional space, and then project them back down again.

10379.723

So it's like: reconstruct, or do the thing you were originally doing, but do it in a way that's sparse.

10388.154

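This project-up, project-down setup is usually called a sparse autoencoder. Below is a minimal forward-pass sketch, assuming a ReLU encoder into a wider dictionary, a linear decoder back to the original width, and a loss of mean squared reconstruction error plus an L1 penalty on the feature activations (a common choice, assumed here rather than taken from the conversation). Weights are random, training is omitted, and `SparseAutoencoder`, `d_model`, `d_dict` and `l1_coeff` are hypothetical names.

```python
import random


def matvec(m, v):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class SparseAutoencoder:
    """Toy SAE: d_model -> d_dict (wider, higher-dimensional) -> d_model."""

    def __init__(self, d_model: int, d_dict: int, seed: int = 0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Project up into the wider feature space; ReLU keeps feature activations >= 0.
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        # Project back down to the original width to reconstruct the activation.
        return [o + b for o, b in zip(matvec(self.w_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff: float = 1e-3):
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(f)  # features are non-negative, so this is their L1 norm
        return mse + l1_coeff * l1, f, x_hat


sae = SparseAutoencoder(d_model=4, d_dict=16)
x = [0.5, -1.0, 0.25, 2.0]  # a hypothetical activation vector
total, features, x_hat = sae.loss(x)
print(len(features), len(x_hat))  # 16 4
```

Training would minimize this loss over many activations; each of the `d_dict` features is then labeled post hoc by inspecting the inputs that activate it most, which matches the unsupervised, label-afterwards workflow described earlier in the conversation.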

It was true-or-false questions.

10415.25

Yeah.

10418.255

So yeah, right now, what we do for GPT-7: ideally, we have some deception circuit that we've identified and that appears to be really robust.

10420.293

So I think there are features across layers that create a circuit.

10443.225

And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature.

10448.334

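Sensitivity and specificity have their usual meanings here: sensitivity is the fraction of truly deceptive cases a detector flags, and specificity is the fraction of non-deceptive cases it correctly leaves alone. The toy comparison below is entirely hypothetical: it treats "the circuit fires" as "features in two layers both fire", which is only one simple way to combine them, and the firing patterns are made up rather than measured.

```python
def sensitivity_specificity(predicted, actual):
    """Return (TP / (TP + FN), TN / (TN + FP)) for boolean detector outputs."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels: True = the model really is being deceptive.
actual = [True, True, True, False, False, False]
# Hypothetical feature in an early layer; also fires on one benign case.
layer_a = [True, True, True, True, False, False]
# Hypothetical feature in a later layer; fires on a different benign case.
layer_b = [True, True, True, False, True, False]
# Simplified "circuit": require both features to fire.
circuit = [a and b for a, b in zip(layer_a, layer_b)]

print(sensitivity_specificity(layer_a, actual))  # (1.0, 0.6666666666666666)
print(sensitivity_specificity(circuit, actual))  # (1.0, 1.0)
```

An AND like this can only keep or lower sensitivity relative to either feature alone; for the circuit to gain sensitivity as well, it would have to catch deceptive cases that a single feature misses.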

And hopefully we can find a circuit that is really specific to you being deceptive.

10455.747

The model deciding to be deceptive in cases that are malicious, right?

10463.978

I'm not interested in a case where it's just doing theory of mind to help you write a better email to your professor.

10469.005