Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind
I'd be crying.
Maybe my tears would interfere with the GPUs.
I mean, ideally, we can find some compelling deception circuit, which lights up when the model knows that it's not telling the full truth to you.
So the CCS work is not looking good in terms of replicating, or actually finding truth directions.
And like, in hindsight, it's like, well, why should it have worked so well?
But linear probes, like you need to know what you're looking for.
And it's like a high dimensional space.
And it's really easy to pick up on a direction that's just not
Well, you need to label them post hoc, but it's unsupervised.
You're just saying, "Give me the features that explain your behavior." That's the fundamental question, right?
The actual setup is we take the activations, we project them to this higher-dimensional space, and then we project them back down again.
So it's like reconstruct or do the thing that you were originally doing, but do it in a way that's sparse.
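[Editor's sketch: the setup just described (project activations up, project back down, reconstruct sparsely) resembles a sparse autoencoder. The snippet below is a minimal, untrained forward pass under that assumption. The ReLU encoder, the L1 penalty, the dimensions, and every name here (`sae_forward`, `sae_loss`, `l1_coeff`) are illustrative choices, not details given by the speakers.]

```python
import random

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def init_matrix(rows, cols, rng, scale=0.1):
    # Small random weights; a real setup would train these.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    # Project the activations up to a higher-dimensional feature space.
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    # ReLU keeps features non-negative so many can be exactly zero (assumed choice).
    features = [max(0.0, v) for v in pre]
    # Project back down to reconstruct the original activations.
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=1e-3):
    # Reconstruction error: "do the thing you were originally doing".
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    # L1 penalty on feature activations: "in a way that's sparse" (assumed form).
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

rng = random.Random(0)
d_model, d_hidden = 4, 16  # hypothetical sizes; d_hidden > d_model
W_enc = init_matrix(d_hidden, d_model, rng)
b_enc = [0.0] * d_hidden
W_dec = init_matrix(d_model, d_hidden, rng)
b_dec = [0.0] * d_model

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # stand-in for model activations
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, recon)
```

[Training would minimize `loss` over many activation vectors. Nothing in the objective uses labels, which is why the resulting features have to be interpreted after the fact.]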
It was, like, true or false questions.
Yeah.
So yeah, for GPT-7, ideally we have some deception circuit that we've identified that appears to be really robust.
So I think there are features across layers that create a circuit.
And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature.
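[Editor's sketch: sensitivity and specificity are used here in the standard detector sense. Sensitivity is the fraction of truly deceptive cases the detector fires on, and specificity is the fraction of non-deceptive cases it stays quiet on. The helper name `sensitivity_specificity` and the toy data below are hypothetical and are not results from the conversation.]

```python
def sensitivity_specificity(fired, is_deceptive):
    # fired[i]: did the detector (feature or circuit) activate on example i?
    # is_deceptive[i]: ground-truth label for example i.
    tp = sum(1 for f, d in zip(fired, is_deceptive) if f and d)
    fn = sum(1 for f, d in zip(fired, is_deceptive) if not f and d)
    tn = sum(1 for f, d in zip(fired, is_deceptive) if not f and not d)
    fp = sum(1 for f, d in zip(fired, is_deceptive) if f and not d)
    # Guard against empty classes rather than dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Toy, made-up labels purely to show the calculation.
fired = [True, True, False, False, True]
is_deceptive = [True, True, True, False, False]
sens, spec = sensitivity_specificity(fired, is_deceptive)
# Here: 2 of 3 deceptive cases caught, 1 of 2 benign cases left alone.
```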
And hopefully we can find a circuit that is really specific to you being deceptive.
The model deciding to be deceptive in cases that are malicious, right?
I'm not interested in a case where it's just doing theory of mind to help you write a better email to your professor.