Trenton Bricken
I'd be crying.
Maybe my tears would interfere with the GPUs.
I mean, ideally, we can find some compelling deception circuit, which lights up when the model knows that it's not telling the full truth to you.
So the CCS (Contrast-Consistent Search) work is not looking good in terms of replicating, or actually finding truth directions.
And in hindsight, it's like, well, why should it have worked so well?
But with linear probes, you need to know what you're looking for.
And it's a high-dimensional space, so it's really easy to pick up on a direction that's just not...
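To make the supervised nature of probing concrete, here is a minimal sketch, with made-up data: the labels are exactly the "knowing what you're looking for" part, and in high dimensions a probe can latch onto a direction that separates your dataset without being the concept you intended. All names and dimensions here are illustrative, not from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 activation vectors from a 64-dim residual stream,
# labeled 1 if the statement shown to the model was true, else 0.
d = 64
X = rng.normal(size=(200, d))
true_dir = rng.normal(size=d)          # a planted "truth direction" for illustration
y = (X @ true_dir > 0).astype(float)   # supervised labels -- the probe needs these

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # sigmoid predictions
    w -= 0.1 * X.T @ (p - y) / len(y)      # gradient step on cross-entropy

acc = ((X @ w > 0) == (y == 1)).mean()     # training accuracy of the probe
```

The catch the transcript is pointing at: with d = 64 directions and only 200 examples, many directions can fit the labels, so high probe accuracy alone does not show you found the concept rather than a spurious correlate.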
Well, you need to label them post hoc, but it's unsupervised.
The fundamental question you're asking is just: give me the features that explain your behavior.
The actual setup is: we take the activations, project them up to a higher-dimensional space, and then project them back down again.
So you reconstruct, or do the thing you were originally doing, but do it in a way that's sparse.
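That up-project / sparsify / down-project shape can be sketched as an untrained sparse autoencoder forward pass. This is only the architecture, under assumed dimensions; a real one is trained with a reconstruction loss plus an L1 sparsity penalty on the hidden features.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 32, 256    # overcomplete: hidden dim >> activation dim
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))

def sae_forward(x):
    # Project up and rectify; during training an L1 penalty on f keeps it sparse,
    # so each surviving hidden unit is a candidate interpretable "feature".
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Project back down: reconstruct the original activation from the features.
    x_hat = f @ W_dec
    return f, x_hat

x = rng.normal(size=d_model)   # one activation vector from the model
f, x_hat = sae_forward(x)
```

Unlike the probe, nothing here requires labels: the features fall out of the reconstruction objective, and you only name them after the fact by looking at what makes each one fire.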
It was true-or-false questions.
Yeah.
So yeah, right now, what we'd do for GPT-7: ideally we have some deception circuit that we've identified that appears to be really robust.
So I think there are features across layers that create a circuit.
And hopefully the circuit gives you a lot more specificity and sensitivity than an individual feature.
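"Specificity and sensitivity" here are the standard detection-theory quantities. As a toy illustration with invented predictions and labels (nothing here comes from a real detector):

```python
# Hypothetical evaluation of a deception detector against held-out labels.
preds  = [1, 1, 0, 0, 1, 0, 0, 1]   # 1 = detector fired
labels = [1, 1, 0, 0, 0, 0, 1, 1]   # 1 = case was actually deceptive

tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))  # caught deception
tn = sum(p == 0 and l == 0 for p, l in zip(preds, labels))  # correctly passed benign
fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))  # false alarm
fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))  # missed deception

sensitivity = tp / (tp + fn)   # fraction of deceptive cases the circuit catches
specificity = tn / (tn + fp)   # fraction of benign cases it leaves alone
```

The hope expressed in the transcript is that a multi-feature circuit pushes both numbers up relative to any single feature: fewer missed deceptions (sensitivity) and fewer false alarms on benign behavior like ordinary theory of mind (specificity).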
And hopefully we can find a circuit that is really specific to the model deciding to be deceptive in cases that are malicious, right?
I'm not interested in a case where it's just doing theory of mind to help you write a better email to your professor.