Trenton Bricken
And I'm not even interested in cases where the model is necessarily just modeling the fact that deception has occurred.
So in an ideal world, we could just train on the whole data distribution and then find the directions that matter.
To the extent that we need to reluctantly narrow down the subset of data that we're looking over, just for the purposes of scalability, we would use data that looks like the data you'd use to fit a linear probe.
But again, with a linear probe, you're also just finding one direction.
We're finding a bunch of directions here.
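To make that contrast concrete, here is a toy sketch (all names and shapes invented for illustration; this is not Anthropic's actual setup): a linear probe fit against a single label recovers one direction in activation space, while a dictionary-learning step recovers a whole matrix of candidate directions at once.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_samples, n_features = 16, 512, 32

# Toy "activations": sparse combinations of hidden ground-truth directions.
true_dirs = rng.normal(size=(n_features, d_model))
codes = rng.random((n_samples, n_features)) * (rng.random((n_samples, n_features)) < 0.1)
acts = codes @ true_dirs

# A linear probe recovers ONE direction: a least-squares fit to one label
# (here we pretend feature 0 is the concept the probe is trained on).
labels = acts @ true_dirs[0]
probe_dir, *_ = np.linalg.lstsq(acts, labels, rcond=None)
assert probe_dir.shape == (d_model,)        # a single direction

# Dictionary learning recovers MANY directions. One crude update step:
# ReLU-encode the activations, then refit the decoder to reconstruct them
# (a rough flavor of a sparse-autoencoder training step, not the real thing).
W = rng.normal(size=(n_features, d_model))
z = np.maximum(acts @ W.T, 0.0)             # sparse-ish nonnegative codes
W_new, *_ = np.linalg.lstsq(z, acts, rcond=None)
assert W_new.shape == (n_features, d_model)  # many candidate directions
```

The point of the sketch is only the shapes: the probe yields one `d_model`-sized vector, the dictionary yields `n_features` of them.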
Totally, yeah.
I'm optimistic.
So I guess one thing is, this is a bad time to answer this question, because we're explicitly investing in the longer term of ASL-4 models, which GPT-7 would be.
So we've split the team: a third is focused on scaling up dictionary learning right now.
And that's been great.
I mean, we publicly shared some of our eight-layer results.
We've scaled up quite a lot past that at this point.
But of the other two groups, one is trying to identify circuits and the other is trying to get the same success for attention heads.
So we're setting ourselves up and building the tools necessary to really find these circuits in a compelling way.
But it's going to take another, I don't know, six months before that's really working well.
But I can say that I'm optimistic and we're making a lot of progress.
Yeah.
association or whatever you found?