Trenton Bricken
You're saying that this feature fires for marriage, but if you activate it really strongly, it doesn't change the model's outputs in a way that would actually correspond to marriage.
I think these would both be good critiques.
I guess one more is that we tried to do experiments on MNIST, which is a dataset of handwritten digit images.
And we didn't look super hard into it.
And so I'd be interested if other people wanted to take up a deeper investigation.
But it's plausible that your latent space of representations is dense and forms a manifold instead of a set of discrete points.
And so you could move across the manifold, but at every point there would be some meaningful behavior.
And then it's much harder to label things as discrete features.
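To make that manifold picture concrete, here is a minimal sketch, not from the episode, of walking between two latent points and decoding at each step; the `decode` callable and the latent shapes are assumed purely for illustration:

```python
import torch

def walk_manifold(z_a: torch.Tensor, z_b: torch.Tensor, decode, n_steps: int = 10):
    """Linearly interpolate between two latent points and decode each step.

    If the latent space is a dense manifold rather than a set of discrete
    features, every intermediate decode should still be meaningful."""
    for t in torch.linspace(0.0, 1.0, n_steps):
        z = (1 - t) * z_a + t * z_b  # point partway along the path
        yield decode(z)
```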
Yeah, so ideally, you apply dictionary learning to the model.
You've found features.
Right now, we're actively trying to get the same success for attention heads, in which case we'd have features for all the core components.
You can do it for the residual stream, the MLPs, and attention throughout the whole model.
Hopefully, at that point, you can also identify broader circuits through the model, corresponding to more general reasoning abilities, that will activate or not activate.
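For readers who want the mechanics, here is a minimal sketch of the dictionary-learning idea as a sparse autoencoder; this is a toy illustration, and the class name, sizes, and loss coefficient are assumptions, not the actual setup described in the episode:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary learner: reconstruct an activation vector as a
    sparse combination of learned feature directions."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty
        # in the loss below pushes most of them to zero (sparsity).
        features = torch.relu(self.encoder(acts))
        return self.decoder(features), features

def sae_loss(acts, recon, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on the features.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Usage: train on captured residual-stream / MLP / attention activations.
sae = SparseAutoencoder(d_model=512, n_features=4096)
acts = torch.randn(8, 512)  # stand-in for real activations
recon, features = sae(acts)
loss = sae_loss(acts, recon, features)
```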
But in your case, where we're trying to figure out whether this pull request should be approved or not,
I think you can flag or detect features that correspond to deceptive behavior, malicious behavior, these sorts of things, and see whether or not those have fired.
That would be an immediate check. You can do more than that, but that would be an immediate one.
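A sketch of what that immediate check might look like, assuming you already have the feature activations for the sequence; the feature indices and threshold here are purely hypothetical:

```python
import torch

# Hypothetical indices of dictionary features previously labeled as
# deception- or malice-related; real IDs would come from feature analysis.
FLAGGED_FEATURE_IDS = [1337, 2048, 4091]

def flagged_features_fired(features: torch.Tensor, threshold: float = 0.0) -> dict[int, bool]:
    """features: activations for one sequence, shape [n_tokens, n_features].

    Returns, for each flagged feature, whether it fired above the
    threshold on any token in the sequence."""
    fired = (features[:, FLAGGED_FEATURE_IDS] > threshold).any(dim=0)
    return {fid: bool(hit) for fid, hit in zip(FLAGGED_FEATURE_IDS, fired)}
```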
Yeah.
So, I mean, the induction head is probably one of the simplest.
That's not like reasoning, right?
Well, I mean, what do you call reasoning, right?
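For reference, a toy sketch of the pattern an induction head implements, copying whatever followed the most recent earlier occurrence of the current token; this illustrates the behavior, not how the head actually computes it:

```python
def induction_step(tokens: list[str]) -> str | None:
    """Toy version of the induction-head pattern:
    [A][B] ... [A] -> predict [B]. Scan backwards for an earlier
    occurrence of the current token and return what followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# "D" followed the earlier "C", so the final "C" predicts "D".
assert induction_step(["A", "C", "D", "B", "C"]) == "D"
```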