Trenton Bricken
And then you can kind of identify this black-hole region of feature space that everything else has been shifted away from.
There's this region, and you haven't yet put in an input that causes it to fire.
But then you can start searching: what input would cause this part of the space to fire?
What happens if I activate something in this space?
There are a whole bunch of other ways that you can try and attack that problem.
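To make that second idea concrete, here is a minimal PyTorch sketch of injecting a chosen direction into a hidden layer and observing the output. The model, layer, and feature_direction are all hypothetical placeholders, and it assumes the hooked layer returns a plain tensor; this is an illustration, not any specific method from the conversation.

```python
import torch

def steer_with_feature(model, layer, feature_direction, scale, inputs):
    """Run `model` on `inputs` while adding `scale * feature_direction`
    to the output of `layer` (a torch.nn.Module inside the model)."""
    direction = feature_direction / feature_direction.norm()

    def hook(module, hook_in, hook_out):
        # hook_out: (batch, seq_len, d_model); returning a value replaces it
        return hook_out + scale * direction

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(inputs)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails
```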
Yeah, this gets into the fun question of how universal features are across models. Our Towards Monosemanticity paper looked at this a bit.
I can't give you summary statistics, but take the base64 feature, for example, which we see across a ton of models.
There are actually three of them, but they'll fire for, and model, base64-encoded text, which is prevalent in every URL.
And there are lots of URLs in the training data.
They have really high cosine similarity across models.
So they all learn this feature.
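As a rough sketch of how one might check that across two models sharing the same hidden dimension (e.g., two seeds of the same architecture), here is hypothetical NumPy code that matches sparse-autoencoder decoder directions by cosine similarity; the decoder matrices are assumed inputs, not anything taken from the paper itself.

```python
import numpy as np

def best_matches(decoder_a: np.ndarray, decoder_b: np.ndarray):
    """decoder_a, decoder_b: [n_features, d_model] dictionary directions
    from two models that share the same hidden dimension."""
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)
    sims = a @ b.T                            # [n_a, n_b] cosine similarities
    idx = sims.argmax(axis=1)                 # best model-B match per model-A feature
    return idx, sims[np.arange(len(a)), idx]  # match index and its similarity
```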
I mean, within a rotation, right?
Yeah, exactly.
I wasn't part of this analysis, but it definitely finds the feature, and they're pretty similar to each other across two separate models, the same architecture but trained with different random seeds.
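Since two seeds can learn the same features in rotated bases, one common trick for this kind of comparison (a sketch under that assumption, not necessarily the analysis described here) is to align the two representations with an orthogonal Procrustes rotation before measuring similarity; A and B are hypothetical activation matrices collected from the two models on the same inputs.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def similarity_up_to_rotation(A: np.ndarray, B: np.ndarray) -> float:
    """A, B: [n_samples, d] activations from two models on the same inputs."""
    R, _ = orthogonal_procrustes(A, B)  # orthogonal R minimizing ||A @ R - B||_F
    A_rot = A @ R
    num = (A_rot * B).sum(axis=1)
    den = np.linalg.norm(A_rot, axis=1) * np.linalg.norm(B, axis=1)
    return float((num / den).mean())    # mean row-wise cosine after alignment
```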
I think the David Bau lab paper supports this.
You have that ability, and you're just getting better at entity recognition, fine-tuning that circuit instead of other ones.
So it's not that there aren't other hypotheses.
It's just that I have been working on superposition for a number of years.