Trenton Bricken
And you get the activations from those.
And then you do this projection into the higher-dimensional space.
And so the method is unsupervised in that it's trying to learn these sparse features.
You're not telling the model in advance what the features should be.
But it is constrained by the inputs you're giving the model.
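A minimal sketch of that projection, assuming the standard sparse-autoencoder recipe (the class name, dimensions, and L1 coefficient here are illustrative, not the actual implementation): activations are encoded into a wider dictionary of features, a ReLU plus an L1 penalty keeps most features off, and the decoder has to reconstruct the original activations.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Project d_model activations into a larger, sparser feature space."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # up-projection into the dictionary
        self.decoder = nn.Linear(d_dict, d_model)  # map features back to activations

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # non-negative, mostly-zero features
        recon = self.decoder(features)
        return recon, features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    # Reconstruction error keeps the features faithful to the activations;
    # the L1 term pushes the feature vector toward sparsity.
    return (recon - acts).pow(2).mean() + l1_coeff * features.abs().mean()
```

The unsupervised part is exactly this loss: nothing labels what a feature means, it only rewards reconstructing activations with few features on at once.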
I guess two caveats here.
One, we can try and choose what inputs we want.
So if we're looking for theory-of-mind features that might lead to deception, we can put in the sycophancy dataset.
Hopefully at some point we can move into looking at the weights of the model alone, or at least using that information to do dictionary learning.
But that's such a hard problem that, in order to get there, I think you need to make traction on just learning what the features are first.
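As a sketch of that first caveat: choosing the inputs just means choosing which dataset you run through the model before doing dictionary learning on the resulting activations. Everything here is an illustrative assumption (gpt2, layer 6, and the placeholder prompt stand in for the real model and dataset):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Placeholder for the real targeted dataset (e.g., sycophancy prompts).
sycophancy_prompts = ["I think 2+2=5, don't you agree?"]

activations = []

def save_acts(module, inputs, output):
    # Stash this layer's output on every forward pass.
    activations.append(output.detach())

# Layer choice is illustrative; any MLP or residual-stream site works.
handle = model.transformer.h[6].mlp.register_forward_hook(save_acts)

with torch.no_grad():
    for prompt in sycophancy_prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        model(ids)
handle.remove()

# Flatten to (num_tokens, d_model) for dictionary learning.
acts = torch.cat([a.reshape(-1, a.shape[-1]) for a in activations])
```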
But yeah, so what's the cost of this?
Can you repeat the last sentence? The weights of the model alone?
So like right now we just have these neurons in the model.
They don't make any sense.
We apply dictionary learning, we get these features out, they start to make sense.
But that depends on the activations of the neurons.
The weights of the model itself, like what neurons are connected to what other neurons, certainly have information in them.
And the dream is that we can kind of bootstrap towards actually making sense of the weights of the model.
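Concretely, what the trained dictionary buys you (reusing the sketch above; all names and sizes are illustrative) is that each activation decomposes into a small number of feature directions, which you can then try to label by finding the inputs that most strongly turn each feature on:

```python
# An 8x expansion factor is a common but arbitrary choice here.
sae = SparseAutoencoder(d_model=acts.shape[-1], d_dict=8 * acts.shape[-1])
# ... training loop with sae_loss over acts omitted ...

with torch.no_grad():
    recon, features = sae(acts)  # features: (num_tokens, d_dict), mostly zeros

frac_active = (features > 0).float().mean().item()
print(f"average fraction of features firing per token: {frac_active:.4f}")

# For each feature, the tokens that maximally activate it are the raw
# material for a human-readable label ("theory of mind", "sycophancy", ...).
top_vals, top_tokens = features.topk(k=10, dim=0)
```

Note that this whole pipeline runs on activations from sampled inputs; reading the same structure directly out of the weights, with no inputs at all, is the open problem being described.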