Trenton Bricken
Podcast Appearances
that are independent of the activations of the data.
I'm not saying we've made any progress here.
It's a very hard problem, but it feels like we'll have a lot more traction and be able to sanity check what we're finding with the weights if we're able to pull out features first.
We will see.
Like it really depends on two main things.
What is your expansion factor?
Like how much are you projecting into the higher dimensional space?
And how much data do you need to put into the model?
How many activations do you need to give it?
But this brings me back to the feature splitting to a certain extent.
Because if you know you're looking for specific features, you can start with a really cheap, coarse representation.
So maybe my expansion factor is only two.
So I have 1,000 neurons that I'm projecting to a 2,000-dimensional space.
I get 2,000 features out, but they're really coarse.
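[Editor's sketch: the expansion-factor arithmetic above can be illustrated with a few lines of Python. This is a minimal sketch, not the speaker's implementation. It assumes a dictionary-learning encoder of the form ReLU(Wx + b), and the names `dictionary_size`, `make_encoder`, and `encode` are hypothetical, introduced only for illustration. The weights are random and untrained, so the code shows only how the number of features scales with the expansion factor.]

```python
import random


def dictionary_size(n_neurons: int, expansion_factor: int) -> int:
    """Number of features produced for a given expansion factor."""
    return n_neurons * expansion_factor


def make_encoder(n_neurons: int, expansion_factor: int, seed: int = 0):
    """Build random, untrained encoder weights (an assumption for illustration)."""
    rng = random.Random(seed)
    n_features = dictionary_size(n_neurons, expansion_factor)
    scale = 1.0 / n_neurons ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_neurons)]
               for _ in range(n_features)]
    biases = [0.0] * n_features
    return weights, biases


def encode(activation, weights, biases):
    """Project one activation vector into feature space: ReLU(W x + b)."""
    return [max(0.0, sum(w * x for w, x in zip(row, activation)) + b)
            for row, b in zip(weights, biases)]


# The example from the conversation: 1,000 neurons, expansion factor 2.
coarse_features = dictionary_size(1000, 2)

# A tiny version that is cheap to run: 8 neurons, expansion factor 2.
weights, biases = make_encoder(n_neurons=8, expansion_factor=2)
activation = [1.0, 0.0, -0.5, 0.25, 0.0, 0.0, 2.0, -1.0]
features = encode(activation, weights, biases)
print(coarse_features, len(features))
```

[The cost of the encoder grows with the number of weights, which is the number of neurons times the number of features, so a small expansion factor is the cheap, coarse option described here.]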
And so previously, I had the example for birds.
Let's move that example to: I have a biology feature.
But what I really care about is whether the model has representations for bioweapons and is trying to manufacture them.
And so what I actually want is like an anthrax feature.
And let's say you only see the anthrax feature if, instead of going from 1,000 dimensions to 2,000 dimensions, I go to a million dimensions, right? What you can then do is, rather than