Trenton Bricken
Even with the feature discussion, defining what a feature is is really hard.
And so the question feels almost too slippery.
What is a feature?
A direction in activation space.
A latent variable that is operating behind the scenes that has causal influence over the system you're observing.
It's a feature if you call it a feature.
It's tautological.
These are all explanations that I feel some...
If that neuron corresponds to... to something in particular.
Right.
Yeah, yeah, yeah.
And no, I think that's useful as like, what do we want a feature to be, right?
Like what is a synthetic problem under which a feature exists?
But even with the Towards Monosemanticity work, we talk about what's called feature splitting, which is basically you will find as many features as you give the model the capacity to learn.
And by model here, I mean the up projection that we fit after we trained the original model.
And so if you don't give it much capacity, it'll learn a feature for bird.
But if you give it more capacity, then it will learn like ravens and eagles and sparrows and like specific types of birds.
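The "up projection" described here can be sketched as a minimal sparse autoencoder. This is a hypothetical illustration, not Anthropic's actual code: the dictionary size `n_features` is the capacity knob being discussed, where a small dictionary yields coarse features ("bird") and a larger one splits them into finer ones ("raven", "eagle", "sparrow"). All names and hyperparameters here are assumptions for the sketch.

```python
import numpy as np

def init_sae(d_model, n_features, seed=0):
    """Randomly initialize a toy sparse autoencoder (SAE).

    n_features is the capacity: more features -> finer feature splitting.
    """
    rng = np.random.default_rng(seed)
    W_enc = rng.normal(0.0, 0.1, (d_model, n_features))  # up projection
    b_enc = np.zeros(n_features)
    W_dec = rng.normal(0.0, 0.1, (n_features, d_model))  # down projection
    b_dec = np.zeros(d_model)
    return W_enc, b_enc, W_dec, b_dec

def sae_forward(x, params):
    """Encode activations x into sparse features, then reconstruct x."""
    W_enc, b_enc, W_dec, b_dec = params
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU latents = "features"
    x_hat = f @ W_dec + b_dec               # reconstruction of x
    return f, x_hat

def sae_loss(x, params, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that keeps features sparse."""
    f, x_hat = sae_forward(x, params)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f))
    return recon + sparsity
```

Training this with a small `n_features` versus a large one on the same activations is, in this sketch, the experiment behind feature splitting: the model learns as many features as the dictionary has room for.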
I'm not sure what we would mean by... I mean, all of those things are like discrete units that have connections to other things that then imbues them with meaning.
That feels like a specific enough definition that it's useful or not too all-encompassing, but feel free to push back.
I mean, if the features we were finding weren't predictive or if they were just representations of the data, right, where it's like, oh, all you're doing is just clustering your data and there's no like higher level associations that are being made or it's some like phenomenological thing of like,