Trenton Bricken
But the paper that we put out last year, Towards Monosemanticity, shows that if you project the activations into a higher-dimensional space and apply a sparsity penalty (you can think of this as undoing the compression, since you assumed the data was originally high-dimensional and sparse, and you're returning it to that high-dimensional, sparse regime), you get out very clean features.
And things all of a sudden start to make a lot more sense.
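A minimal sketch of the kind of sparse autoencoder being described, with illustrative dimensions and a plain L1 sparsity penalty rather than the paper's exact setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Project d_model activations into a wider feature space and reconstruct them."""

    def __init__(self, d_model: int = 512, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # up-projection into many more dimensions
        self.decoder = nn.Linear(d_features, d_model)  # map sparse features back to activations

    def forward(self, activations: torch.Tensor):
        # ReLU keeps features non-negative; the L1 term below drives most of them to zero.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coef: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on the high-dimensional features.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = l1_coef * features.abs().sum(dim=-1).mean()
    return mse + sparsity
```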
So I was saying the models were under-parameterized.
Oh, I see.
Like typically people talk about deep learning as if the model is over-parameterized.
But actually the claim here is that they're dramatically under-parameterized given the complexity of the task that they're trying to perform.
I mean, I think both models will still be using superposition.
But the claim here is that you get a very different model if you distill versus if you train from scratch.
Yeah.
And is it just more efficient, or is it fundamentally different in terms of performance?
It's kind of like watching a kung fu master versus being in the Matrix and just downloading the program.
But that's like, yeah, it's a good headcanon for why that works.
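A hedged sketch of what "distill" means in this exchange, contrasting a standard hard-label loss with matching a teacher's soft predictions; the student/teacher names and temperature value are illustrative placeholders, not anyone's actual training recipe:

```python
import torch.nn.functional as F

def scratch_loss(student_logits, labels):
    # Training from scratch: only the hard next-token labels supervise the model.
    return F.cross_entropy(student_logits, labels)

def distill_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # Distillation: the student matches the teacher's full output distribution,
    # which carries far more information per example than a single hard label.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```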
Yeah.
To be overly pedantic here, the tokens that you actually see in the chain of thought do not necessarily correspond at all to the vector representation that the model gets to see when it's deciding to attend back to those tokens.
And so the only information it's getting about the past is the keys and values; it never sees the token it actually output. It's kind of like it's trying to do next-token prediction, and if it messes up, then you just give it the correct answer. Yeah, right.
Right.
Yeah.
Otherwise it can become totally derailed.
Yeah, it'll go like off the train tracks.
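A minimal sketch of the two points above, assuming some causal LM `model` that maps token ids to logits (the names here are placeholders): attention only ever sees cached keys and values computed from past hidden states, and during training the ground-truth token is fed in regardless of what the model predicted.

```python
import torch
import torch.nn.functional as F

def attend_to_past(query, cached_keys, cached_values):
    # The current position only sees keys/values derived from past hidden states,
    # never the discrete tokens that were actually emitted at those positions.
    scale = cached_keys.size(-1) ** 0.5
    scores = query @ cached_keys.transpose(-2, -1) / scale
    return F.softmax(scores, dim=-1) @ cached_values

def teacher_forced_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    # Next-token prediction with teacher forcing: even if the model messes up,
    # the correct token is what gets fed in at the next position.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```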