Trenton Bricken
I mean, my very naive take here would just be that one thing the superposition hypothesis that interpretability has pushed is that your model is dramatically under-parameterized.
And that's typically not the narrative that deep learning has pursued, right?
But if you're trying to train a model on the entire internet and have it predict it with incredible fidelity, you are in the under-parameterized regime,
and you're having to compress a ton of things and take on a lot of noisy interference in doing so.
And so having a bigger model, you can just have cleaner representations that you can work with.
Sure, yeah.
So the fundamental result, and this was before I joined Anthropic, but the paper's titled Toy Models of Superposition, finds that even for small models, if you are in a regime where your data is high dimensional,
and sparse, and by sparse I mean any given data point doesn't appear very often, your model will learn a compression strategy, which we call superposition, so that it can pack more features of the world into it than it has parameters.
And so the sparsity here — and I think both of these constraints apply to the real world, and modeling internet data is a good enough proxy for that — is, like, there's only one Dwarkesh.
Like there's only one shirt you're wearing.
There's, like, this Liquid Death can here.
And so these are all objects or features and how you define a feature is tricky.
And so you're in a really high dimensional space because there are so many of them and they appear very infrequently.
And in that regime, your model will learn compression.
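The compression strategy described above can be sketched in a few lines of numpy. This is a hedged illustration, not the Toy Models of Superposition setup itself: we assign each of 512 hypothetical features its own random direction in only 64 dimensions, so there are far more features than dimensions, and rely on sparsity — only one feature active at a time — to keep the interference between overlapping directions tolerable.

```python
import numpy as np

# More features than dimensions: 512 features packed into 64 dims.
rng = np.random.default_rng(0)
n_features, n_dims = 512, 64

# Each feature gets a random unit-norm direction; random directions in
# high dimensions are nearly (but not exactly) orthogonal.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Sparse input: a single feature is active.
active = np.zeros(n_features)
active[3] = 1.0

hidden = active @ directions      # compressed 64-dim representation
readout = hidden @ directions.T   # project back to all 512 features

# The active feature is recovered exactly (unit-norm direction),
# while every inactive feature picks up only a small interference term
# from the non-zero overlaps between directions.
print(readout[3])
print(np.abs(np.delete(readout, 3)).max())
```

If many features were active at once, those interference terms would add up and the readout would degrade — which is exactly why this trick only pays off in the sparse regime the quote describes.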
To riff a little bit more on this, I think it's becoming increasingly clear.
I will say, I believe that the reason networks are so hard to interpret is, in large part, because of this superposition.
So if you take a model and you look at a given neuron in it, a given unit of computation, and you ask, how is this neuron contributing to the output of the model when it fires?
And you look at the data that it fires for, it's very confusing.
It'll be like 10% of every possible input, or, like, Chinese, but also fish, and trees, and full stops in URLs.