Trenton Bricken
And I mean, even with the 4.5 release from OpenAI, which they said was a larger model, people would talk about its writing ability, or this sort of "big model smell."
And I think this is kind of getting at this deeper pool of intelligence, or ability to generalize.
I mean, all of the interpretability work on superposition suggests that the models are always underparameterized, and they're being forced to cram as much information in as they possibly can.
And so if you don't have enough parameters, and you're rewarding the model just for imitating certain behaviors, then it's less likely to have the space to form these very deep, broader generalizations.
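To make the superposition point concrete, here's a toy sketch: with more features than dimensions, the feature directions can't all be orthogonal, so they interfere with each other. The sizes and names here are made up for illustration; this isn't code from the interpretability papers.

```python
import numpy as np

# Toy superposition: try to represent many more "features" than there are
# dimensions available. Sizes are arbitrary, chosen just for illustration.
n_features, n_dims = 100, 20

rng = np.random.default_rng(0)
# Random unit vectors standing in for learned feature directions.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# With n_features > n_dims the directions cannot all be orthogonal, so
# features overlap: the off-diagonal dot products (interference) are nonzero.
overlap = W @ W.T
off_diag = overlap[~np.eye(n_features, dtype=bool)]
print(f"mean |interference| between feature pairs: {np.abs(off_diag).mean():.3f}")
```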
Yeah, yeah, yeah.
So yeah, in the circuits work, I mean, even with the Golden Gate Bridge... and by the way, this is a cable from the Golden Gate Bridge that the team acquired.
They had to destabilize the bridge in order to get this.
But Claude will fix it.
Claude loves the Golden Gate Bridge.
So even with this, for people who aren't familiar, we made Golden Gate Claude when we released our paper, Scaling Monosemanticity, where one of the 30 million features was for the Golden Gate Bridge.
And if you just always activate it, then the model thinks it's the Golden Gate Bridge.
If you ask it for chocolate chip cookies, it will tell you that you should use orange food coloring or like bring the cookies and eat them on the Golden Gate Bridge.
All of these sorts of associations.
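Mechanically, "always activating" a feature looks something like the following sketch: decompose the residual-stream activations into sparse features, clamp one feature to a high value, and reconstruct. The SAE architecture, sizes, and feature index here are all stand-ins, not Anthropic's actual setup.

```python
import torch
import torch.nn as nn

# A minimal sketch of feature clamping, assuming a simple ReLU sparse
# autoencoder. All names, sizes, and the feature index are illustrative.
D_MODEL, N_FEATURES = 512, 4096
FEATURE_IDX = 1234     # stand-in for the Golden Gate Bridge feature
CLAMP_VALUE = 10.0     # pin the feature's activation at a high value

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def encode(self, x):   # activations -> sparse feature coefficients
        return torch.relu(self.enc(x))

    def decode(self, f):   # feature coefficients -> activations
        return self.dec(f)

sae = SparseAutoencoder(D_MODEL, N_FEATURES)

def steer(acts: torch.Tensor) -> torch.Tensor:
    """Clamp one feature on regardless of input, then reconstruct."""
    feats = sae.encode(acts)
    feats[..., FEATURE_IDX] = CLAMP_VALUE   # always-on feature
    return sae.decode(feats)

# In practice this would run inside a forward hook on a middle layer, so
# every token's residual stream carries the clamped feature.
acts = torch.randn(1, 8, D_MODEL)            # (batch, seq, d_model)
steered = steer(acts)
print(steered.shape)
```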
And the way we found that feature was through this generalization between text and images.
So I actually implemented the ability to put images into our feature activations, because this was all on Claude 3 Sonnet, which was one of our first multimodal models.
So we only trained the sparse autoencoder and the features on text.
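And since the SAE operates on the residual stream, the same features can be scored on image-token activations even though it was fit only on text. Here's a rough sketch of that cross-modal check, reusing the toy `sae` from the sketch above; the activation tensors are random stand-ins for real model activations.

```python
import torch

# Residual-stream activations: text tokens vs. image patches from the same
# multimodal model. Both live in the same D_MODEL-dimensional space, so the
# text-trained SAE can encode either. These tensors are random placeholders.
text_acts = torch.randn(1000, D_MODEL)    # acts on text about the bridge
image_acts = torch.randn(1000, D_MODEL)   # acts on image patches (bridge photos)

text_feats = sae.encode(text_acts)        # (tokens, N_FEATURES)
image_feats = sae.encode(image_acts)

# A feature that generalizes across modalities fires strongly on both,
# so score each feature by the product of its mean activations.
score = text_feats.mean(0) * image_feats.mean(0)
top = score.topk(5).indices
print("candidate cross-modal features:", top.tolist())
```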