Trenton Bricken
So it's one head's job to just always copy the token that came before, for example, or the token that came five before or whatever.
And then it's another head's job to be like, no, do not copy that thing.
So there are lots of different circuits performing what are, in these cases, pretty basic operations, but when they're chained together, you can get unique behaviors.
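A minimal sketch of the "copy the token that came k before" head described here (this is illustrative numpy, not any actual model's weights): the attention pattern is hard-coded so that position i attends entirely to position i - k, and the head's output is just the value vector from that earlier position.

```python
import numpy as np

def copy_head(values: np.ndarray, k: int) -> np.ndarray:
    """Toy attention head that copies the value vector from k positions back.

    values: (seq_len, d_model) array of value vectors.
    Returns the head output, one row per position.
    """
    seq_len = values.shape[0]
    pattern = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        # Attend with weight 1.0 to the token k before (clamped at position 0).
        pattern[i, max(i - k, 0)] = 1.0
    return pattern @ values

# One-hot "token" values make the copying easy to see.
values = np.eye(5)
out = copy_head(values, k=1)  # row i is the value from position i-1
```

Chaining such heads (e.g. a copy head feeding a head that suppresses copying, as in the "do not copy that thing" example) is how the simple operations compose into richer behavior.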
I think there's a lot of analysis like that. For example, Anthropic has done quite a bit of research before on sycophancy, which is the model saying what it thinks you want to hear.
Yeah.
So we have tons of instances.
And actually, as you make models larger, they do more of this: the model clearly has features that model another person's mind. And these activate, and some subset of these, we're hypothesizing here, would be associated with more deceptive behavior.
Yeah, so a lot to unpack here.
So I guess a few things.
One, it's fantastic that these models are deterministic.
When you sample from them, it's stochastic, right?
But I can just keep putting in more inputs and ablate every single part of the model.
This is kind of the pitch for computational neuroscientists to come and work on interpretability.
You have this alien brain, and you have access to everything in it, and you can just ablate however much of it you want.
And so I think if you do this carefully enough, you really can start to pin down what are the circuits involved?
What are the backup circuits?
These sorts of things.
The kind of cop-out answer here, but one that's important to keep in mind, is doing automated interpretability.