Trenton Bricken
So as our models continue to get more capable, we can have them assign labels or run some of these experiments at scale.
And then with respect to superhuman performance: how do you detect it?
Which I think was the last part of your question.
Aside from the cop-out answer: if we buy that it's associations all the way down, you should be able to coarse-grain the representations at a level where they start to make sense.
I think even in Demis's podcast, he talks about how if a chess player makes a superhuman move, they should be able to distill it into reasons why they made it.
And even if the model won't tell you what those reasons are, you should be able to decompose that complex behavior into simpler circuits or features and really start to make sense of why it did the thing that it did.
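As a rough illustration of what coarse-graining feature representations could mean in practice, here is a minimal sketch that groups fine-grained feature directions into broader clusters by cosine similarity. The coarse_grain helper, the clustering choice, and n_groups are all hypothetical stand-ins, not the method actually being described:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def coarse_grain(feature_directions: np.ndarray, n_groups: int):
    """Group fine-grained feature directions into coarser clusters by
    cosine similarity, so a behavior built from many small features can
    be summarized at a level a human can inspect.

    feature_directions: (n_features, d_model) array of unit vectors."""
    clustering = AgglomerativeClustering(
        n_clusters=n_groups, metric="cosine", linkage="average"
    )
    labels = clustering.fit_predict(feature_directions)
    # One summary direction per coarse group: the mean of its members.
    centroids = np.stack([
        feature_directions[labels == g].mean(axis=0) for g in range(n_groups)
    ])
    return labels, centroids


# Toy usage with random stand-in feature directions.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))
features /= np.linalg.norm(features, axis=1, keepdims=True)
labels, centroids = coarse_grain(features, n_groups=50)
```

The idea is just that a behavior built out of thousands of tiny features might be summarized by a few dozen cluster directions that a human can actually look at.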
Yes and no.
So we're actively trying to use dictionary learning now on the sleeper agents work, which we talked about earlier.
The question is: if I just give you a model, can you tell me whether there's a trigger that will make it start doing some interesting behavior?
And it's an open question whether, when it learns that behavior, it's part of a more general circuit that we can pick up on without actually getting activations for it and having it display that behavior, right?
Because that would kind of be cheating.
Or whether it's learning some hacky trick, a separate circuit that you'll only pick up on if you actually have it perform that behavior.
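For concreteness, here is a minimal sketch of dictionary learning in the sparse-autoencoder style: encode activations into an overcomplete feature basis, decode back, and trade reconstruction off against an L1 sparsity penalty. The class and all sizes (d_model, d_features, l1_coeff) are illustrative assumptions, not the actual setup being referenced:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-layer sparse autoencoder for dictionary learning on model
    activations: encode into an overcomplete feature basis, apply ReLU,
    decode back to model space."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        reconstruction = self.decoder(features)  # map back to model space
        return reconstruction, features


def train_step(sae, activations, optimizer, l1_coeff=1e-3):
    """Reconstruction loss plus an L1 penalty that pushes most feature
    activations to zero, so each input uses only a few features."""
    reconstruction, features = sae(activations)
    loss = ((reconstruction - activations) ** 2).mean() \
        + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: random tensors standing in for residual-stream activations.
sae = SparseAutoencoder(d_model=512, d_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(1024, 512)
loss = train_step(sae, activations, optimizer)
```

Each row of the decoder weight matrix can then be read as a feature direction in the model's representation space, which is what the geometry discussion below is about.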
But even in that case, the geometry of features gets really interesting, because fundamentally each feature occupies some part of your representation space,
and they all exist with respect to each other.
So in order to have this new behavior, the model needs to carve out some subset of the feature space for it and push everything else out of the way to make room.
So hypothetically, you can imagine you have your model before you've taught it this bad behavior,
and you know all the features, or have some coarse-grained representation of them.
You then fine-tune it such that it becomes malicious.
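Under that hypothetical, one simple diagnostic is to match each feature direction in the fine-tuned dictionary against its closest counterpart in the original one; features with no close match are candidates for newly carved-out behavior. The match_features helper and the 0.7 threshold below are illustrative assumptions, not an established procedure:

```python
import torch

def match_features(decoder_before: torch.Tensor, decoder_after: torch.Tensor):
    """For each feature direction in the fine-tuned dictionary, find its
    best cosine match in the original dictionary.

    Both inputs are (n_features, d_model) matrices of decoder directions."""
    before = torch.nn.functional.normalize(decoder_before, dim=-1)
    after = torch.nn.functional.normalize(decoder_after, dim=-1)
    sims = after @ before.T                    # pairwise cosine similarities
    best_sim, best_idx = sims.max(dim=-1)      # closest original feature
    return best_sim, best_idx


# Toy usage with hypothetical before/after dictionaries.
decoder_before = torch.randn(4096, 512)
decoder_after = torch.randn(4096, 512)
best_sim, best_idx = match_features(decoder_before, decoder_after)
novel = (best_sim < 0.7).nonzero().squeeze(-1)  # illustrative threshold
print(f"{len(novel)} features have no close match in the original dictionary")
```

Low best-match similarity would flag features the fine-tuning may have added, while shifts in where the matches land would reflect existing features being pushed aside to make room.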