Sholto Douglas
Podcast Appearances
And then the feature for like, you know, I guess maybe it's true because like the mass is like a gradient and like, you know, like, I don't know, but the polarity or whatever is a gradient as well.
But there's also a sense in which like there's the laws and the laws are more general and you have to understand like the general laws.
What is a compelling explanation to you, especially for very smart models of
Like, I understand why it made this output, and it was for a legit reason.
If it's doing million-line pull requests or something, what are you seeing at the end of that request where you're like, yep, that's chill?
But before I trace down on that, um,
What does the reasoning circuit look like?
What would that look like when you found it?
Something is happening where when you pick up a new game and you immediately start understanding how to play it, and it doesn't seem like an Induction Heads kind of thing.
What would that... Because Induction Heads is like one-layer transformers.
Either two layers, yeah.
So you can kind of see the thing that is a human picks up a new game and understands it.
How would you think about what that is?
Presumably it's across multiple layers, but what would that physically look like?
How big would it be maybe?
And, but like, is the story of how you found it with the reasoning thing like, 'cause you won't be able to understand, or it'll just be really, you know, it won't be something you can see in a two-layer transformer.
So will you just be like, the circuit for deception or whatever, it's just that this part of the network fired when we at the end identified the thing as being deceptive, and it didn't fire when we didn't identify it as being deceptive, therefore this must be the deception circuit?
And that requires us at the end to be able to label which one is like bad and which one is good.
Although it's doing that by, I don't know, ChatGPT, I think it's probably modeling me because that's like what RLHF induces them to.