Sholto Douglas
Podcast Appearances
By the way, for the audience, a linear probe is where you just, like, classify the activations.
I don't know.
From what I vaguely remember about the paper, it was, like, if it's telling a lie, then you just train a classifier on, yeah, in the end, was it a lie or was it just, like, wrong or something?
I don't know.
Yeah, it's, like, a classifier on activations.
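[Editor's sketch: to make "a classifier on activations" concrete, here is a minimal linear probe, assuming synthetic stand-in data. Nothing here comes from the paper being discussed: the "activations" are random vectors with a planted direction, the binary label stands in for something like "was it a lie", and the function names are hypothetical. The probe itself is plain logistic regression trained by gradient descent.]

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n, dim, seed):
    # ASSUMPTION: stand-in for real model activations. Each vector is
    # Gaussian noise plus a shift along one planted direction whose sign
    # encodes the binary label (1 = "lie", 0 = "not a lie").
    rng = random.Random(seed)
    direction = [1.0 / math.sqrt(dim)] * dim
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + sign * 2.0 * d for d in direction])
        ys.append(y)
    return xs, ys


def train_linear_probe(xs, ys, lr=0.5, epochs=100):
    # Logistic regression: one weight per activation dimension plus a bias,
    # fit with full-batch gradient descent on the cross-entropy loss.
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    # Fraction of examples where the probe's thresholded output matches the label.
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(xs)


train_x, train_y = make_synthetic_activations(400, 16, seed=0)
test_x, test_y = make_synthetic_activations(200, 16, seed=1)
weights, bias = train_linear_probe(train_x, train_y)
held_out_accuracy = probe_accuracy(weights, bias, test_x, test_y)
```

[Note that the probe needs a label for every training example, which is the supervision requirement raised later in this exchange.]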
And it's like... So you've done the projecting out to the million whatever features or something.
Mm-hmm.
Is a circuit, because maybe we're using feature and circuit interchangeably when they're not.
So is there like a deception?
But doesn't all this require you to have labels for all those examples?
And if you have those labels, then whatever faults the linear probe has, maybe you've labeled the wrong thing or whatever, wouldn't the same thing apply to the labels you've come up with for the unsupervised features?
And I guess the hope is you found a bunch of things that light up when it's being deceptive, and then you can figure out why some of those things are lighting up in this part of the distribution and not this other part and so forth.
Do you anticipate you'll be able to understand?
I don't know, the current models you've studied are pretty basic, right?
Do you think you'll be able to understand why GPT-7 fires in certain domains but not in other domains?
What is the highest level feature you've found so far?
Like, maybe it's just, in The Symbolic Species, the book you recommended, there's indexical things, I forgot what all the labels were, but there's things where you see a tiger and you're like, run, and whatever, you know, just a very sort of behaviorist thing.
And then there's a higher level at which, when I refer to love, it refers to, like, a movie scene or my girlfriend or whatever, you know what I mean?
So it's like the top of the tent.
What is the highest level?