Sholto Douglas
Do you want to explain what feature splitting is?
And maybe give an example of that.
Okay, so let's go back to GPT-7.
First of all, is this a sort of like linear tax on any model to figure out?
Even before that, is this a one-time thing you had to do or is this the kind of thing you have to do on every output?
Or is it just one time, it's not deceptive, we're good to go?
Actually, yeah, let me let you answer that.
For the audience, weights are, I don't know if permanent is the right word, but they are the model itself, whereas activations are the artifacts of any single call.
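The weights-versus-activations distinction can be made concrete with a tiny sketch. This is a hypothetical toy, not any real model: a single fixed linear layer stands in for the weights, and the values computed during one forward call stand in for the activations.

```python
import numpy as np

# Hypothetical tiny "model": one linear layer with fixed weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # weights: learned once, reused for every call

def forward(x):
    # Activations: intermediate values produced for this one input only,
    # discarded after the call unless we deliberately store them.
    pre = x @ W               # pre-activation for this input
    act = np.maximum(pre, 0)  # ReLU activation
    return act

a1 = forward(np.ones(4))
a2 = forward(np.zeros(4))
# W is unchanged by either call; a1 and a2 exist only per input.
```

Interpretability work of the kind discussed here reads off those per-call activations, not the weights themselves.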
So there's going to be two steps to this for GPT-7 or whatever model we're concerned about.
First, correct me if I'm wrong, you train the sparse autoencoder and do the unsupervised projection into a wider space of features that have a higher fidelity to what is actually happening in the model.
And then secondly, you label those features.
Because let's say the cost of training the model is N. What will those two steps cost relative to N?
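The two steps in the question can be sketched minimally. All names and dimensions below are illustrative assumptions, not any lab's actual setup: a sparse autoencoder projects model activations into a wider dictionary with an L1 sparsity penalty (the training loop itself is omitted), and the labeling step then looks at what most strongly activates each feature.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 8, 32               # dictionary wider than the activation space
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))

def sae(x, l1_coef=1e-3):
    # Step 1: project activations into a wider, sparse feature space.
    f = np.maximum(x @ W_enc + b_enc, 0)   # sparse feature activations
    x_hat = f @ W_dec                      # reconstruction of the input
    # Training would minimize reconstruction error plus an L1 sparsity term.
    loss = np.mean((x - x_hat) ** 2) + l1_coef * np.abs(f).mean()
    return f, x_hat, loss

x = rng.normal(size=(16, d_model))         # stand-in for model activations
f, x_hat, loss = sae(x)
# Step 2 (labeling): for each feature column f[:, j], collect the inputs that
# activate it most strongly and ask a human or a model to name the pattern.
```

The relative cost question then splits naturally: step 1 is an extra training run over stored activations, and step 2 is one labeling pass per dictionary feature.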
Although, given the way that these features are not organized into
things that are intuitive for humans, right?
Because we just haven't had to deal with base64 before, so we don't dedicate that much, you know, whatever firmware to deconstructing which kind of base64 it is.
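For context on there being different "kinds" of base64 at all: even the standard library distinguishes variants of the encoding, which differ only in two alphabet characters, so telling them apart is exactly the sort of distinction humans never practice.

```python
import base64

# Bytes chosen (arbitrarily) so every 6-bit group is index 62,
# the one position where the two common base64 alphabets differ.
data = b"\xfb\xef\xbe"

std = base64.b64encode(data)          # standard alphabet uses '+' and '/'
url = base64.urlsafe_b64encode(data)  # URL-safe variant uses '-' and '_'
# std == b"++++", url == b"----": same bytes, two kinds of base64.
```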
How would we know what the subjects are? And this will go back to maybe the MoE discussion we'll have.
I guess we might as well talk about it now, but in mixture of experts, the paper talked about how they couldn't find that the experts were specialized in a way that we could understand.
There's not like a chemistry expert or a physics expert or something.
So why would you think that there will be, like, a biology feature that you then deconstruct, rather than some inscrutable blob that you deconstruct and it's anthrax, and then shoes, and whatever?
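One reason experts need not line up with human subjects is visible in a minimal sketch of MoE routing. This is a generic top-k token router under assumed toy dimensions, not the architecture of any particular model: routing happens per token via a learned gate, so nothing forces an expert to own a whole topic like "chemistry".

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2
W_gate = rng.normal(size=(d_model, n_experts))  # learned routing weights

def route(x):
    # Score each token against every expert, softmax over experts.
    logits = x @ W_gate
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Each token goes to its top-k experts, weighted by gate probability.
    top = np.argsort(probs, axis=-1)[:, -top_k:]
    return top, probs

tokens = rng.normal(size=(4, d_model))   # 4 tokens in a batch
experts, probs = route(tokens)
# Routing is decided token by token, not document by document, so an
# "expert" can end up specializing in syntax or punctuation rather than
# anything that resembles a human subject category.
```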