Sholto Douglas
Isn't that when you get generalization? Like, grokking happens in that regime, right?
Another question.
So the distilled models, like...
First of all, okay, so what is happening there?
Because the earlier claim we were talking about is that smaller models are worse at learning than bigger models.
But with GPT-4 Turbo, you could make the claim that it's actually worse at reasoning-style tasks than GPT-4 while probably knowing the same facts, as if the distillation got rid of some of the reasoning ability.
Oh, okay.
Yeah.
Yeah.
How do you interpret what's happening in distillation?
I think Gwern had one of these questions on his website.
Why can't you train the distilled model directly?
Why does it have to go through the bigger model first?
Is the picture that you have to project it from this bigger space down to a smaller space?
I don't remember, but do you know?
Yeah, exactly, exactly.
Yep, yep.
Just to make sure the audience got that: when you're distilling into a smaller model, you see the teacher's full probabilities over the tokens it was predicting, compare them against the ones you were predicting, and update through all of those probabilities, rather than just seeing the single next word and updating on that.
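To make that concrete, here is a minimal sketch (in PyTorch, with hypothetical names like `student_logits` and `teacher_logits`) of the difference between ordinary next-token training on a single hard label and distillation against the teacher's full token distribution. It's an illustration of the idea being described, not any particular lab's training code.

```python
# Sketch: hard-label next-token loss vs. soft-label distillation loss.
# All tensor names are illustrative placeholders.
import torch
import torch.nn.functional as F

def hard_label_loss(student_logits, next_token_ids):
    # Ordinary pretraining: the only signal at each position is which
    # single token actually came next.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        next_token_ids.view(-1),
    )

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Distillation: the student matches the teacher's full probability
    # distribution over the vocabulary at every position, not just the
    # sampled token, so each position carries much more information.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return (
        F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
        * temperature ** 2
    )
```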
Okay, so this actually raises a question I was intending to ask you.
Right. I think you were the one who mentioned that you can think of chain of thought as adaptive compute, of like...