Andrej Karpathy
Podcast Appearances
You can hill climb on that without getting expert trajectories to imitate.
So that's amazing.
And the model can also discover solutions that the human might never come up with.
So this is incredible.
And yet, it's still stupid.
So I think we need more.
And so I saw a paper from Google yesterday that tried to have this reflect and review idea in mind.
Was it the memory bank paper or something?
I don't know.
I've actually seen a few papers along these lines.
So I expect there to be some kind of a major update to how we do algorithms for LLMs coming in that realm.
And then I think we need three or four or five more.
Something like that.
So process-based supervision just refers to the fact that we're not going to have a reward function only at the very end. After you've done 10 minutes of work, I'm not going to tell you you did well or not well.
I'm going to tell you at every single step of the way how well you're doing.
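To make the contrast concrete, here is a minimal sketch of the two reward shapes being described: one signal at the very end versus one signal per step. The function names and `toy_judge` are hypothetical and exist only for illustration; a real per-step judge is exactly the hard part the speaker goes on to discuss.

```python
from typing import Callable, List

Step = str

def outcome_only(steps: List[Step], solved: bool) -> List[float]:
    # Sparse signal: nothing until the final step, then 1.0 or 0.0.
    if not steps:
        return []
    return [0.0] * (len(steps) - 1) + [1.0 if solved else 0.0]

def process_based(
    steps: List[Step],
    score_step: Callable[[List[Step], Step], float],
) -> List[float]:
    # Dense signal: each step is scored given the steps that came before it.
    return [score_step(steps[:i], step) for i, step in enumerate(steps)]

def toy_judge(prefix: List[Step], step: Step) -> float:
    # Hypothetical stand-in for a step-level judge; not a real scoring rule.
    return 0.0 if "guess" in step else 1.0

steps = ["expand the product", "guess the root", "check by substitution"]
print(outcome_only(steps, solved=True))  # [0.0, 0.0, 1.0]
print(process_based(steps, toy_judge))   # [1.0, 0.0, 1.0]
```

The outcome-only list says nothing about which step helped or hurt, while the per-step list does, but only as well as the judge behind it.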
And basically the reason we don't have that is that it's tricky to do properly, because you have partial solutions and you don't know how to assign credit.
So when you get the right answer, it's just an equality match to the answer.
Very simple to implement.
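As a rough illustration of how little code the outcome-based case needs, the sketch below scores a final answer by equality match against a reference. The name `outcome_reward` and the whitespace and case normalization are assumptions added for the example, not something stated in the conversation.

```python
def outcome_reward(model_answer: str, reference: str) -> float:
    # Assumed normalization: collapse whitespace and ignore case.
    def normalize(text: str) -> str:
        return " ".join(text.split()).lower()

    # The whole reward is a single equality check on the final answer.
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

print(outcome_reward(" 42 ", "42"))  # 1.0
print(outcome_reward("41", "42"))    # 0.0
```

Nothing in this check looks at the intermediate steps, which is why it is easy to automate and why it gives no partial credit.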
If you're doing process supervision, how do you assign partial credit in an automatable way?