Andrej Karpathy
So incredible.
And that was like two years, three years of work.
And now came RL.
And RL allows you to do a bit better than just imitation learning, right?
Because you can have these reward functions and you can hill climb on the reward functions.
And so some problems have just correct answers.
You can hill climb on that without getting expert trajectories to imitate.
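To make the idea of hill climbing on a verifiable reward concrete, here is a minimal toy sketch. It is not any lab's actual training method: the task (answering 7 + 5), the candidate list, and the names `reward` and `train` are all assumptions made up for illustration. The point is only that the learner never sees an expert's answer; it samples, gets checked, and reinforces whatever was rewarded.

```python
import math
import random


def reward(answer: int) -> float:
    """Verifiable reward (hypothetical task): 1.0 only for the exact right answer."""
    return 1.0 if answer == 7 + 5 else 0.0


def train(candidates, steps=200, lr=0.5, seed=0):
    """Sample answers from a softmax over preferences and reinforce rewarded ones."""
    rng = random.Random(seed)
    prefs = {c: 0.0 for c in candidates}  # unnormalized log-preferences
    for _ in range(steps):
        weights = [math.exp(prefs[c]) for c in candidates]
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        # No expert trajectory is used: the only signal is the checked reward.
        prefs[choice] += lr * reward(choice)
    return prefs


prefs = train([10, 11, 12, 13, 14])
best = max(prefs, key=prefs.get)
print(best, prefs)
```

Only the correct answer ever earns a positive update here, so its preference is the only one that climbs.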
So that's amazing.
And the model can also discover solutions that the human might never come up with.
So this is incredible.
And yet, it's still stupid.
So I think we need more.
And so I saw a paper from Google yesterday that tried to have this reflect-and-review idea in mind.
Was it the memory bank paper or something?
I don't know.
I've actually seen a few papers along these lines.
So I expect there to be some kind of a major update to how we do algorithms for LLMs coming in that realm.
And then I think we need three or four or five more.
Something like that.
So process-based supervision just refers to the fact that we're not going to have a reward function only at the very end. It's not that after you've done 10 minutes of work, I tell you whether you did well or not.
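As a rough sketch of the contrast being drawn, the snippet below scores the same short chain of work two ways: once with a single signal at the end, and once with a signal at every step. The arithmetic steps, the step checker, and the function names `outcome_rewards` and `process_rewards` are hypothetical, chosen only so the example is checkable; real process supervision would need some way to judge each step, which is assumed away here.

```python
import operator
from typing import Callable, List

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def step_is_correct(step: str) -> bool:
    """Check one hypothetical step written as 'a op b = c'."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    return OPS[op](int(a), int(b)) == int(rhs)


def outcome_rewards(steps: List[str], final_ok: bool) -> List[float]:
    """Outcome-based: every step gets 0 except one signal at the very end."""
    if not steps:
        return []
    return [0.0] * (len(steps) - 1) + [1.0 if final_ok else 0.0]


def process_rewards(steps: List[str], step_ok: Callable[[str], bool]) -> List[float]:
    """Process-based: each intermediate step gets its own signal."""
    return [1.0 if step_ok(s) else 0.0 for s in steps]


steps = ["2 + 3 = 5", "5 * 4 = 20", "20 - 1 = 18"]  # the last step is wrong
outcome = outcome_rewards(steps, final_ok=step_is_correct(steps[-1]))
process = process_rewards(steps, step_is_correct)
print(outcome)  # [0.0, 0.0, 0.0]
print(process)  # [1.0, 1.0, 0.0]
```

With only the outcome signal, the two correct steps look no different from the wrong one; the per-step signal shows where the work went off track.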