Andrej Karpathy
You're given a math problem, and you're trying to find a solution.
Now, in reinforcement learning, you will try lots of things in parallel first.
So you're given a problem, you try hundreds of things,
different attempts.
And these attempts can be complex, right?
They can be like, oh, let me try this, let me try that, this didn't work, that didn't work, et cetera.
And then maybe you get an answer.
And now you check the back of the book and you see, okay, the correct answer is this.
And then you can see that, okay, this one, this one, and that one got the correct answer, but these other 97 of them didn't.
So literally what reinforcement learning does is it goes to the ones that worked really well, and every single thing you did along the way, every single token gets up-weighted of, like, do more of this.
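The procedure described above can be sketched in a few lines of toy Python. This is a minimal illustration, not Karpathy's actual setup: the attempt generator, the token names, and the answer-checking are all made up here. The point it shows is only the credit assignment: every token of every attempt that reached the correct answer gets the same "do more of this" increment.

```python
import random

random.seed(0)

CORRECT_ANSWER = 7

def sample_attempt():
    # Stand-in for the model producing a chain of reasoning "tokens"
    # and a final answer (hypothetical toy, not a real model).
    tokens = [random.choice(["step_a", "step_b", "dead_end"]) for _ in range(5)]
    answer = random.choice([3, 7])
    return tokens, answer

token_weight = {}  # accumulated "do more of this" signal per token

for _ in range(100):  # hundreds of parallel attempts
    tokens, answer = sample_attempt()
    reward = 1.0 if answer == CORRECT_ANSWER else 0.0
    # Every token in a winning trajectory gets the same +1 credit,
    # including any "dead_end" steps taken along the way.
    for t in tokens:
        token_weight[t] = token_weight.get(t, 0.0) + reward

print(token_weight)
```

Note that `dead_end` steps accumulate positive weight too, as long as they happened to sit inside a trajectory that ended correctly, which is exactly the problem discussed next.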
The problem with that is, I mean, people will say that your estimator has high variance, but it's just noisy.
It's noisy.
So basically, it kind of almost assumes that every single little piece of the solution that arrived at the right answer was the correct thing to do, which is not true.
Like, you may have gone down the wrong alleys
until you arrived at the right solution.
Every single one of those incorrect things you did, as long as you got to the correct solution, will be up-weighted as do more of this.
It's terrible.
It's noise.
You've done all this work, and at the end you get a single number of, like, oh, you did correct.
And based on that, you weigh that entire trajectory as like up-weight or down-weight.
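That last point, one scalar weighing an entire trajectory, can be made concrete with toy values (the returns and lengths below are invented for illustration): in plain REINFORCE-style up-weighting, the per-token learning signal is the same trajectory-level return broadcast to every position.

```python
# Hypothetical toy values: one scalar return per sampled attempt,
# and the length of each attempt in tokens.
returns = [1.0, 0.0, 0.0, 1.0]
trajectory_lengths = [5, 7, 6, 8]

# The per-token signal is just the trajectory's return repeated at
# every position: no token-level distinction between the good steps
# and the wrong alleys within the same attempt.
per_token_signal = [[r] * n for r, n in zip(returns, trajectory_lengths)]

print(per_token_signal[0])  # every token of attempt 0 gets 1.0
print(per_token_signal[1])  # every token of attempt 1 gets 0.0
```

One bit of information at the end of each rollout is being spread across every token that produced it, which is why the resulting gradient estimate is so noisy.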