Andrej Karpathy
And then maybe you get an answer.
And now you check the back of the book and you see, okay, the correct answer is this.
And then you can see that, okay, this one, this one, and that one got the correct answer, but these other 97 of them didn't.
So literally what reinforcement learning does is it goes to the ones that worked really well, and every single thing you did along the way, every single token, gets up-weighted, as in: do more of this.
The problem with that is, I mean, people will say that your estimator has high variance, but, I mean, it's just noisy.
It's noisy.
So basically, it kind of almost assumes that every single little piece of the solution that arrived at the right answer was the correct thing to do, which is not true.
Like, you may have gone down the wrong alleys until you arrive at the right solution.
Every single one of those incorrect things you did, as long as you got to the correct solution, will be up-weighted as "do more of this."
It's terrible.
It's noise.
You've done all this work, and at the end you get a single number: oh, you were correct.
And based on that, you weight that entire trajectory up or down.
And so the way I like to put it is you're sucking supervision through a straw. You've done all this work, which could be a minute of rollout, and you're sucking the bits of supervision of the final reward signal through a straw, then broadcasting that across the entire trajectory and using it to up-weight or down-weight the whole thing.
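What's being described is the credit-assignment step of a REINFORCE-style policy gradient: one scalar reward from the final answer check is broadcast uniformly to every token in the rollout. Here's a minimal sketch of just that broadcasting step; the function name, the toy rollouts, and the token labels are all illustrative, not anything from an actual training stack.

```python
# Minimal sketch of trajectory-level credit assignment: one final reward,
# smeared uniformly across every token of the rollout (names are made up).

def trajectory_weights(rollouts, correct_answer):
    """For each rollout (token list plus final answer), assign the SAME
    scalar weight to every token: 1.0 if the final answer matched the
    back of the book, 0.0 otherwise. No per-step credit at all."""
    weighted = []
    for tokens, answer in rollouts:
        reward = 1.0 if answer == correct_answer else 0.0
        # A single bit of supervision, broadcast across the whole trajectory:
        weighted.append([(tok, reward) for tok in tokens])
    return weighted

# Two toy rollouts: the first wanders down a wrong alley but still lands
# on the right answer, so even "wrong_alley" gets up-weighted.
rollouts = [
    (["step_a", "wrong_alley", "backtrack", "step_b"], 42),  # correct answer
    (["step_a", "step_c"], 17),                              # wrong answer
]

for traj in trajectory_weights(rollouts, correct_answer=42):
    print(traj)
```

Note that `"wrong_alley"` receives the same weight as the genuinely useful steps, which is exactly the noise being complained about here: the estimator is unbiased, but a correct final answer up-weights every mistake made along the way.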
It's crazy.
A human would never do this.
Number one, a human would never do hundreds of rollouts.
Number two, when a person finds a solution, they have a pretty complicated process of review: okay, I think I did these parts well, these parts not so well.
I should probably do this or that.