Andrej Karpathy
And so the way I like to put it is you're sucking supervision through a straw: you've done all this work that could be a minute of rollout, and you're sucking the bits of supervision of the final reward signal through a straw, broadcasting it across the entire trajectory and using it to upweight or downweight that trajectory.
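A minimal sketch of the update being described, assuming a vanilla REINFORCE-style policy gradient in PyTorch (names here are illustrative, not anyone's actual training code): one scalar reward from the end of the rollout is broadcast uniformly over every token's log-probability.

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, final_reward: float) -> torch.Tensor:
    """One scalar reward, broadcast over every token of the trajectory."""
    # Every token gets the same credit or blame, no matter which parts
    # of the rollout were actually good or bad.
    return -(final_reward * token_logprobs).sum()

# Illustrative use: a 1000-token rollout, a single reward at the end.
token_logprobs = torch.randn(1000, requires_grad=True)
loss = reinforce_loss(token_logprobs, final_reward=1.0)
loss.backward()  # the lone reward signal is smeared across all 1000 tokens
```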
It's crazy.
A human would never do this.
Number one, a human would never do hundreds of rollouts.
Number two, when a person finds a solution, they have a pretty complicated review process: okay, I think I did these parts well, these parts not so well.
I should probably do this or that.
And they think through things.
There's nothing in current LLMs that does this.
There's no equivalent of it.
But I do see papers popping up that are trying to do this, because it's obvious to everyone in the field.
So I kind of see it like this: the first imitation learning, by the way, was actually extremely surprising and miraculous, that we could fine-tune by imitating humans.
And that was incredible.
Because in the beginning, all we had was base models.
Base models are autocomplete.
And it wasn't obvious to me at the time, and I had to learn this, and the paper that blew my mind was InstructGPT.
Because it pointed out that, hey, you can take the pre-trained model, which is autocomplete, and if you just fine-tune it on text that looks like conversations, the model will very rapidly adapt to become very conversational.
And it keeps all the knowledge from pre-training.
And this blew my mind, because I didn't understand that it could adjust stylistically so quickly and become an assistant to the user through just a few loops of fine-tuning on that kind of data.
It was very miraculous to me that that worked.
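A minimal sketch of what that fine-tuning step looks like, assuming the Hugging Face transformers API with gpt2 standing in for the base model (the Human:/Assistant: template is illustrative, not the actual InstructGPT format): the loss is still plain next-token prediction, just computed on text formatted as a conversation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any base (autocomplete) checkpoint; gpt2 is a stand-in here.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Render the conversation as plain text: to the base model this is just
# another document, it only happens to look like a dialogue.
example = "Human: What is the capital of France?\nAssistant: Paris.\n"
batch = tok(example, return_tensors="pt")

# Standard autocomplete (next-token) loss on the conversational text.
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()  # repeat over a dataset of such conversations
```

The point the code makes concrete: nothing about the objective changes between pre-training and this fine-tuning; only the data distribution does, which is why the style shift happens so quickly while the pre-trained knowledge is retained.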