Andrej Karpathy
I will say that...
There have been some papers that I thought were interesting that actually look at the mechanisms behind in-context learning.
And I do think it's possible that in-context learning actually runs a small gradient descent loop internally in the layers of the neural network.
And so I recall one paper in particular where they were doing linear regression, actually, using in-context learning.
So basically, your inputs into the neural network are (x, y) pairs.
x, y, x, y, x, y, that all happen to lie on a line.
And then you give it an x and you expect the y. And the neural network, when you train it in this way, actually does do linear regression.
And normally, when you run linear regression, you have a small gradient descent optimizer that basically looks at the (x, y) pairs, looks at the error, calculates the gradient with respect to the weights, and does the update a few times.
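To make that concrete, here is a minimal sketch in Python/NumPy with made-up numbers, not any specific paper's recipe: the interleaved x, y prompt a transformer would see, and the small explicit gradient descent loop that classical linear regression runs as the baseline.

```python
# A minimal sketch of the setup described above (illustrative numbers, not a
# particular paper's exact recipe): the interleaved (x, y) "prompt" a transformer
# would see, and the small gradient descent loop of ordinary linear regression.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground-truth line y = w*x + b; the model never sees w, b directly.
w_true, b_true = 2.5, -0.7
xs = rng.uniform(-1.0, 1.0, size=8)
ys = w_true * xs + b_true

# In-context format: x, y, x, y, ..., then a query x the model should complete.
x_query = 0.4
tokens = []
for x, y in zip(xs, ys):
    tokens.extend([x, y])
tokens.append(x_query)  # a trained transformer would be asked to emit y_query next

# The classical baseline: a small gradient descent loop on squared error.
w, b = 0.0, 0.0
lr = 0.2
for _ in range(300):
    pred = w * xs + b
    err = pred - ys                 # look at the error
    grad_w = (err * xs).mean()      # gradient with respect to the weights
    grad_b = err.mean()
    w -= lr * grad_w                # do the update
    b -= lr * grad_b

print(f"recovered w={w:.3f}, b={b:.3f}; prediction at x_query: {w * x_query + b:.3f}")
```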
It just turns out that when they looked at the weights of that in-context learning algorithm,
they actually found some analogies to gradient descent mechanics.
In fact, I think the paper went even further, because they actually hard-coded the weights of a neural network to do gradient descent through attention and all the internals of the neural network.
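As a small numerical check of the identity that makes that kind of hard-coding possible, not the paper's actual construction: one explicit gradient descent step changes the prediction at the query point by an attention-like weighted sum over the context examples, with their residual errors acting as values and dot products with the query acting as scores.

```python
# Numerical check (not the paper's exact construction): one gradient descent step
# on the in-context examples updates the query prediction by an attention-style
# weighted sum -- residual errors as "values", dot products x_i . x_q as "scores".
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 16                         # illustrative dimensions
X = rng.normal(size=(n, d))          # context inputs
w_star = rng.normal(size=d)
y = X @ w_star                       # context targets on a (hyper)plane
x_q = rng.normal(size=d)             # query input

w0 = np.zeros(d)                     # current weights
eta = 0.01                           # step size

# Path A: take one explicit gradient descent step on 0.5 * sum of squared errors.
grad = X.T @ (X @ w0 - y)
w1 = w0 - eta * grad
pred_gd = w1 @ x_q

# Path B: never form w1; update the query prediction directly with an
# attention-like weighted sum over the context examples.
errors = y - X @ w0
scores = X @ x_q
pred_attn = w0 @ x_q + eta * errors @ scores

print(pred_gd, pred_attn)            # identical up to floating point error
assert np.allclose(pred_gd, pred_attn)
```

Because the second path only needs errors and dot products, it is the kind of computation an attention layer can express, which is presumably what lets the weights be set by hand to mimic a gradient descent step.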
So I guess that's just my only pushback: who knows how in-context learning works, but I actually think it's probably doing a little bit of some kind of funky gradient descent internally, and I think that's possible.
So I guess I was only pushing back on your saying it's not doing in-context learning.
Who knows what it's doing, but it's probably doing something similar, but we don't know.
I think I kind of agree.
I mean, the way I usually put this is that for anything that happens during the training of the neural network, the knowledge is only kind of like a hazy recollection of what happened at training time.
And that's because the compression is dramatic.
You're taking 15 trillion tokens and compressing them into your final network of just a few billion parameters.
So obviously there's a massive amount of compression going on.
So I kind of refer to it as like a hazy recollection of the internet documents.
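As rough back-of-envelope arithmetic only, with an assumed bytes-per-token figure and an assumed parameter count of 8B standing in for "a few billion", the ratio comes out to a few thousand to one.

```python
# Back-of-envelope arithmetic only; bytes-per-token, parameter count, and weight
# precision are assumptions, not figures from the conversation (beyond the
# 15 trillion tokens and "a few billion parameters" mentioned above).
tokens          = 15e12     # 15 trillion training tokens
bytes_per_token = 4         # assume ~4 bytes of raw text per token
params          = 8e9       # "a few billion" parameters, say 8B
bytes_per_param = 2         # assume 16-bit weights

data_bytes  = tokens * bytes_per_token    # ~60 TB of raw text
model_bytes = params * bytes_per_param    # ~16 GB of weights

print(f"data: {data_bytes/1e12:.0f} TB, model: {model_bytes/1e9:.0f} GB, "
      f"ratio ~{data_bytes/model_bytes:,.0f}x")   # on the order of a few thousand to one
```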