Andrej Karpathy
It's the practically possible version, with our technology and what we have available to us, of getting to a starting point where we can actually do things like reinforcement learning and so on.
So it's subtle, and I think you're right to push back on it.
But basically, what pre-training is doing is training a next-token predictor over the internet, and baking that into a neural net.
It's actually doing two things that are kind of unrelated.
Number one, it's picking up all this knowledge, as I call it.
Number two, it's actually becoming intelligent.
By observing the algorithmic patterns on the internet, it kind of boots up all these little circuits and algorithms inside the neural net to do things like in-context learning and all this kind of stuff.
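To make that concrete, here is a minimal sketch of the next-token-prediction objective being described. The tiny model and toy token corpus are placeholders for illustration, not anything from the conversation:

```python
# Minimal sketch of next-token-prediction pre-training.
# The toy corpus and tiny model stand in for internet text and a transformer.
import torch
import torch.nn as nn

corpus = torch.randint(0, 100, (1, 65))          # toy sequence of 65 token ids
inputs, targets = corpus[:, :-1], corpus[:, 1:]  # predict token t+1 from tokens <= t

class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # stand-in for a transformer
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)                      # logits over the next token at each position

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    logits = model(inputs)                       # (1, 64, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```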
And you don't actually need or want the knowledge.
I think it's probably holding back the neural networks overall, because it gets them to rely on the knowledge a little too much sometimes.
For example, one thing I feel agents are not very good at is going off the data manifold of what exists on the internet.
If they had less knowledge or less memory, actually maybe they would be better.
And so what I think we have to do going forward, and this would be part of the research paradigm, is figure out ways to remove some of the knowledge and keep what I call this cognitive core.
It's this intelligent entity that is stripped of knowledge but contains the algorithms and the magic, you know, of intelligence and problem solving and the strategies of it and all this kind of stuff.
I'm hesitant to say that in-context learning is not doing gradient descent. It's not doing explicit gradient descent, but I still think something like it may be happening. In-context learning, basically, is pattern completion within a token window, right?
And it just turns out that there's a huge amount of patterns on the internet.
And so you're right, the model kind of like learns to complete the pattern, right?
And that's inside the weights.
The weights of the neural network are trying to discover patterns and complete the pattern.
And there's some kind of an adaptation that happens inside the neural network, right?
Which is kind of magical and just falls out of internet training, just because there are a lot of patterns.
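As an illustration of pattern completion within a token window, here is a sketch assuming a HuggingFace-style causal language model interface; the few-shot prompt and the `complete` helper are hypothetical, and the point is only that the frozen weights complete the pattern with no gradient update:

```python
# Sketch of in-context learning as pattern completion: the "task" lives entirely
# inside the prompt, and a frozen model is asked to continue it.
# Assumes a HuggingFace-style causal LM and tokenizer; nothing is fine-tuned.
import torch

prompt = (
    "sea otter -> loutre de mer\n"
    "cheese -> fromage\n"
    "plush giraffe -> girafe en peluche\n"
    "cheeseburger -> "          # the pattern the model is expected to complete
)

@torch.no_grad()                # no gradient step: the weights never change
def complete(model, tokenizer, text, n_new_tokens=10):
    ids = tokenizer.encode(text, return_tensors="pt")
    for _ in range(n_new_tokens):
        logits = model(ids).logits                        # causal-LM forward pass
        next_id = logits[:, -1].argmax(-1, keepdim=True)  # greedy next token
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0])
```

The adaptation here happens purely inside the token window: the mapping is defined by the examples in the prompt, and the model completes it at inference time without any update to its parameters.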