Andrej Karpathy
We don't have an equivalent of that in large language models.
And that's, to me, more adjacent to when you talk about continual learning and so on as absent.
These models don't really have this distillation phase of taking what happened, analyzing it, obsessively thinking through it, basically doing some kind of a synthetic data generation process and distilling it back into the weights, and maybe having a specific neural net per person.
Maybe it's a LoRA, it's not a full...
Yeah, it's not a full-weight neural network.
It's just that a small, sparse subset of the weights is changed.
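The idea of changing only a small subset of weights per person can be sketched in a few lines. This is a minimal, illustrative LoRA-style setup (an assumption on my part, not code from the conversation): the base weight matrix `W` stays frozen, and only two small low-rank factors `A` and `B` would be trained, so the per-person "adapter" is tiny compared to the full model.

```python
import numpy as np

d, r = 8, 2  # hidden dim and low rank (toy sizes for illustration)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weight
A = rng.standard_normal((d, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, d))                    # zero-init so the delta starts at 0

def forward(x):
    # Effective weight is W + A @ B; only A and B would receive gradients.
    return x @ (W + A @ B)

x = rng.standard_normal((1, d))
# With B = 0, the adapted model matches the frozen base model exactly.
assert np.allclose(forward(x), x @ W)
# The adapter holds far fewer parameters than the base matrix.
assert A.size + B.size < W.size
```

The design point is that the adapter's parameter count scales with `d * r` rather than `d * d`, which is what makes a per-individual set of weights plausible.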
But basically, we do want to create ways of creating these individuals that have very long contexts.
It's not only about keeping everything in the context window, because the context windows grow very, very long.
Maybe we have some very elaborate sparse attention over it.
But I still think that humans obviously have some process for distilling some of that knowledge into the weights.
We're missing it.
And I do also think that humans have some kind of a very elaborate sparse attention scheme, which I think we're starting to see some early hints of.
So DeepSeek v3.2 just came out, and I saw that they have like a sparse attention as an example.
And this is one way to have very, very long context windows.
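One simple way to see how sparse attention enables very long context windows is a top-k scheme: each query attends only to the k highest-scoring keys instead of all of them. This is a toy illustration of the general idea, not DeepSeek V3.2's actual mechanism (their design and the function below are my assumptions for demonstration):

```python
import numpy as np

def sparse_topk_attention(q, K, V, k=8):
    """Attend only to the top-k keys by score.

    A toy stand-in for sparse attention over long contexts: of T
    cached key/value positions, only k are touched per query.
    """
    scores = K @ q / np.sqrt(q.shape[0])    # (T,) scaled dot-product scores
    idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                            # softmax over just k entries
    return w @ V[idx]                       # weighted sum of k values

rng = np.random.default_rng(1)
T, d = 1024, 16                     # long context, small head dim
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
q = rng.standard_normal(d)

out = sparse_topk_attention(q, K, V, k=8)  # reads 8 of 1024 positions
```

Dense attention costs O(T) per query here; the sparse variant's softmax and value mixing cost O(k), which is what lets the context window T grow very large.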
So I almost feel like we are redoing a lot of the cognitive tricks that evolution came up with through a very different process.
But we're, I think, going to converge on a similar architecture cognitively.
Well, the way I like to think about it is, okay, let's apply translation invariance in time, right?
So 10 years ago, where were we?
2015, we had convolutional neural networks primarily.