Andrej Karpathy
I don't know that I fully resonate with that, because I feel like these models, when you boot them up with zero tokens in the window, are always restarting from scratch from where they were.
So I don't actually know what it looks like in that worldview. But again, maybe it's worth making some analogies to humans, just because I think it's reasonably concrete and kind of interesting to think through.
I feel like when I'm awake, I'm building up a context window of stuff that's happening during the day.
But I feel like when I go to sleep, something magical happens where I don't actually think that that context window stays around.
I think there's some process of distillation into weights of my brain.
And this happens during sleep and all this kind of stuff.
We don't have an equivalent of that in large language models.
And that's, to me, more adjacent to what's absent when you talk about continual learning and so on.
These models don't really have this distillation phase of taking what happened, analyzing it, obsessively thinking through it, basically doing some kind of a synthetic data generation process and distilling it back into the weights, and maybe having a specific neural net per person.
Maybe it's a LoRA, so it's not a full...
Yeah, it's not a full-weight neural network.
It's just a small, sparse subset of the weights that gets changed.
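(As a minimal illustration of that idea, assuming a PyTorch setting: a LoRA-style adapter freezes the pretrained weights and trains only a small low-rank update per layer, which could in principle serve as a per-person adapter. The layer sizes, rank, and training details below are assumptions for the sketch, not anything stated in the conversation.)

```python
# Minimal sketch of the "small sparse subset of weights" idea via a LoRA-style
# adapter. Illustrative only: layer sizes, rank, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + scale * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Hypothetical per-person adapter: only A and B (a tiny fraction of parameters)
# are updated, e.g. on synthetic data distilled from that person's interactions.
layer = LoRALinear(nn.Linear(512, 512))
opt = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-4)
```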
But basically, we do want to create ways of creating these individuals that have very long context.
It's not only about what remains in the context window, because the context windows grow very, very long.
Maybe we have some very elaborate sparse attention over it.
But I still think that humans obviously have some process for distilling some of that knowledge into the weights.
We're missing it.
And I do also think that humans have some kind of a very elaborate sparse attention scheme, which I think we're starting to see some early hints of.
So DeepSeek v3.2 just came out, and I saw that they have sparse attention, as an example.
And this is one way to have very, very long context windows.
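(As a rough sketch of the general idea, not DeepSeek v3.2's actual mechanism: one simple form of sparse attention lets each query attend only to its top-k highest-scoring keys, so each position looks at k keys rather than the whole context. All names and sizes below are illustrative assumptions.)

```python
# Minimal top-k sparse attention sketch: each query keeps only its k_keep
# highest-scoring keys and masks out the rest. Generic illustration only.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep: int = 64):
    """q, k, v: (batch, seq, dim). Each query position attends to at most k_keep keys."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5      # (batch, seq, seq) attention logits
    k_keep = min(k_keep, scores.size(-1))
    topk = scores.topk(k_keep, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk.indices, topk.values)   # keep only the top-k logits per query
    return F.softmax(masked, dim=-1) @ v             # weighted sum over the selected keys

q = k = v = torch.randn(1, 1024, 64)
out = topk_sparse_attention(q, k, v, k_keep=32)      # (1, 1024, 64)
```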