Dwarkesh Patel
And it clearly doesn't exist with LLMs trained on RLVR.
But there might be some other relatively straightforward ways to shoehorn continual learning atop LLMs.
For example, one could imagine making supervised fine-tuning a tool call for the model.
So the outer-loop RL is incentivizing the model to teach itself effectively using supervised learning in order to solve problems that don't fit in the context window.
Now, I'm genuinely agnostic about how well techniques like this will work.
I'm not an AI researcher, but I wouldn't be surprised if they basically replicate continual learning.
And the reason is that models are already demonstrating something resembling human continual learning within their context windows.
The fact that in-context learning emerged spontaneously from the training incentive to process long sequences makes me think that if information could just flow across windows longer than the context limit, then models could meta-learn the same flexibility that they already show in context.
Okay, some concluding thoughts.
Evolution does meta-RL to make an RL agent, and that agent can selectively do imitation learning.
With LLMs, we're going the opposite way.
We have first made this base model that does pure imitation learning, and then we're hoping that we do enough RL on it to make a coherent agent with goals and self-awareness.
Maybe this won't work.
But I don't think these super first-principles arguments about, for example, how these LLMs don't have a true world model are actually proving much.
And I also don't think they're strictly accurate for the models we have today, which are actually undergoing a lot of RL on ground truth.
Even if Sutton's platonic ideal doesn't end up being the path to the first AGI, his first-principles critique is identifying some genuine basic gaps that these models have.
And we don't even notice them because they're so pervasive in the current paradigm, but because he has this decades-long perspective, they're obvious to him.
It's the lack of continual learning.