Dwarkesh Patel
Imitation learning is just short horizon RL.
The episode is a token long.
The LLM is making a conjecture about the next token based on its understanding of the world and how the different pieces of information in the sequence relate to each other.
And it receives reward in proportion to how well it predicted the next token.
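One way to make the "episode is a token long" framing concrete is the sketch below. It assumes the reward is the log-probability the model assigned to the token the human actually wrote, which is only one reading of "reward in proportion to how well it predicted"; the function names and the toy three-token vocabulary are hypothetical and used for illustration only.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_token_episode(logits, observed_token):
    """A single one-step 'episode'.

    Assumption: reward = log-probability given to the observed token.
    """
    probs = softmax(logits)
    reward = math.log(probs[observed_token])
    return probs, reward

def reward_gradient(logits, observed_token):
    """Gradient of the reward with respect to the logits: one_hot - probs."""
    probs = softmax(logits)
    return [(1.0 if i == observed_token else 0.0) - p
            for i, p in enumerate(probs)]

def ascent_step(logits, observed_token, lr=0.5):
    """One gradient-ascent update that raises the reward for this token."""
    grad = reward_gradient(logits, observed_token)
    return [x + lr * g for x, g in zip(logits, grad)]

# Toy vocabulary of three tokens; the human wrote token 1.
logits = [2.0, 0.5, -1.0]
_, reward_before = one_token_episode(logits, observed_token=1)
_, reward_after = one_token_episode(ascent_step(logits, 1), observed_token=1)
print(reward_before, reward_after)
```

Under this assumption, maximizing the reward is the same update as minimizing the usual cross-entropy loss on the next token, which is the sense in which imitation learning looks like very short horizon RL.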
Now, of course, I already hear people saying, no, no, that's not the ground truth.
It's just learning what a human was likely to say.
But there's a different question, which I think is actually more relevant to understanding the scalability of these models.
And that question is, can we leverage this imitation learning to help models learn better from ground truth?
And I think the answer is obviously yes.
After RL'ing these pre-trained base models, we've gotten them to win gold in International Math Olympiad competitions and to code up entire working applications from scratch.
Now, these are ground truth examinations.
Can you solve this unseen Math Olympiad question?
Can you build this application to match a specific feature request?
But you couldn't have RL'd a model to accomplish these tasks from scratch, or at least we don't know how to do that yet.
You needed a reasonable prior over human data in order to kickstart this RL process.
Whether you want to call this prior a proper world model or just a model of humans, I don't think is that important.
It honestly seems like a semantic debate.
Because what you really care about is whether this model of humans helps you start learning from ground truth, aka become a true world model.