Dwarkesh Patel
This process is more analogous to imitation learning than it is to RL from scratch.
Now, of course, are we literally predicting the next token like an LLM would in order to do this cultural learning?
No, of course not.
So even the imitation learning that humans are doing is not like the supervised learning that we do for pre-training LLMs.
But neither are we running around trying to collect some well-defined scalar reward.
No ML learning regime perfectly describes human learning or animal learning.
We're doing things which are both analogous to RL and to supervised learning.
What planes are to birds, supervised learning might end up being to human cultural learning.
I also don't think these learning techniques are actually categorically different.
Imitation learning is just short-horizon RL.
The episode is a token long.
The LLM is making a conjecture about the next token based on its understanding of the world and how the different pieces of information in the sequence relate to each other.
And it receives reward in proportion to how well it predicted the next token.
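To make the "episode is a token long" framing concrete, here is a minimal sketch, not anything taken from the talk. It assumes a toy three-token vocabulary, a softmax policy over logits, and a hypothetical reward of 1 when the sampled token matches the human's token and 0 otherwise. Under those assumptions, the expected one-step policy-gradient (REINFORCE) update points in the same direction as the supervised log-likelihood gradient, scaled by the probability the model already assigns to the correct token.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    # Gradient of log pi(action) with respect to the logits
    # of a softmax policy: one_hot(action) - probs.
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def supervised_gradient(logits, target):
    # Gradient of the log-likelihood of the target token
    # (the negative of the cross-entropy gradient).
    return grad_log_prob(softmax(logits), target)

def expected_reinforce_gradient(logits, target):
    # One-token episode: sample a token, get reward 1 if it matches
    # the target and 0 otherwise (an illustrative assumption).
    # The expectation over actions is computed exactly.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, p in enumerate(probs):
        reward = 1.0 if action == target else 0.0
        g = grad_log_prob(probs, action)
        for i in range(len(grad)):
            grad[i] += p * reward * g[i]
    return grad

logits = [2.0, 0.5, -1.0]   # hypothetical model outputs
target = 1                  # index of the token the human actually wrote

sup = supervised_gradient(logits, target)
rl = expected_reinforce_gradient(logits, target)
p_target = softmax(logits)[target]

# The RL gradient equals the supervised gradient scaled by pi(target).
print([round(r - p_target * s, 12) for r, s in zip(rl, sup)])
```

The two updates differ only by a positive scale factor, which is one way to read the claim that the techniques are not categorically different. A different reward choice, such as the log-probability of the correct token, would give a different but related correspondence.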
Now, of course, I already hear people saying, no, no, that's not the ground truth.
It's just learning what a human was likely to say.
But there's a different question, which I think is actually more relevant to understanding the scalability of these models.
And that question is, can we leverage this imitation learning to help models learn better from ground truth?
And I think the answer is obviously yes.
After RLing these pre-trained base models, we've gotten them to win gold in International Math Olympiad competitions and to code up entire working applications from scratch.