Dwarkesh Patel
It sounds like you're saying that when we do have generalization in these models, that is a result of some sculpted... Humans did it.
Yeah.
I'm not trying to kickstart this initial crux again, but I'm just genuinely curious because I think I might be using the term differently.
I mean, one way to think about it is these LLMs are increasing the scope of generalization from, like, earlier systems, which could not really even do a basic math problem, to now they can do anything in this class of Math Olympiad-type problems, right?
So you initially start with like they can generalize among addition problems, at least.
Then you generalize to, like, they can generalize among problems that require use of different kinds of mathematical techniques and theorems and conceptual categories, which is what the Math Olympiad requires.
And so it sounds like you don't think of being able to solve any problem within that category necessarily as an example of generalization?
Or let me know if I'm misunderstanding that.
My understanding is that this is working better and better with coding agents.
So engineers, obviously, if you're trying to program a library, there's many different ways you could achieve the end spec.
And an initial frustration with these models has been that they'll do it in a way that's sloppy.
And then over time, they're getting better and better at coming up with the design architecture and the abstractions that developers find more satisfying.
And it seems like an example of what you're talking about.
So to prep for this interview, I wanted to understand the full history of RL, starting with REINFORCE up to current techniques like GRPO.
And I didn't just want a list of equations and algorithms.
I wanted to really understand each change in this progression and the underlying motivation.
You know, what was the main problem that each successive method was actually trying to solve?
So I had Gemini Deep Research walk me through this entire timeline step by step.