Dwarkesh Patel
It's that something much more deliberate and rich is happening.
What is the ML analogy, and how does that compare to what we're doing with our models right now?
But you're so good at coming up with evocative phrases.
"Sucking supervision through a straw" is, like, so good.
So you're saying your problem with outcome-based reward is that you have this huge trajectory, and then at the end, you're trying to learn every single possible thing about what you should do and what you should learn about the world from that one final bit.
Given that this is obvious, why hasn't process-based supervision been a successful alternative way to make models more capable?
What has been preventing us from using this alternative paradigm?
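To make the contrast concrete, here is a minimal sketch, in standard policy-gradient shapes, of the distinction being discussed. The function names and tensor layouts are illustrative assumptions, not anything specified in the conversation: outcome-based reward broadcasts one final scalar over the whole trajectory, while process-based supervision scores every step.

```python
import torch

def outcome_based_loss(logprobs, final_reward):
    """Outcome-based reward: a single scalar, produced only at the end of the
    trajectory, is spread back over every action that led to it."""
    # logprobs: tensor of shape (T,), log-probs of the actions actually taken
    # final_reward: one float for the whole trajectory ("that one final bit")
    return -(final_reward * logprobs.sum())

def process_based_loss(logprobs, step_rewards):
    """Process-based supervision: each step gets its own score, e.g. from a
    learned reward model or an LLM judge reading the intermediate work."""
    # step_rewards: tensor of shape (T,), one score per step
    return -(step_rewards * logprobs).sum()
```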
You're basically training the LLM to be a prompt injection model.
So to the extent you think this is the bottleneck to making RL more functional, that will require making LLMs better judges if you want to do this in an automated way.
And so is it just going to be some sort of GAN-like approach, where you have to train models to be more robust?
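A rough sketch of what that GAN-like loop could look like, purely as an illustration: the policy, judge, and helper methods below (sample_with_logprobs, score, trusted_scores) are all assumed scaffolding, not something described in the episode. The policy is rewarded by the judge, so it is incentivized to game it, while the judge is re-anchored to trusted labels so that gamed responses stop paying off.

```python
def adversarial_judge_step(policy, judge, prompts, trusted_scores,
                           policy_opt, judge_opt):
    """One alternating update in a GAN-like loop between a policy and a judge."""
    # --- Policy step: simple REINFORCE, with the judge's score as the reward ---
    responses, logprobs = policy.sample_with_logprobs(prompts)  # assumed helper
    rewards = judge.score(prompts, responses).detach()
    policy_loss = -(rewards * logprobs).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    # --- Judge step: regress onto trusted labels for those same responses, ---
    # --- so responses that merely exploit the judge are scored down again. ---
    judge_loss = ((judge.score(prompts, responses) - trusted_scores) ** 2).mean()
    judge_opt.zero_grad(); judge_loss.backward(); judge_opt.step()
```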
Interesting.
Do you have some sense of what the shape of the other idea could be?
Yeah.
So I guess I can see, not easily, but I can conceptualize how you would be able to train on synthetic examples or synthetic problems that you have made for yourself.
But there seems to be another thing humans do, maybe sleep is this, maybe daydreaming is this, which is not necessarily coming up with fake problems, but just reflecting.
And I'm not sure what the ML analogy is for, you know, daydreaming or sleeping, for just reflecting.
I haven't come up with a new problem.
Yeah, yeah.
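A toy sketch of the self-generated-problems idea raised above, with a reflection pass tacked on. Everything here is hypothetical scaffolding (the model.ask text-in/text-out helper in particular is assumed): the model writes its own practice problems, attempts them, and then reviews its own work without posing any new problem, producing data that could be fed back into training.

```python
def self_practice_round(model, seed_topics, n_problems=8):
    """Generate synthetic problems, attempt them, then reflect on the attempts."""
    training_examples = []
    for topic in seed_topics:
        # 1. Invent practice problems about a topic the model already knows.
        problems = model.ask(f"Write {n_problems} practice problems about {topic}.")

        # 2. Attempt them.
        attempts = model.ask(f"Solve these problems, showing your work:\n{problems}")

        # 3. Reflect: no new problem is posed, the model just reviews its own work.
        reflection = model.ask(
            "Re-read your solutions. What patterns, mistakes, or lessons "
            f"would you keep in mind next time?\n{attempts}"
        )
        training_examples.append({"topic": topic,
                                  "problems": problems,
                                  "attempts": attempts,
                                  "reflection": reflection})
    return training_examples  # could be fed back as finetuning data
```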