Dwarkesh Patel
Podcast Appearances
So it's performing the task people want, but at the same time, it's learning about the world from doing that task.
And do you imagine, okay, so we get rid of this paradigm where there's training periods and then there's deployment periods.
But then do we also get rid of this paradigm where there's the model and then instances of the model or copies of the model that are doing certain things?
How do you think about the fact that we'd want this thing to be doing different things?
We'd want to aggregate the knowledge that it's gaining from doing those different things.
I agree that the kind of thing you're talking about is necessary regardless of whether you start from LLMs or not, right?
If you want human or animal level intelligence, you're going to need this capability.
Suppose a human is trying to make a startup, right?
And this is a thing which has a reward on the order of 10 years.
Once in 10 years, you might have an exit where you get paid out a billion dollars.
But humans have this ability to create intermediate auxiliary rewards, or have some way that, even when they have extremely sparse rewards, they can still take intermediate steps, with an understanding of how the next thing they're doing leads to this grander goal they have.
And so how do you imagine such a process might play out with AIs?
Right. And then you also want some ability to retain the information that you're learning. I mean, one of the things that makes humans quite different from these LLMs is that if you're onboarding on a job, you're picking up so much context and information, and that's what makes you useful at the job, right? Everything from your client's preferences to how the company works to everything.
And is the bandwidth of information that you get from a procedure like TD learning high enough to have this huge pipe of context and tacit knowledge that you'd need to be picking up in the way humans do when they're just deployed?
Yeah.
So it seems to me you need two things.
One is some way of converting this long-run goal reward into smaller auxiliary rewards, or, you know, these predictive rewards of the future reward, or at least the final reward.
Then you need some other way.
Initially, it seems to me you need some way of then, OK, I need to hold on to all this context that I'm gaining as I'm working in the world, right?
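Editor's illustration: the first of the two requirements above, turning a sparse final reward into predictions of future reward, is roughly what tabular TD(0) value learning does. The sketch below is not something the speakers describe. It assumes a hypothetical toy environment (a fixed chain of states with a single reward of 1 at the very end), and the function name, step size, and discount factor are all illustrative choices.

```python
def td0_chain(num_states=10, episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0) on a hypothetical deterministic chain.

    The agent always moves from state s to s + 1. The only reward (1.0)
    arrives on the final transition, so the reward signal is sparse.
    """
    # One extra slot for the terminal state, whose value stays 0.
    values = [0.0] * (num_states + 1)
    for _ in range(episodes):
        for state in range(num_states):
            next_state = state + 1
            reward = 1.0 if next_state == num_states else 0.0
            # TD error: how much the one-step lookahead disagrees with
            # the current prediction for this state.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
    return values[:num_states]


if __name__ == "__main__":
    for state, value in enumerate(td0_chain()):
        print(f"state {state}: predicted future reward {value:.3f}")
```

Under these assumptions, early states end up with nonzero value estimates even though they never receive a reward directly, which is one simple sense in which a distant payoff gets converted into intermediate predictive signals. The sketch says nothing about the second requirement, holding on to context gathered while acting in the world.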