Richard Sutton
Right.
But in a continual learning setup, it just goes into the weights.
Maybe, yeah. So maybe "context" is the wrong word to use, because I mean something more general.
You learn a policy that's specific to the environment that you're finding yourself in.
So maybe the question we're trying to ask is: it seems like the reward is too small a signal to drive all the learning we need to do.
But, of course, we have the sensations, right?
We have all the other information we can learn from.
Right.
We don't just learn from the reward.
We learn from all the data.
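To make that idea concrete, here is a minimal sketch, not anything from the conversation itself: a linear TD(0) predictor applied to every sensory channel rather than only the reward, so each signal gets its own learned prediction. The feature vectors, signal counts, and step sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 8   # size of the agent's feature vector for the current state
n_signals = 4    # sensations being predicted; reward would be just one of them
alpha = 0.1      # step size
gamma = 0.9      # discount factor

# One weight vector per predicted signal.
W = np.zeros((n_signals, n_features))

x = rng.random(n_features)  # features for the current state (a stand-in)
for step in range(1000):
    x_next = rng.random(n_features)  # features after the transition (stand-in for real experience)
    signals = rng.random(n_signals)  # all incoming sensations, reward included (stand-in data)
    # One TD(0) update per signal: learn from all the data, not only the reward.
    td_errors = signals + gamma * (W @ x_next) - (W @ x)
    W += alpha * np.outer(td_errors, x)
    x = x_next
```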
So now I want to talk about the basic common model of the agent, with its four parts.
Right.
So we need a policy.
The policy says: in the situation I'm in, what should I do?
We need a value function.
The value function is the thing that is learned with TD learning.
And the value function produces a number.
The number says, how well is it going?
And then you watch whether that's going up or down and use that to adjust your policy.
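Here is a minimal tabular actor-critic sketch of that loop, under assumed toy dynamics: the value function is learned with TD, and the TD error (is it going better or worse than expected?) is what adjusts the policy. The environment, state and action counts, and step sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
alpha_v, alpha_pi, gamma = 0.1, 0.05, 0.95

V = np.zeros(n_states)                   # value function: "how well is it going?"
prefs = np.zeros((n_states, n_actions))  # policy parameters (action preferences)

def policy(s):
    """Softmax over preferences: in the situation I'm in, what should I do?"""
    p = np.exp(prefs[s] - prefs[s].max())
    return p / p.sum()

s = 0
for step in range(5000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    # Stand-in environment (an assumption): random transitions, with a reward
    # for taking action 1 in state 0.
    s_next = int(rng.integers(n_states))
    r = 1.0 if (s == 0 and a == 1) else 0.0
    # TD error: is it going better (positive) or worse (negative) than expected?
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_v * delta              # critic: TD learning of the value function
    grad = -p                            # gradient of the log softmax policy...
    grad[a] += 1.0                       # ...for the action actually taken
    prefs[s] += alpha_pi * delta * grad  # actor: adjust the policy by the TD error
    s = s_next
```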