Yann LeCun
Podcast Appearances
That's called curiosity, basically, or play, right? When you play, you explore parts of the state space that you wouldn't want to visit for real because they might be dangerous, but you can adjust your world model without killing yourself, basically. So that's what you want to use RL for. When it comes time to learn a particular task, you already have all the good representations, you already have your world model, but you need to adjust it for the situation at hand. That's when you use RL.
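A minimal sketch of that idea, under assumptions not stated in the conversation: a toy world model whose prediction error serves as an intrinsic "curiosity" reward during play, with the same play transitions used to update the model itself. The names `WorldModel`, `curiosity_reward`, and `update_world_model` are purely illustrative.

```python
import torch
import torch.nn as nn

# Toy world model: predicts the next state from (state, action).
class WorldModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def curiosity_reward(model, state, action, next_state):
    # Intrinsic reward: how surprised the world model is by a real transition.
    with torch.no_grad():
        pred = model(state, action)
    return ((pred - next_state) ** 2).mean(dim=-1)

def update_world_model(model, optimizer, state, action, next_state):
    # "Play" transitions adjust the world model itself; no task reward involved.
    loss = ((model(state, action) - next_state) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

state_dim, action_dim = 4, 2
model = WorldModel(state_dim, action_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One fake "play" transition: explore, measure surprise, update the model.
s, a, s_next = torch.randn(1, state_dim), torch.randn(1, action_dim), torch.randn(1, state_dim)
bonus = curiosity_reward(model, s, a, s_next)       # high where the model is wrong
update_world_model(model, optimizer, s, a, s_next)
```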
What's had the transformational effect is human feedback. There are many ways to use it, and some of it is just purely supervised, actually. It's not really reinforcement learning.
It's the HF. And then there are various ways to use human feedback, right? So you can ask humans to rate answers, multiple answers that are produced by a world model. And then what you do is you train an objective function to predict that rating. And then you can use that objective function to predict whether an answer is good.
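A minimal sketch of that step, assuming the (prompt, answer) pairs have already been embedded into fixed-size vectors; the embedding source, `embed_dim`, and the rating scale are assumptions for illustration. A small network is trained by regression to predict the human rating.

```python
import torch
import torch.nn as nn

# Reward-model sketch: predict the human rating of a (prompt, answer) pair
# from a fixed-size embedding of that pair. Embeddings and ratings are dummies.
class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pair_embedding):
        return self.net(pair_embedding).squeeze(-1)  # predicted rating

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Stand-ins for embedded (prompt, answer) pairs and the humans' ratings of them.
pair_embeddings = torch.randn(32, 256)
human_ratings = torch.rand(32)  # e.g. ratings normalized to [0, 1]

predicted = reward_model(pair_embeddings)
loss = nn.functional.mse_loss(predicted, human_ratings)
loss.backward()
optimizer.step()
```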
And you can backpropagate gradients through this to fine-tune your system so that it only produces highly rated answers. That's one way. In RL, that means training what's called a reward model. Basically, a small neural net that estimates to what extent an answer is good.
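A sketch of the fine-tuning step, reusing the `reward_model` from the previous block (kept frozen). Because answers are sampled discretely, gradients can't flow straight from the reward model into the generator, so this toy substitutes a REINFORCE-style policy-gradient update; real systems typically use PPO or similar. The "policy" here is just a distribution over a handful of pre-embedded candidate answers, which is an assumption made purely for illustration.

```python
import torch
import torch.nn as nn

# Fine-tuning sketch, reusing `reward_model` from the previous block (frozen).
# REINFORCE-style update: raise the log-prob of answers the reward model
# scores highly. The candidate embeddings are random stand-ins.
torch.manual_seed(0)

num_candidates, embed_dim = 8, 256
candidate_embeddings = torch.randn(num_candidates, embed_dim)  # stand-ins for embedded answers

policy_logits = nn.Parameter(torch.zeros(num_candidates))      # toy "policy" over candidates
policy_optimizer = torch.optim.Adam([policy_logits], lr=1e-2)

for _ in range(100):
    dist = torch.distributions.Categorical(logits=policy_logits)
    idx = dist.sample()
    with torch.no_grad():                       # the reward model is not updated here
        reward = reward_model(candidate_embeddings[idx])
    loss = -dist.log_prob(idx) * reward         # push up log-prob of highly rated answers
    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
```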
It's very similar to the objective I was talking about earlier for planning, except now it's not used for planning, it's used for fine-tuning your system. I think it would be much more efficient to use it for planning, but currently it's used to fine-tune the parameters of the system. Now, there are several ways to do this. Some of them are supervised.
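A rough illustration of the alternative he mentions: keep the learned objective fixed and use it at decision time to choose among candidate answers, rather than updating any parameters. This best-of-N selection is only a stand-in for the richer planning setup he describes elsewhere; `plan_with_objective` and the dummy embeddings are assumptions for illustration, and `reward_model` is the one defined above.

```python
import torch

# Using the learned objective at decision time instead of for fine-tuning:
# score several candidate answers with the frozen reward model and keep the
# best one. No parameters are updated.
def plan_with_objective(candidate_embeddings: torch.Tensor) -> int:
    with torch.no_grad():
        scores = reward_model(candidate_embeddings)  # one score per candidate
    return int(scores.argmax())

best_index = plan_with_objective(torch.randn(8, 256))  # 8 dummy candidate embeddings
```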
You just ask a human, what is a good answer for this? Then they just type the answer. I mean, there are lots of ways that those systems are being adjusted.
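A minimal sketch of that supervised flavor, with a toy stand-in for the language model: the human-written answer supplies the target tokens, and the model is fine-tuned with ordinary cross-entropy to reproduce them. The tiny `lm_head`, the vocabulary size, and the random "context" are placeholders, not anything from the conversation.

```python
import torch
import torch.nn as nn

# Supervised flavor: train the model to reproduce the answer a human typed,
# using plain cross-entropy. Real systems fine-tune a full LLM the same way.
vocab_size, context_dim = 1000, 256
lm_head = nn.Linear(context_dim, vocab_size)              # toy next-token predictor
sft_optimizer = torch.optim.Adam(lm_head.parameters(), lr=1e-4)

context = torch.randn(16, context_dim)                    # encoded prompt + answer so far
human_next_tokens = torch.randint(0, vocab_size, (16,))   # tokens of the human-written answer

logits = lm_head(context)
loss = nn.functional.cross_entropy(logits, human_next_tokens)
sft_optimizer.zero_grad()
loss.backward()
sft_optimizer.step()
```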
I actually made that comment on just about every social network I can, and I've made that point multiple times in various forums. Here's my point of view on this. People can complain that AI systems are biased, and they generally are biased by the distribution of the training data that they've been using.