Yann LeCun
Podcast Appearances
So a contrastive method is: you show an x and a bad y, and you tell the system, well, give a high energy to this, like, push up the energy, right? Change the weights in the neural net that computes the energy so that it goes up. So that's contrastive methods.
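As a rough illustration of that contrastive update, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example and not something from the conversation: the scalar energy `(y - w * x) ** 2`, the hinge-style loss with a `margin`, and the names `energy`, `energy_grad_w`, and `contrastive_step`.

```python
def energy(w, x, y):
    """Toy energy (assumed form): squared mismatch between y and w * x."""
    return (y - w * x) ** 2


def energy_grad_w(w, x, y):
    """Derivative of the toy energy with respect to the weight w."""
    return -2.0 * x * (y - w * x)


def contrastive_step(w, x, y_good, y_bad, margin=1.0, lr=0.05):
    """One gradient step on E(x, y_good) + max(0, margin - E(x, y_bad))."""
    # Push the energy of the compatible pair down.
    grad = energy_grad_w(w, x, y_good)
    # If the bad pair's energy is still below the margin, push it up.
    if energy(w, x, y_bad) < margin:
        grad -= energy_grad_w(w, x, y_bad)
    return w - lr * grad


w = 0.5
x, y_good, y_bad = 1.0, 2.0, 0.6

before = (energy(w, x, y_good), energy(w, x, y_bad))
for _ in range(50):
    w = contrastive_step(w, x, y_good, y_bad)
after = (energy(w, x, y_good), energy(w, x, y_bad))

print("good pair energy:", before[0], "->", after[0])
print("bad pair energy: ", before[1], "->", after[1])
```

With a single bad y this is cheap. The cost he describes next comes from needing many such bad samples when the space of y is large.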
The problem with this is that if the space of y is large, the number of such contrasting samples you're going to have to show is gigantic. But people do this. They do this when you train a system with RLHF. What you're training is what's called a reward model, which is basically an objective function that tells you whether an answer is good or bad. And that's basically exactly what
this is. So we already do this to some extent. We're just not using it for inference, we're just using it for training. There is another set of methods which are non-contrastive, and I prefer those. Those non-contrastive methods basically say, okay, the energy function needs to have low energy on pairs of x, y that are compatible, that come from your training set.
How do you make sure that the energy is going to be higher everywhere else? The way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. There are all kinds of specific ways to do this, depending on the architecture. But that's the basic principle.
So that if you push down the energy function for particular regions in the XY space, it will automatically go up in other places because there's only a limited volume of space that can take low energy by the construction of the system or by the regularizing function.
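One toy case where the low-energy volume is limited by construction is a K-means-style energy, sketched below. This is an illustrative choice and not a method named in the conversation: the energy of y is its squared distance to the nearest of a fixed number of prototypes, so only a few points can ever have zero energy. The names `energy` and `train_step`, the data, and the `probe` point are all assumptions for the example.

```python
def energy(prototypes, y):
    """Energy of y: squared distance to the nearest prototype."""
    return min((y - c) ** 2 for c in prototypes)


def train_step(prototypes, y, lr=0.2):
    """Lower the energy at y by moving its nearest prototype toward it."""
    k = min(range(len(prototypes)), key=lambda i: (y - prototypes[i]) ** 2)
    updated = list(prototypes)
    updated[k] += lr * (y - updated[k])
    return updated


def mean_energy(prototypes, points):
    return sum(energy(prototypes, y) for y in points) / len(points)


prototypes = [0.0, 1.0]                          # only two low-energy spots exist
train_data = [4.0, 4.2, 3.8, -3.0, -3.1, -2.9]   # compatible samples
probe = 0.5                                      # a point never seen in training

e_train_before = mean_energy(prototypes, train_data)
e_probe_before = energy(prototypes, probe)

# Training only ever pushes energy DOWN on training samples.
for _ in range(100):
    for y in train_data:
        prototypes = train_step(prototypes, y)

e_train_after = mean_energy(prototypes, train_data)
e_probe_after = energy(prototypes, probe)

print("training data energy:", e_train_before, "->", e_train_after)
print("probe energy:        ", e_probe_before, "->", e_probe_after)
```

No step here explicitly raises the energy anywhere. The energy at the probe goes up only because the two prototypes, the only places where energy can be low, have moved onto the training data.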
Yeah, so you can do this with language directly by just, you know, x is a text and y is a continuation of that text. Yes. Or x is a question, y is the answer.
Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside of the system there is a latent variable, let's call it z, that you can manipulate so as to minimize the output energy, then that z can be viewed as a representation of a good answer that you can translate into a y that is a good answer.
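The idea of manipulating z to minimize the energy and then translating it into y can be sketched as plain gradient descent over z with the weights held fixed. This is only a toy under stated assumptions: the quadratic `energy`, the linear `decode`, and the `infer` helper are hypothetical stand-ins, not the architecture being described.

```python
def energy(x, z):
    """Toy energy (assumed form): low when latent z is compatible with x."""
    return (z - 2.0 * x) ** 2


def energy_grad_z(x, z):
    """Derivative of the toy energy with respect to the latent z."""
    return 2.0 * (z - 2.0 * x)


def decode(z):
    """Hypothetical decoder that translates the latent z into an answer y."""
    return 3.0 * z + 1.0


def infer(x, z0=0.0, lr=0.1, steps=100):
    """Inference as optimization: descend the energy over z, not the weights."""
    z = z0
    for _ in range(steps):
        z -= lr * energy_grad_z(x, z)
    return z, decode(z)


x = 1.5
z_star, y_star = infer(x)

print("energy at initial z:", energy(x, 0.0))
print("energy at found z:  ", energy(x, z_star))
print("decoded answer y:   ", y_star)
```

Nothing is trained in this loop. The only thing that changes is z, which is what separates this kind of inference from the training-time use of the energy mentioned earlier.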
In a very similar way, but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. And currently it's very implicit in LLMs. It's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give a high probability to a word,