Yann LeCun
what this is. So we already do this to some extent. We're just not using it for inference, we're just using it for training. There is another set of methods which are non-contrastive, and I prefer those. And those non-contrastive methods basically say, okay, the energy function needs to have low energy on pairs of x, y's that are compatible, that come from your training set.
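The idea of pushing energy down on compatible pairs from the training set can be sketched in a few lines. This is a toy model of my own construction, not the actual setup being described: the energy function E(x, y) = (y - w*x)**2 and the training data are stand-ins for illustration.

```python
# Toy sketch: an energy function E(x, y) = (y - w*x)**2, trained by
# pushing energy DOWN on compatible (x, y) pairs from a training set.
def energy(w, x, y):
    return (y - w * x) ** 2

# Training pairs where y = 2x is the "compatible" relationship.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0
lr = 0.01
for _ in range(500):
    for x, y in pairs:
        # Gradient of the energy with respect to w: dE/dw = -2x(y - w*x)
        grad = -2 * x * (y - w * x)
        w -= lr * grad

print(round(w, 3))                    # converges near 2.0
print(round(energy(w, 1.0, 2.0), 4))  # low energy on a compatible pair
print(round(energy(w, 1.0, 5.0), 4))  # higher energy on an incompatible pair
```

Here the limited capacity of the model (a single parameter w) is what keeps energy from being low everywhere; the next part of the conversation addresses how to guarantee that more generally.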
How do you make sure that the energy is going to be higher everywhere else? And the way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. There are all kinds of specific ways to do this, depending on the architecture, but that's the basic principle.
So that if you push down the energy function for particular regions in the x, y space, it will automatically go up in other places, because there's only a limited volume of space that can take low energy, by the construction of the system or by the regularizing function.
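A minimal illustration of why such a regularizer is needed, using a construction of my own rather than anything from the conversation: take an energy E(y) = a*(y - m)**2. Pushing energy down on training points alone drives the curvature a toward 0, giving low energy everywhere (collapse). A regularizer like -log(a) keeps a positive, which limits the volume of y-space that can have low energy.

```python
import math

# Toy data and the quadratic energy E(y) = a * (y - m)**2.
data = [1.8, 2.0, 2.2]
m = sum(data) / len(data)
var = sum((y - m) ** 2 for y in data) / len(data)

def loss(a, reg):
    # push-down term (average energy on the data)
    # plus a volume-limiting regularizer, -reg * log(a)
    return a * var - reg * math.log(a)

# Without the regularizer (reg = 0), the loss keeps shrinking as
# a -> 0: the energy collapses to being low everywhere.
print(loss(1.0, 0) > loss(0.1, 0) > loss(0.01, 0))  # True

# With the regularizer, setting d(loss)/da = var - reg/a to zero
# gives a* = reg / var > 0: the curvature stays positive, so only a
# limited region around m gets low energy.
reg = 0.5
a_star = reg / var
print(a_star > 0)  # True
```

The specific -log(a) penalty is just one convenient choice; the point is only that some term in the cost must penalize making the low-energy region too large.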
Yeah, so you can do this with language directly by just, you know, x is a text and y is a continuation of that text. Yes. Or x is a question, y is the answer.
Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside of the system there is a latent variable, let's call it z, that you can manipulate so as to minimize the output energy, then that z can be viewed as a representation of a good answer that you can translate into a y that is a good answer.
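Inference as optimization over a latent variable can be sketched as follows. Both the energy E(x, z) = (z - 2*x)**2 and the decoder from z to y are hypothetical stand-ins I chose for illustration; the point is only that z is found by gradient descent at inference time, not learned during training.

```python
# Hypothetical sketch: find the answer by minimizing energy over a
# latent variable z, then decode z into the answer y.
def energy(x, z):
    return (z - 2 * x) ** 2

def decode(z):
    # maps the latent "representation of a good answer" to an answer y
    return z + 1.0

def infer(x, steps=100, lr=0.1):
    z = 0.0
    for _ in range(steps):
        grad = 2 * (z - 2 * x)  # dE/dz
        z -= lr * grad          # gradient descent ON z, not on weights
    return decode(z)

print(round(infer(3.0), 3))  # z converges to 6.0, so y = 7.0
```

Note the contrast with a feed-forward pass: here the system spends compute searching for a low-energy z at inference time, which is the "inference by optimization" idea discussed above.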
Very similar way, but you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. And currently it's very implicit in LLMs. It's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give a high probability to a word,
automatically you give low probability to other words, because you only have a finite amount of probability to go around: it has to sum to one.