Yann LeCun
Very similar way, but you have to have some way of preventing collapse, of ensuring that there is high energy for things you don't train it on. And currently it's very implicit in LLMs. It's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give a high probability to a word,
automatically you give low probability to other words, because you only have a finite amount of probability to go around; it has to sum to one.
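To make that normalization constraint concrete, here is a minimal sketch in PyTorch with a hypothetical five-word vocabulary (the logit values are purely illustrative): pushing one word's logit up necessarily pushes every other word's probability down, because softmax forces the distribution to sum to one.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits over a toy 5-word vocabulary (illustrative values).
logits = torch.tensor([1.0, 0.5, 0.2, -0.3, -1.0])
probs = F.softmax(logits, dim=0)
print(probs, probs.sum())  # the probabilities sum to 1

# Raise the logit of word 0: its probability goes up, and every other
# word's probability must go down, since the total mass is fixed at 1.
logits[0] += 2.0
print(F.softmax(logits, dim=0))
```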
So when you minimize the cross-entropy, or whatever, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words.
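The same point shows up in the gradient itself. A small sketch with the same toy logits: for softmax cross-entropy, the gradient with respect to the logits is the predicted distribution minus the one-hot target, so a single training step raises the correct word's logit and lowers all the others.

```python
import torch
import torch.nn.functional as F

# Toy logits for a 5-word vocabulary; word 2 is the "correct" next word.
logits = torch.tensor([1.0, 0.5, 0.2, -0.3, -1.0], requires_grad=True)
target = torch.tensor([2])

loss = F.cross_entropy(logits.unsqueeze(0), target)
loss.backward()

# The gradient equals softmax(logits) - one_hot(target): it is negative
# at the correct word (that logit is pushed up by gradient descent) and
# positive at every incorrect word (those logits are pushed down).
print(logits.grad)
```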
Now, indirectly, that gives a high probability to sequences of words that are good and low probability to sequences of words that are bad, but it's very indirect. It's not obvious why this actually works at all, because you're not doing it on a joint probability of all the symbols in a sequence.
You're just doing it on a factorization of that probability in terms of conditional probabilities over successive tokens.
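For reference, this is the standard chain-rule factorization that next-token training relies on: the model is never scored on the joint probability of the whole sequence directly, only on one conditional per token, and the loss is the per-token cross-entropy summed over the sequence.

```latex
\[
p(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} p_\theta(w_t \mid w_1, \dots, w_{t-1})
\]
\[
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_1, \dots, w_{t-1})
\]
```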
So we've been doing this with JEPA architectures, basically. Joint Embedding Predictive Architectures. JEPA. So there, the compatibility between two things is: here's an image or a video, and here's a corrupted, shifted, transformed, or masked version of that image or video. And then the energy of the system is the prediction error between
the predicted representation of the good thing and the actual representation of the good thing. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. So this system will tell you this is a good representation
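A minimal sketch of that energy computation, assuming stand-in encoder and predictor networks and a crude random-masking corruption (none of these are the actual I-JEPA or V-JEPA modules): encode the clean input, encode the corrupted input, predict the clean representation from the corrupted one, and take the prediction error in representation space as the energy. Note that training this naively would collapse the representations; the real systems add an anti-collapse mechanism (e.g. a stop-gradient with an EMA target encoder), which is omitted here.

```python
import torch
import torch.nn as nn

def mlp(dim_in, dim_out, hidden=256):
    # Small stand-in network; the real JEPA encoders are vision transformers.
    return nn.Sequential(nn.Linear(dim_in, hidden), nn.ReLU(),
                         nn.Linear(hidden, dim_out))

encoder = mlp(784, 128)    # maps an input to its representation
predictor = mlp(128, 128)  # predicts the clean representation from the corrupted one

x = torch.randn(8, 784)                    # the "good", uncorrupted inputs
mask = (torch.rand_like(x) > 0.5).float()
x_corrupted = x * mask                     # crude masking as the corruption

z_target = encoder(x).detach()             # representation of the good input
                                           # (stop-gradient on the target side)
z_pred = predictor(encoder(x_corrupted))   # predicted representation

# The energy: prediction error in representation space.
# Low energy = compatible pair; high energy = incompatible pair.
energy = ((z_pred - z_target) ** 2).mean(dim=1)
print(energy)
```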