Yann LeCun
Automatically, you give low probability to the other words, because you only have a finite amount of probability to go around; it has to sum to one. So when you minimize the cross-entropy, or whatever, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Now,
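A minimal sketch of that mechanism, using a made-up four-word vocabulary and raw scores: because softmax forces the probabilities to sum to one, pushing the correct word's score up necessarily pushes every other word's probability down.

```python
import numpy as np

def softmax(logits):
    # Exponentiate and normalize so the probabilities sum to one.
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical 4-word vocabulary and raw model scores (logits).
vocab = ["cat", "sat", "mat", "dog"]
logits = np.array([1.0, 0.5, 0.2, 0.1])
probs = softmax(logits)

target = 1  # suppose the correct next word is "sat"
loss = -np.log(probs[target])  # cross-entropy for this one prediction

# Nudge the correct word's logit up, as a gradient step on the loss would:
logits[target] += 0.5
new_probs = softmax(logits)

# The correct word's probability rises...
assert new_probs[target] > probs[target]
# ...and every other word's probability necessarily falls,
# because the distribution must still sum to one.
assert all(new_probs[i] < probs[i] for i in range(4) if i != target)
```

The key point is the normalization: probability taken from nowhere must come out of the other words.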
Indirectly, that gives a high probability to sequences of words that are good and low probability to sequences of words that are bad, but it's very indirect. It's not obvious why this actually works at all, because you're not doing it on a joint probability of all the symbols in a sequence.
You're just factorizing that probability in terms of conditional probabilities over successive tokens.
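That factorization can be written out concretely. Here's a toy illustration (not a real LLM; the conditional distributions are made up): the joint probability of a sequence is recovered as the product of the per-step conditionals, which is exactly what next-token training fits.

```python
# Made-up conditional distributions P(next word | prefix), for illustration only.
cond = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
}

def joint_prob(seq):
    # P(w1, ..., wn) = product over t of P(w_t | w_1 .. w_{t-1})
    p = 1.0
    for t, word in enumerate(seq):
        p *= cond[tuple(seq[:t])][word]
    return p

print(joint_prob(["the", "cat", "sat"]))  # 0.6 * 0.5 * 0.7 ≈ 0.21
```

Each training step only ever touches one of these conditionals; the effect on the joint probability of whole sequences is indirect, which is the point being made above.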
So we've been doing this with JEPA architectures, basically. The joint embedding, JEPA. So there, the compatibility between two things is: here's an image or a video, and here's a corrupted, shifted, transformed, or masked version of that image or video. And then the energy of the system is the prediction error of the representation,
the predicted representation of the good thing versus the actual representation of the good thing. So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. So this system will tell you this is a good representation.
If this is a good image and this is a corrupted version, it will give you zero energy if one of them is effectively a corrupted version of the other, and high energy if the two images are completely different.
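The structure described above can be sketched in a few lines. This is not the real JEPA training code; the encoder and predictor here are hypothetical random linear maps, just to show where the energy (the prediction error in representation space) sits in the architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical networks: an encoder mapping a 16-dim "image" to an
# 8-dim representation, and a predictor acting in representation space.
encoder = rng.standard_normal((8, 16))
predictor = rng.standard_normal((8, 8))

def encode(x):
    return encoder @ x

def energy(x_clean, x_corrupted):
    # Predicted representation of the clean input, computed from the
    # corrupted input, versus the actual representation of the clean input.
    s_clean = encode(x_clean)
    s_pred = predictor @ encode(x_corrupted)
    return float(np.sum((s_pred - s_clean) ** 2))  # prediction error = energy

x = rng.standard_normal(16)            # the "good" image
x_masked = x * (rng.random(16) > 0.3)  # a crudely masked version of it

print(energy(x, x_masked))  # trained well, compatible pairs get low energy
```

Training would adjust the encoder and predictor so that compatible pairs (an image and its corrupted version) get low energy, with some mechanism to keep the energy of incompatible pairs high.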
And we know it does because then we use those representations as input to a classification system.