Yann LeCun
And then try to reconstruct the complete video or image from the corrupted version, and then hope that internally the system will develop good representations of images that you can use for object recognition, segmentation, whatever it is. That has been essentially a complete failure for images. But it works really well for text; that's the principle that is used for LLMs, right?
Okay, so the reason this doesn't work is, first of all, I have to tell you exactly what doesn't work because there is something else that does work. So the thing that does not work is training the system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. That's what doesn't work.
And we have a whole slew of techniques for this that are variants of denoising autoencoders. Something called MAE, masked autoencoder, developed by some of my colleagues at FAIR. So it's basically like, you know, LLMs or things like this, where you train the system by corrupting text, except you corrupt images: you remove patches from them, and you train a gigantic neural net to reconstruct them.
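The masked-autoencoder recipe described here can be sketched shape-wise in a few lines. This is a toy numpy illustration, not the real MAE code: the encoder and decoder are elided and the "reconstruction" is just a placeholder, but the patch masking and the masked-patch loss mirror the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an image split into patches (a real MAE uses ViT patch
# embeddings; 16 patches of 8 features each is just for illustration).
patches = rng.normal(size=(16, 8))

# Mask a large fraction of the patches, as masked autoencoders do.
mask_ratio = 0.75
n_masked = int(len(patches) * mask_ratio)
masked_idx = rng.choice(len(patches), size=n_masked, replace=False)

corrupted = patches.copy()
corrupted[masked_idx] = 0.0   # stand-in for a learned mask token

# Placeholder "reconstruction": a real MAE would run encoder + decoder here.
reconstruction = corrupted

# MAE computes its reconstruction loss only on the masked patches;
# training pushes this toward zero.
loss = float(np.mean((reconstruction[masked_idx] - patches[masked_idx]) ** 2))
```

The high mask ratio is the distinctive choice here: with most patches hidden, trivial interpolation cannot solve the task, so the network is forced to model image content.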
The features you get are not good. And you know they're not good because if you now train the same architecture, but you train it supervised, with labeled data, with textual descriptions of images, et cetera, you do get good representations, and the performance on recognition tasks is much better than if you do this self-supervised pre-training. So the architecture is good.
The architecture of the encoder is good. But the fact that you train the system to reconstruct images does not lead it to learn good, generic features of images.
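A standard way to make the comparison described above concrete is a linear probe: freeze each encoder, train only a linear classifier on its features, and compare recognition accuracy. A toy numpy sketch on synthetic features (all names and data here are hypothetical, just to illustrate the protocol):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear-probe evaluation: features are frozen; only a linear softmax
# classifier is trained on top, by full-batch gradient descent.
def linear_probe_accuracy(features, labels, steps=200, lr=0.5):
    n, d = features.shape
    W = np.zeros((d, 2))                      # weights for a 2-class probe
    onehot = np.eye(2)[labels]
    for _ in range(steps):
        logits = features @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * features.T @ (p - onehot) / n   # softmax-regression gradient
    return np.mean((features @ W).argmax(axis=1) == labels)

labels = rng.integers(0, 2, size=200)

# "Good" features carry class information; "bad" features are pure noise,
# standing in for representations that didn't capture the right content.
good = rng.normal(size=(200, 8))
good[:, 0] += 3.0 * (labels - 0.5)
bad = rng.normal(size=(200, 8))

acc_good = linear_probe_accuracy(good, labels)
acc_bad = linear_probe_accuracy(bad, labels)
```

Because the probe is linear, its accuracy directly reflects how linearly separable the classes are in feature space, which is the quality the pre-training is supposed to deliver.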
Self-supervised by reconstruction. Yeah, by reconstruction. Okay, so what's the alternative? The alternative is joint embedding. What is joint embedding?
Okay, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image and the corrupted or transformed version, and you run them both through encoders, which in general are identical, but not necessarily.
And then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one. It's called joint embedding because you're taking the full input and the corrupted or transformed version, running them both through encoders, and so you get a joint embedding.
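A minimal numpy sketch of this joint-embedding setup, with a toy one-layer encoder and a linear predictor (all names hypothetical; a real system uses deep networks and an extra mechanism to prevent representational collapse):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    # Toy one-layer encoder; in practice both branches are deep nets,
    # usually identical (weight-shared), though not necessarily.
    return np.tanh(x @ W)

x_full = rng.normal(size=(16,))      # full input
x_corrupt = x_full.copy()
x_corrupt[8:] = 0.0                  # corrupted / masked version

W = rng.normal(size=(16, 4))         # shared encoder weights
z_full = encoder(x_full, W)          # embedding of the full input
z_corrupt = encoder(x_corrupt, W)    # embedding of the corrupted input

P = rng.normal(size=(4, 4))          # predictor on top of the corrupted branch
z_pred = z_corrupt @ P

# Training minimizes prediction error in representation space, not pixel
# space: the system never has to reconstruct unpredictable pixel detail.
loss = float(np.mean((z_pred - z_full) ** 2))
```

The contrast with the reconstruction recipe is in the last line: the loss compares embeddings, so the encoder is free to discard low-level detail and keep only what is predictable.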