Yann LeCun
Podcast Appearances
And so self-supervised learning will not work as well.
Well, eventually, yes. But I think if we do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment with vision-language models. We're basically cheating. We're using language as a crutch to compensate for the deficiencies of our vision systems, to help them learn good representations from images and video. And the problem with this is that we might improve our vision-language systems a bit, I mean, our language models, by feeding them images, but we're not going to get to the level of intelligence or understanding of the world of even a cat or a dog, which doesn't have language. They don't have language, and they understand the world much better than any LLM.
They can plan really complex actions and sort of imagine the result of a bunch of actions. How do we get machines to learn that before we combine it with language? Obviously, if we combine this with language, this is going to be a winner. But before that, we have to focus on how we get systems to learn how the world works.
That's the hope. In fact, the techniques we're using are non-contrastive. So not only is the architecture non-generative, the learning procedures we're using are also non-contrastive. We have two sets of techniques. One set is based on distillation, and there are a number of methods that use this principle: one by DeepMind called BYOL, and a couple by FAIR, one called VICReg and another one called I-JEPA.
And VICReg, I should say, is not actually a distillation method, but I-JEPA and BYOL certainly are. And there's another one called DINO, also produced at FAIR. The idea of these methods is that you take the full input, let's say an image, and you run it through an encoder, which produces a representation.
Then you corrupt that input or transform it, and run it through essentially the same encoder, with some minor differences. Then you train a predictor, which is sometimes very simple and sometimes doesn't exist at all, to predict the representation of the first, uncorrupted input from the corrupted input. But you only train the second branch.
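To make that two-branch recipe concrete, here is a minimal sketch of a BYOL-style distillation setup, assuming PyTorch. The toy MLP encoder, the additive-noise corruption, the plain MSE loss, and the EMA momentum are all illustrative stand-ins, not the published configurations (BYOL uses augmented views with a normalized loss; I-JEPA masks image patches and predicts in representation space).

```python
# Minimal sketch of the two-branch distillation idea described above.
# All sizes, the corruption, the loss, and the EMA momentum are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy stand-in for the image encoder (real systems use a ViT or CNN)."""
    def __init__(self, dim_in=784, dim_out=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_out)
        )

    def forward(self, x):
        return self.net(x)

student = Encoder()                # second branch: sees the corrupted input
teacher = copy.deepcopy(student)   # first branch: same encoder, "minor differences"
for p in teacher.parameters():
    p.requires_grad = False        # you only train the second branch

# The (sometimes very simple) predictor on top of the student branch.
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
opt = torch.optim.Adam(
    list(student.parameters()) + list(predictor.parameters()), lr=1e-3
)

def corrupt(x):
    # Placeholder corruption; real methods mask patches or augment views.
    return x + 0.1 * torch.randn_like(x)

for step in range(1000):
    x = torch.randn(32, 784)       # stand-in for a batch of images

    with torch.no_grad():          # teacher encodes the full, uncorrupted input
        target = teacher(x)

    # Predict the representation of the uncorrupted input from the corrupted one.
    pred = predictor(student(corrupt(x)))
    loss = F.mse_loss(pred, target)

    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA update: the teacher's weights slowly trail the student's.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(0.99).add_(0.01 * ps)
```

The two collapse-avoiding details from the description above are visible here: the first branch receives no gradient, and its weights only follow the trained branch through the slow EMA update, so the representations never need contrastive negative pairs.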