Yann LeCun
Podcast Appearances
We have multiple levels of abstraction to describe what happens in the world, starting from quantum field theory to atomic theory and molecules and chemistry, materials, and all the way up to concrete objects in the real world and things like that. We can't just model everything at the lowest level. That's what the idea of JEPA is really about.
Learn abstract representations in a self-supervised manner. You can do it hierarchically as well. That, I think, is an essential component of an intelligent system. In language, we can get away without doing this because language is already, to some level, abstract, and has already eliminated a lot of information that is not predictable.
So we can get away without doing the joint embedding, without lifting the abstraction level, and by directly predicting words.
Right. And the thing is, those self-supervised algorithms that learn by prediction, even in representation space, they learn more concepts if the input data you feed them is more redundant. The more redundancy there is in the data, the more they're able to capture some internal structure of it.
And so there is way more redundancy and structure in perceptual inputs, sensory input, like vision, than there is in text, which is not nearly as redundant. This is back to the question you were asking a few minutes ago. Language might represent more information, really, because it's already compressed. You're right about that, but that means it's also less redundant.
And so self-supervised learning will not work as well.
Well, eventually, yes. But I think if we do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment with vision language models. We're basically cheating. We're using language as a crutch to help the deficiencies of our vision systems to kind of learn good representations from images and video. And the problem with this is that