Yann LeCun
Well, so it's a first step. So first of all, what's the difference with generative architectures like LLMs? So LLMs or vision systems that are trained by reconstruction generate the inputs. They generate the original input that is non-corrupted, non-transformed. So you have to predict all the pixels.
And there is a huge amount of resources spent in the system to actually predict all those pixels, all the details. In a JEPA, you're not trying to predict all the pixels. You're only trying to predict an abstract representation of the inputs, right? And that's much easier in many ways.
So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but yet only extract information that is relatively easily predictable. Okay. So there's a lot of things in the world that we cannot predict.
For example, if you have a self-driving car driving down the street or road, there may be trees around the road and it could be a windy day. So the leaves on the tree are kind of moving in kind of semi-chaotic random ways that you can't predict and you don't care. You don't want to predict. So what you want is your encoder to basically eliminate all those details.
It will tell you there's moving leaves, but it's not going to keep the details of exactly what's going on. And so when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf.
And that not only is a lot simpler, but also it allows the system to essentially learn an abstract representation of the world where what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation. If you think about this, this is something we do absolutely all the time.
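The contrast described above can be sketched in a toy numerical example. Everything here is hypothetical: the encoder and predictor are just fixed random linear maps, and a real JEPA trains both jointly (with extra machinery to stop the representation from collapsing to a constant). The point is only to show where the objective lives: the pixel-space objective has one target per pixel, while the JEPA-style objective has one target per abstract feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: two consecutive 64x64 "video frames" (4096 pixels each).
frame_t = rng.random((64, 64))
frame_t1 = rng.random((64, 64))

def encoder(frame, W):
    # Toy linear encoder: maps 4096 pixels down to a 16-dim abstract
    # representation, discarding unpredictable detail (e.g. leaf motion).
    return W @ frame.reshape(-1)

def predictor(z, P):
    # Predicts the NEXT frame's representation from the current one.
    return P @ z

W = rng.normal(size=(16, 64 * 64)) / 64.0  # hypothetical encoder weights
P = rng.normal(size=(16, 16)) / 4.0        # hypothetical predictor weights

z_t = encoder(frame_t, W)    # representation of the current frame
z_t1 = encoder(frame_t1, W)  # target: representation of the next frame

# Generative (reconstruction) objective: predict all 4096 pixels.
pixel_loss = np.mean((frame_t1 - frame_t) ** 2)

# JEPA-style objective: predict only the 16 abstract features.
latent_loss = np.mean((predictor(z_t, P) - z_t1) ** 2)

print(frame_t1.size, z_t1.size)  # 4096 targets vs. 16 targets
```

With trained weights, anything the encoder cannot help predict (the "moving leaves") simply never appears in the 16 features, so the predictor is never penalized for failing to model it.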
Whenever we describe a phenomenon, we describe it at a particular level of abstraction. We don't always describe every natural phenomenon in terms of quantum field theory. That would be impossible.