Yann LeCun
Podcast Appearances
You train them with images that are, you know, different versions or different views of the same thing. And you rely on some other tweaks to prevent the system from collapsing. And we have half a dozen different methods for this now.
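The "collapse" mentioned here can be illustrated with a small sketch. This is not one of the specific methods the speaker refers to; it assumes a variance-style penalty (in the spirit of variance-regularized joint-embedding methods) as a stand-in, and the function names, the `target_std` value, and the toy embeddings are all hypothetical. It shows that an encoder which outputs the same vector for every image trivially satisfies the "two views should match" objective, and that an extra term is needed to detect this.

```python
import statistics

def invariance_loss(view_a, view_b):
    """Mean squared distance between embeddings of two views of the same images."""
    total = 0.0
    for za, zb in zip(view_a, view_b):
        total += sum((a - b) ** 2 for a, b in zip(za, zb)) / len(za)
    return total / len(view_a)

def variance_penalty(batch, target_std=1.0):
    """Hinge penalty that grows when an embedding dimension stops varying across the batch.

    Assumption: target_std=1.0 is an arbitrary illustrative value.
    """
    n_dims = len(batch[0])
    penalties = []
    for d in range(n_dims):
        std = statistics.pstdev(z[d] for z in batch)
        penalties.append(max(0.0, target_std - std))
    return sum(penalties) / n_dims

# A collapsed encoder ignores its input and always emits the same embedding.
collapsed = [[0.5, 0.5] for _ in range(4)]

# A healthy encoder produces embeddings that differ from image to image.
healthy = [[-1.5, 1.0], [0.5, -1.2], [1.2, 0.3], [-0.2, -0.1]]

# The matching objective alone cannot tell that the collapsed encoder is useless.
collapsed_match = invariance_loss(collapsed, collapsed)

# The variance term does notice the difference.
collapsed_penalty = variance_penalty(collapsed)
healthy_penalty = variance_penalty(healthy)
```

In this toy setup the collapsed encoder gets a perfect matching loss, which is why the matching objective has to be combined with some additional mechanism.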
Well, so it's a first step. So first of all, what's the difference with generative architectures like LLMs? So LLMs or vision systems that are trained by reconstruction generate the inputs. They generate the original input that is non-corrupted, non-transformed. So you have to predict all the pixels.
And there is a huge amount of resources spent in the system to actually predict all those pixels, all the details. In a JEPA, you're not trying to predict all the pixels. You're only trying to predict an abstract representation of the inputs, right? And that's much easier in many ways.
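The contrast between predicting pixels and predicting a representation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an actual JEPA implementation: the linear `encode` function, the random `weights`, the sizes (64 pixels, 4 features), and the identity "predictors" are all hypothetical stand-ins for learned networks.

```python
import random

def encode(pixels, weights):
    """Hypothetical linear encoder: maps a flat pixel list to a short representation."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
n_pixels, n_features = 64, 4
weights = [[random.gauss(0, 1) / n_pixels for _ in range(n_pixels)]
           for _ in range(n_features)]

x = [random.random() for _ in range(n_pixels)]   # context view of the input
y = [p + random.gauss(0, 0.05) for p in x]       # target view with small differences

# Reconstruction-style objective: the prediction must match every pixel of y.
pixel_prediction = x                              # stand-in for a decoder output
reconstruction_loss = mse(pixel_prediction, y)

# JEPA-style objective: predict the representation of y, not y itself.
s_x = encode(x, weights)
s_y = encode(y, weights)
predicted_s_y = s_x                               # stand-in for a learned predictor
representation_loss = mse(predicted_s_y, s_y)

# Number of values the system has to get right per example in each setting.
targets_per_example = {"pixel_space": len(y), "representation_space": len(s_y)}
```

The point of the sketch is only the shape of the two losses: one is computed over all 64 pixel values, the other over 4 representation values.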
So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but yet only extract information that is relatively easily predictable. Okay. So there's a lot of things in the world that we cannot predict.
For example, if you have a self-driving car driving down the street or road, there may be trees around the road and it could be a windy day. So the leaves on the tree are kind of moving in kind of semi-chaotic random ways that you can't predict and you don't care. You don't want to predict. So what you want is your encoder to basically eliminate all those details.
It will tell you there's moving leaves, but it's not going to keep the details of exactly what's going on. And so when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf.
And that not only is a lot simpler, but also it allows the system to essentially learn an abstract representation of the world where what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation. If you think about this, this is something we do absolutely all the time.
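The windy-day example can be turned into a toy simulation. Everything here is an assumption made for illustration: a "frame" is a list holding one predictable value (a car position moving at constant speed) plus 50 unpredictable leaf values, and `encode` is a hand-written stand-in for a learned encoder that keeps the car position and only a coarse summary of leaf motion.

```python
import random

random.seed(1)

def make_frame(t, n_leaves=50):
    """Toy frame: a predictable car position followed by unpredictable leaf flutter."""
    car = 2.0 * t
    leaves = [random.uniform(-1, 1) for _ in range(n_leaves)]
    return [car] + leaves

def encode(frame):
    """Hand-written stand-in for an encoder: keep the car, summarize the leaves."""
    car, leaves = frame[0], frame[1:]
    leaf_activity = sum(abs(v) for v in leaves) / len(leaves)  # "there are moving leaves"
    return [car, leaf_activity]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

frame_now, frame_next = make_frame(0), make_frame(1)

# Prediction in input space: advance the car, guess each leaf stays where it was.
raw_prediction = [frame_now[0] + 2.0] + frame_now[1:]
raw_error = mse(raw_prediction, frame_next)

# Prediction in representation space: advance the car, keep the leaf summary.
s_now = encode(frame_now)
predicted_s_next = [s_now[0] + 2.0, s_now[1]]
rep_error = mse(predicted_s_next, encode(frame_next))
```

In this setup the car is predicted perfectly in both cases; the input-space error comes entirely from the individual leaves, while the representation-space error stays small because the encoder has discarded the per-leaf detail.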