Yann LeCun
Podcast Appearances
And what we mask is actually a kind of temporal tube: the same segment of each frame, over the entire video. Mm-hmm.
Throughout the tube, yeah. Typically it's 16 frames or something, and we mask the same region over the entire 16 frames. It's a different one for every video, obviously. And then again, we train that system to predict the representation of the full video from the partially masked video. That works really well.
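To make the "temporal tube" concrete, here is a minimal sketch in plain Python. It assumes the video is split into a grid of patches and that the masked region is a single rectangle; the grid size (14 by 14), the block size (6 by 6), and the name `make_tube_mask` are illustrative assumptions, not details from the conversation. Only the 16-frame length and the idea of reusing one region across all frames come from the description above.

```python
import random

def make_tube_mask(num_frames, height, width, block_h, block_w, rng):
    """Return mask[t][y][x], True where a patch is hidden.

    One rectangle is sampled per video and reused for every frame,
    so the hidden region forms a tube through time.
    """
    # Sample the rectangle's top-left corner once for the whole clip.
    top = rng.randrange(height - block_h + 1)
    left = rng.randrange(width - block_w + 1)
    frame_mask = [
        [top <= y < top + block_h and left <= x < left + block_w
         for x in range(width)]
        for y in range(height)
    ]
    # Copy the same spatial mask into every frame.
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]

rng = random.Random(0)  # reusing one generator gives each video its own region
mask = make_tube_mask(num_frames=16, height=14, width=14,
                      block_h=6, block_w=6, rng=rng)
```

Calling `make_tube_mask` again with the same `rng` draws a new rectangle, which matches the point that the masked region differs from one video to the next.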
It's the first system that we have that learns good representations of video so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in the video with pretty good accuracy. So that's the first time we get something of that quality.
Yeah. We also have preliminary results that seem to indicate that the representation allows our system to tell whether the video is physically possible or completely impossible because some object disappeared, or an object suddenly jumped from one location to another, or changed shape or something.
Possibly. This is going to take a while before we get to that point, but there are robotic systems that are based on this idea. And what you need for this is a slightly modified version of it. Imagine that you have a complete video.
And what you're doing to this video is that you're either translating it in time towards the future, so you only see the beginning of the video, but you don't see the latter part of it that is in the original one. Or you just mask the second half of the video, for example. And then you train a JEPA system of the type I described to predict the representation of the full video from the shifted one.
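The second variant described here, masking the latter half of the clip, can be sketched in a few lines. This is only an illustration: the function name `temporal_context`, the `visible_fraction` parameter, and the string stand-ins for frames are assumptions made for the example.

```python
def temporal_context(frames, visible_fraction=0.5):
    """Keep only the leading part of a clip; the rest is treated as masked."""
    if not 0.0 < visible_fraction <= 1.0:
        raise ValueError("visible_fraction must be in (0, 1]")
    cut = max(1, int(len(frames) * visible_fraction))
    return frames[:cut]

clip = [f"frame_{t}" for t in range(16)]  # stand-in for real frames
context = temporal_context(clip)          # what the predictor side gets to see
target = clip                             # the full video whose representation is predicted
```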
But you also feed the predictor with an action. For example, the wheel is turned 10 degrees to the right or something. So if it's a dash cam in a car and you know the angle of the wheel, you should be able to predict to some extent what's going to happen to what you see.
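As a rough sketch of what "feeding the predictor with an action" could look like, the toy code below computes a prediction error in representation space. Everything in it is a simplifying assumption: the encoder is a plain average of per-frame feature vectors, the predictor is a single linear map, the same encoder is used for both the partial and the full clip, and the action is one number such as a steering angle in degrees. It illustrates the shape of the objective only, not the actual system discussed here.

```python
def encode(frames):
    """Toy encoder (assumption): average the per-frame feature vectors."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def predict(context_repr, action, weights):
    """Toy linear predictor over [context representation, action]."""
    inputs = context_repr + [action]
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def jepa_loss(frames, action, weights, visible=0.5):
    """Squared error between predicted and actual full-clip representation."""
    cut = max(1, int(len(frames) * visible))
    context_repr = encode(frames[:cut])  # only the beginning of the clip
    target_repr = encode(frames)         # the full clip
    predicted = predict(context_repr, action, weights)
    return sum((p - t) ** 2 for p, t in zip(predicted, target_repr))

# Hypothetical data: four frames with 2-D features, wheel turned 10 degrees.
frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0]]
ignore_action = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
use_action = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.05]]

loss_without = jepa_loss(frames, 10.0, ignore_action)
loss_with = jepa_loss(frames, 10.0, use_action)
```

In this toy setup the predictor that uses the action reaches a lower error than the one that ignores it, which is the intuition behind the dash cam example: knowing the wheel angle narrows down what the rest of the video will look like. The loss is measured on representations, not on pixels, so the details of individual objects never have to be predicted.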
You're not going to be able to predict all the details of objects that appear in the view, obviously, but at an abstract representation level, you can probably predict what's going to happen. So now what you have is...