Yann LeCun
Podcast Appearances
We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful. And there lies the main issue. And the reason we cannot do this is because the world is incredibly more complicated and richer in terms of information than text. Text is discrete. Video is high-dimensional and continuous. A lot of details in this.
So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict this is a room where there's a light and there is a wall and things like that.
It can't predict what the painting on the wall looks like or what the texture of the couch looks like. Certainly not the texture of the carpet. So there's no way it can predict all those details. So the way to handle this
is one way possibly to handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable, and the latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and that you need to augment the system for the prediction to do a good job at predicting pixels, including the fine texture of
the carpet and the couch, and the painting on the wall. That has been a complete failure, essentially. And we've tried lots of things. We tried just straight neural nets, we tried GANs, we tried VAEs, all kinds of regularized autoencoders, we tried many things.
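As a rough illustration of the latent-variable idea described here, the toy sketch below feeds a predictor both the observed context and a latent vector standing in for everything not yet perceived. The function `predict_next_frame`, the way it combines its inputs, and all the numbers are assumptions made up for this example; it is not one of the models that were actually tried.

```python
import random

def predict_next_frame(context, latent):
    """Hypothetical predictor: coarse structure from the context, fine detail from the latent."""
    coarse = sum(context) / len(context)   # what can be inferred from what was seen
    return [coarse + z for z in latent]    # the latent supplies the unseen detail

rng = random.Random(0)
context = [0.2, 0.4, 0.6]                  # toy stand-in for the frames observed so far

# Two different latent samples give two different, equally plausible predictions.
z_a = [rng.gauss(0, 0.1) for _ in range(4)]
z_b = [rng.gauss(0, 0.1) for _ in range(4)]
frame_a = predict_next_frame(context, z_a)
frame_b = predict_next_frame(context, z_b)
```

The same context yields different frames depending on the latent, which is the role the latent variable is meant to play: it carries the detail that the context alone cannot determine.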
We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system. And that also has basically failed. All the systems that attempt to predict missing parts of an image or video from a corrupted version of it, basically. So I take an image or a video, corrupt it or transform it in some way,
and then try to reconstruct the complete video or image from the corrupted version. And then hope that internally the system will develop good representations of images that you can use for object recognition, segmentation, whatever it is. That has been essentially a complete failure. And it works really well for text. That's the principle that is used for LLMs, right?
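The corrupt-then-reconstruct recipe described above can be sketched in a few lines. This is a minimal toy under stated assumptions: the "image" is a list of 16 random continuous values, and `mean_fill_reconstruct` is a hypothetical stand-in for a trained network, not a real model. All function names here are invented for illustration.

```python
import random

def corrupt(signal, mask_fraction, rng):
    """Hide a random subset of entries; return the corrupted copy and the hidden indices."""
    n_hidden = max(1, int(len(signal) * mask_fraction))
    hidden = sorted(rng.sample(range(len(signal)), n_hidden))
    corrupted = [None if i in hidden else v for i, v in enumerate(signal)]
    return corrupted, hidden

def mean_fill_reconstruct(corrupted):
    """Hypothetical stand-in for a trained network: fill gaps with the mean of visible values."""
    visible = [v for v in corrupted if v is not None]
    fill = sum(visible) / len(visible)
    return [fill if v is None else v for v in corrupted]

def reconstruction_loss(original, reconstructed, hidden):
    """Mean squared error measured only on the entries that were hidden."""
    return sum((original[i] - reconstructed[i]) ** 2 for i in hidden) / len(hidden)

rng = random.Random(0)
pixels = [rng.random() for _ in range(16)]   # toy "image": 16 continuous values
corrupted, hidden = corrupt(pixels, mask_fraction=0.25, rng=rng)
reconstructed = mean_fill_reconstruct(corrupted)
loss = reconstruction_loss(pixels, reconstructed, hidden)
```

In a real system the reconstruction function would be a network trained to minimize this loss. The sketch only shows the shape of the objective: with continuous values the filler has to commit to one number per hidden entry, whereas for text a model can output a distribution over a finite set of tokens.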