Sergey Levine
So to give you a bit of a fanciful brain analogy: a VLM, a vision-language model, is basically an LLM that has had a little pseudo visual cortex grafted onto it, a vision encoder, right?
So our models, they have a vision encoder, but they also have an action expert, an action decoder essentially.
So it has like a little visual cortex and notionally a little motor cortex.
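As a rough illustration of that structure, here is a minimal sketch in PyTorch. The module names, sizes, and the simple regression head standing in for the action expert are illustrative assumptions, not the actual model.

```python
# Minimal sketch of the architecture described above (illustrative, not the real implementation).
import torch
import torch.nn as nn

class VLASketch(nn.Module):
    def __init__(self, d_model=1024, action_dim=14, horizon=50):
        super().__init__()
        # "pseudo visual cortex": maps camera images to token embeddings
        self.vision_encoder = nn.Conv2d(3, 64, kernel_size=16, stride=16)  # crude patchifier
        self.vision_proj = nn.Linear(64, d_model)
        # language backbone that does the intermediate reasoning over vision + text tokens
        self.language_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # "motor cortex": a separate set of weights (the action expert) that reads the
        # backbone's features and emits a chunk of continuous actions; simplified here
        # as a direct regression head (the real model uses flow matching, see below)
        self.action_expert = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, image, text_embeddings):
        # image: (B, 3, H, W); text_embeddings: (B, T, d_model)
        vis = self.vision_encoder(image).flatten(2).transpose(1, 2)  # (B, N, 64)
        tokens = torch.cat([self.vision_proj(vis), text_embeddings], dim=1)
        features = self.language_backbone(tokens)
        # pool and decode into a (horizon, action_dim) continuous action chunk
        actions = self.action_expert(features.mean(dim=1))
        return actions.view(-1, self.horizon, self.action_dim)
```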
And the way that the model actually makes decisions is it reads in the sensory information from the robot.
It does some internal processing and that could involve actually outputting intermediate steps.
Like you might tell it "clean up the kitchen," and it might think to itself, hey, to clean up the kitchen, I need to pick up the dish and I need to pick up the sponge and I need to put this here and this there.
And then eventually it works its way through that chain of thought generation down to the action expert, which actually produces continuous actions.
And that has to be a different module because the actions are continuous.
They're high frequency, so they have a different data format than text tokens.
But structurally, it's still an end-to-end transformer.
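To make the flow of a single decision concrete, here is a hypothetical control loop. The method names (generate_reasoning, sample_actions) and the robot interface are assumed for illustration; they are not a real API.

```python
# Hypothetical loop: perception in, optional intermediate reasoning, continuous actions out.
def run_step(vla, robot, instruction="clean up the kitchen"):
    obs = robot.get_observation()          # camera images + proprioception
    # The backbone may first emit intermediate text, e.g. a subtask plan:
    # "pick up the dish", "pick up the sponge", ...
    plan = vla.generate_reasoning(obs, instruction)
    # The action expert then decodes a chunk of continuous, high-frequency
    # actions (e.g. 50 steps of joint targets) conditioned on that context.
    action_chunk = vla.sample_actions(obs, instruction, plan)
    for action in action_chunk:
        robot.send_command(action)
```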
And roughly speaking, technically, it corresponds to a kind of mixture of experts architecture.
That's right.
With the exception that the actions are actually not represented as discrete tokens.
It actually uses flow matching, a kind of diffusion approach, because the actions are continuous and you need to be very precise with your actions for dexterous control.
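For intuition, here is a toy sketch of how a flow-matching action expert could sample a continuous action chunk: start from Gaussian noise and integrate a learned velocity field toward the data. The step count, shapes, and the velocity_field signature are all assumed for illustration.

```python
# Toy flow-matching sampler for the action expert (illustrative assumptions throughout).
import torch

def sample_action_chunk(velocity_field, context, horizon=50, action_dim=14, steps=10):
    a = torch.randn(horizon, action_dim)       # start from noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        v = velocity_field(a, t, context)      # predicted velocity toward real actions
        a = a + dt * v                         # Euler step along the learned flow
    return a                                   # continuous actions, no tokenization needed
```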
Yeah, so one theme here that I think is important to keep in mind is that
The reason that those building blocks are so valuable is because the AI community has gotten a lot better at leveraging prior knowledge.
And a lot of what we're getting from the pre-trained LLMs and VLMs is prior knowledge about the world.
And it's a little bit abstracted knowledge.
You can identify objects.
You can figure out roughly where things are in an image, that sort of thing.