Sergey Levine
So to give you a bit of a fanciful brain analogy: a VLM, a vision-language model, is basically an LLM that has had a little pseudo visual cortex grafted onto it, a vision encoder, right?
So our models, they have a vision encoder, but they also have an action expert, an action decoder essentially.
So it has like a little visual cortex and notionally a little motor cortex.
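As a rough illustration of that structure, here is a minimal sketch in PyTorch. The module names, sizes, and the simple regression head standing in for the action expert are illustrative assumptions, not the actual model.

```python
# Minimal sketch of the architecture described above (illustrative, not the real implementation).
import torch
import torch.nn as nn

class VLASketch(nn.Module):
    def __init__(self, d_model=1024, action_dim=14, horizon=50):
        super().__init__()
        # "pseudo visual cortex": maps camera images to token embeddings
        self.vision_encoder = nn.Conv2d(3, 64, kernel_size=16, stride=16)  # crude patchifier
        self.vision_proj = nn.Linear(64, d_model)
        # language backbone that does the intermediate reasoning over vision + text tokens
        self.language_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # "motor cortex": a separate set of weights (the action expert) that reads the
        # backbone's features and emits a chunk of continuous actions; simplified here
        # as a direct regression head (the real model uses flow matching, see below)
        self.action_expert = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Linear(d_model, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, image, text_embeddings):
        # image: (B, 3, H, W); text_embeddings: (B, T, d_model)
        vis = self.vision_encoder(image).flatten(2).transpose(1, 2)  # (B, N, 64)
        tokens = torch.cat([self.vision_proj(vis), text_embeddings], dim=1)
        features = self.language_backbone(tokens)
        # pool and decode into a (horizon, action_dim) continuous action chunk
        actions = self.action_expert(features.mean(dim=1))
        return actions.view(-1, self.horizon, self.action_dim)
```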
And the way that the model actually makes decisions is it reads in the sensory information from the robot.
It does some internal processing and that could involve actually outputting intermediate steps.
Like you might tell it "clean up the kitchen," and it might think to itself, hey, to clean up the kitchen, I need to pick up the dish and I need to pick up the sponge and I need to put this here and this there.
And then eventually it works its way through that chain of thought generation down to the action expert, which actually produces continuous actions.
And that has to be a different module because the actions are continuous.
They're high frequency, so they have a different data format than text tokens.
But structurally, it's still an end-to-end transformer.
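To make the flow of a single decision concrete, here is a hypothetical control loop. The method names (generate_reasoning, sample_actions) and the robot interface are assumed for illustration; they are not a real API.

```python
# Hypothetical loop: perception in, optional intermediate reasoning, continuous actions out.
def run_step(vla, robot, instruction="clean up the kitchen"):
    obs = robot.get_observation()          # camera images + proprioception
    # The backbone may first emit intermediate text, e.g. a subtask plan:
    # "pick up the dish", "pick up the sponge", ...
    plan = vla.generate_reasoning(obs, instruction)
    # The action expert then decodes a chunk of continuous, high-frequency
    # actions (e.g. 50 steps of joint targets) conditioned on that context.
    action_chunk = vla.sample_actions(obs, instruction, plan)
    for action in action_chunk:
        robot.send_command(action)
```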
And roughly speaking, technically, it corresponds to a kind of mixture of experts architecture.
That's right.
With the exception that the actions are actually not represented as discrete tokens.
It actually uses flow matching, a kind of diffusion approach, because the actions are continuous and you need to be very precise with your actions for dexterous control.
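For intuition, here is a toy sketch of how a flow-matching action expert could sample a continuous action chunk: start from Gaussian noise and integrate a learned velocity field toward the data. The step count, shapes, and the velocity_field signature are all assumed for illustration.

```python
# Toy flow-matching sampler for the action expert (illustrative assumptions throughout).
import torch

def sample_action_chunk(velocity_field, context, horizon=50, action_dim=14, steps=10):
    a = torch.randn(horizon, action_dim)       # start from noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        v = velocity_field(a, t, context)      # predicted velocity toward real actions
        a = a + dt * v                         # Euler step along the learned flow
    return a                                   # continuous actions, no tokenization needed
```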
Yeah, so one theme here that I think is important to keep in mind is that
The reason that those building blocks are so valuable is because the AI community has gotten a lot better at leveraging prior knowledge.
And a lot of what we're getting from the pre-trained LLMs and VLMs is prior knowledge about the world.
And it's a little bit abstracted knowledge.
You can identify objects.
You can figure out roughly where things are in an image, that sort of thing.